timeout – Ralph's TechBlog

What the heck? The latest upgrade procedure of my Kubernetes cluster gave me headaches. Not only because it failed with a timeout – mainly because the root cause was not obvious. In fact, the maintainers of Kubernetes made an infrastructure change long time ago but forgot to properly communicate to their users.

But before we start the rant, let’s check what happened – I tried to upgrade from v1.18.2 to v1.18.14. This happened:

timed out waiting for the condition
couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced
k8s.io/kubernetes/cmd/kubeadm/app/phases/upgrade.rollbackOldManifests

So I started to re-run the upgrade with verbosity on. Nothing more information. What I saw was that the kube-apiserver won’t come up – no log file gave a reason why this could happen. I asked Google – very little information but one hint – the image pull could have been failing.

Another search revealed that Kubernetes maintainers changed their repository from gcr.io/google_containers to k8s.gcr.io – presumingly long time ago. And checking my cluster more thoroughly I found out that the old repository was being used. But why was my cluster not knowing the new one? I was upgrading each major version – since the beginning.

Next search for the information on how to change it – nothing on Kubernetes docs (WTF!) but in some change request. You need to see the kubeadm-config ConfigMap in your kube-system namespace. There you’ll find the repository address. Changing this to the correct name finally did the trick and the upgrade succeeded.

But the more I think about this challenge, the more angry I get.

How can such an essential change not be communicated more prominently – especially since the old repository was abandoned with v1.18.6 – the last image version in the old repo. Every upgrade document sind 1.18 must have a warning that the old repo is out of order now and a link to the change procedure
Why is the error message not telling anything useful? The stacktrace is useless for the information about what happened.
And why – for God’s sake – does the upgrade procedure itself not check for this essential change? Especially since v1.18.7.

This way of maintaining software is a very unprofessional one. Kubernetes is the foundation of so many productive systems now that this essential change must be taken more seriously by the maintainers. Breaking the procedures is a danger to all these systems and a proper communication or risk mitigation is not in place.

I need to stress out that upgrading Kubernetes is always risky. I experienced so many issues in the past that blocked an upgrade. Most of them were better documented and so I could resolve them. But this infrastructure change is a sign of unprofessional risk management. And I hope they will do much better next time.