Kubernetes Upgrade fails with timeout

What the heck? The latest upgrade of my Kubernetes cluster gave me headaches. Not only because it failed with a timeout – mainly because the root cause was not obvious. In fact, the maintainers of Kubernetes made an infrastructure change a long time ago but forgot to properly communicate it to their users.

But before we start the rant, let’s check what happened – I tried to upgrade from v1.18.2 to v1.18.14. This happened:

timed out waiting for the condition
couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced
k8s.io/kubernetes/cmd/kubeadm/app/phases/upgrade.rollbackOldManifests

So I re-ran the upgrade with verbose output. It gave no additional information. What I saw was that the kube-apiserver wouldn’t come up – no log file gave a reason why. I asked Google – very little information, but one hint: the image pull could have been failing.

Another search revealed that the Kubernetes maintainers changed their image repository from gcr.io/google_containers to k8s.gcr.io – presumably a long time ago. Checking my cluster more thoroughly, I found that it was still using the old repository. But why did my cluster not know about the new one? I had been upgrading through each major version – since the beginning.

My next search was for information on how to change it – nothing in the Kubernetes docs (WTF!), only in some change request. You need to look at the kubeadm-config ConfigMap in your kube-system namespace. There you’ll find the repository address. Changing it to the correct name finally did the trick, and the upgrade succeeded.
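As a rough illustration of that fix, the sketch below rewrites the repository line in the ClusterConfiguration text stored in the ConfigMap. It assumes the setting appears as a single `imageRepository:` line; the sample YAML and the function name `rewrite_image_repository` are made up for this example and are not taken from my cluster. On a real cluster the edit itself would be made with something like `kubectl -n kube-system edit configmap kubeadm-config`.

```python
import re

OLD_REPO = "gcr.io/google_containers"
NEW_REPO = "k8s.gcr.io"


def rewrite_image_repository(cluster_config: str, old: str = OLD_REPO, new: str = NEW_REPO) -> str:
    """Return cluster_config with the imageRepository value changed from old to new.

    Only a line of the form 'imageRepository: <old>' is touched;
    everything else is returned unchanged.
    """
    pattern = re.compile(
        r"^(?P<indent>[ \t]*)imageRepository:[ \t]*" + re.escape(old) + r"[ \t]*$",
        re.MULTILINE,
    )
    # Keep the original indentation so the YAML structure stays intact.
    return pattern.sub(lambda m: f"{m.group('indent')}imageRepository: {new}", cluster_config)


# Hypothetical excerpt of the ClusterConfiguration held in kubeadm-config.
SAMPLE = """\
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
imageRepository: gcr.io/google_containers
kubernetesVersion: v1.18.2
"""

if __name__ == "__main__":
    print(rewrite_image_repository(SAMPLE))
```

The match is deliberately narrow: a cluster that points at a private mirror is left alone, since only the known old address is replaced.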

But the more I think about this challenge, the angrier I get.

How can such an essential change not be communicated more prominently – especially since the old repository was abandoned with v1.18.6, the last image version in the old repo? Every upgrade document since 1.18 must carry a warning that the old repo is out of order now, plus a link to the change procedure.
Why does the error message not tell me anything useful? The stack trace says nothing about what actually happened.
And why – for God’s sake – does the upgrade procedure itself not check for this essential change? Especially since v1.18.7.
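A check of the kind asked for in that last question could be quite small. The sketch below is only an illustration and not kubeadm code: it relies on the facts stated above (old repository gcr.io/google_containers, last image there v1.18.6), and the names `parse_version` and `check_repository` are hypothetical.

```python
from typing import Optional, Tuple

OLD_REPO = "gcr.io/google_containers"
LAST_VERSION_IN_OLD_REPO = (1, 18, 6)  # as stated above: last image version in the old repo


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn 'v1.18.14' into (1, 18, 14) so versions compare numerically."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def check_repository(image_repository: str, target_version: str) -> Optional[str]:
    """Return a warning if the target version cannot exist in the configured repo."""
    if image_repository.rstrip("/") != OLD_REPO:
        return None  # some other repository; nothing to say about it here
    if parse_version(target_version) <= LAST_VERSION_IN_OLD_REPO:
        return None  # the old repo still carries this version
    return (
        f"imageRepository is {OLD_REPO}, which has no images for {target_version}; "
        "change it in the kubeadm-config ConfigMap before upgrading"
    )


if __name__ == "__main__":
    print(check_repository("gcr.io/google_containers", "v1.18.14"))
```

Versions are compared as integer tuples rather than strings, because a plain string comparison would rank v1.18.14 below v1.18.6.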

This is a very unprofessional way of maintaining software. Kubernetes is the foundation of so many production systems now that a change this essential must be taken more seriously by the maintainers. Breaking the procedures endangers all these systems, and neither proper communication nor risk mitigation is in place.

I need to stress that upgrading Kubernetes is always risky. I have experienced many issues in the past that blocked an upgrade. Most of them were better documented, so I could resolve them. But this infrastructure change is a sign of unprofessional risk management. And I hope they will do much better next time.

I have noticed a new ABEND category for possible catastrophe during maintenance. Basically, ideas that never would have made it through change control, those gutless bastards who never get real work done.

Server or network death by misadventure?
In the United Kingdom, death by misadventure is the recorded manner of death for an accidental death caused by a risk taken voluntarily.[1]

Misadventure in English law, as recorded by coroners and on death certificates and associated documents, is a death that is primarily attributed to an accident that occurred due to a risk that was taken voluntarily. In contrast, when the manner of death is given as an accident, the coroner has determined that the decedent had taken no unreasonable willful risk.[1]

“Misadventure may be the right conclusion when a death arises from some deliberate human act which unexpectedly and unintentionally goes wrong.”[2]

Legally defined manner of death: a way by which an actual cause of death (trauma, exposure, etc.) was allowed to occur. For example, a death caused by an illicit drug overdose may be ruled a death by misadventure, as the user took the risk of drug usage voluntarily. Misadventure is a form of unnatural death, a category that also includes accidental death, suicide, and homicide.[1]

In the case of R v Wolverhampton Coroner,[3] it was held that the coroner must establish death by misadventure on the balance of probabilities, commonly known as “more likely than not”. This is opposed to beyond reasonable doubt, which is used elsewhere.[4]

One reply on “Kubernetes Upgrade fails with timeout”
