
Kubernetes Upgrade fails with timeout

What the heck? The latest upgrade of my Kubernetes cluster gave me headaches – not only because it failed with a timeout, but mainly because the root cause was not obvious. In fact, the Kubernetes maintainers made an infrastructure change a long time ago but forgot to communicate it properly to their users.

But before we start the rant, let’s look at what happened. I tried to upgrade from v1.18.2 to v1.18.14 and got this:

timed out waiting for the condition
couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced
k8s.io/kubernetes/cmd/kubeadm/app/phases/upgrade.rollbackOldManifests

So I re-ran the upgrade with verbosity turned on. No additional information. What I did see was that the kube-apiserver would not come up – and no log file gave a reason why. I asked Google – very little information, but one hint: the image pull could have been failing.
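
For reference, the attempt looked roughly like this (a sketch, not a transcript; --v raises kubeadm’s log verbosity):

# shows the versions the cluster can be upgraded to
kubeadm upgrade plan
# the actual upgrade, re-run with increased log verbosity
kubeadm upgrade apply v1.18.14 --v=5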

Another search revealed that the Kubernetes maintainers had changed their repository from gcr.io/google_containers to k8s.gcr.io – presumably a long time ago. Checking my cluster more thoroughly, I found that the old repository was still being used. But why did my cluster not know about the new one? I had been upgrading through every version since the beginning.
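
To see which registry the control-plane pods actually pull from, something like the following works (a sketch – it simply lists the image references of all pods in kube-system):

kubectl -n kube-system get pods \
  -o jsonpath='{range .items[*]}{.spec.containers[*].image}{"\n"}{end}' | sort -u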

The next search was for information on how to change it – nothing in the Kubernetes docs (WTF!), only in some change request. You need to look at the kubeadm-config ConfigMap in the kube-system namespace; there you will find the repository address. Changing it to the correct name finally did the trick and the upgrade succeeded.
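
In concrete terms, the check and the fix look roughly like this – the relevant key is imageRepository inside the ClusterConfiguration stored in that ConfigMap:

# show the current setting (in my case the old gcr.io/google_containers was set explicitly)
kubectl -n kube-system get configmap kubeadm-config -o yaml | grep imageRepository
# open the ConfigMap in an editor and change the value to k8s.gcr.io
kubectl -n kube-system edit configmap kubeadm-config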

But the more I think about this challenge, the more angry I get.

  1. How can such an essential change not be communicated more prominently – especially since the old repository was abandoned with v1.18.6, the last image version published there? Every upgrade document since 1.18 should carry a warning that the old repo is out of order and a link to the change procedure.
  2. Why does the error message not say anything useful? The stack trace tells you nothing about what actually happened.
  3. And why – for God’s sake – does the upgrade procedure itself not check for this essential change, especially since v1.18.7?

This way of maintaining software is very unprofessional. Kubernetes is the foundation of so many production systems by now that such an essential change must be taken more seriously by the maintainers. Breaking the upgrade procedure endangers all of these systems, and neither proper communication nor risk mitigation was in place.

I need to stress that upgrading Kubernetes is always risky. I have run into many issues in the past that blocked an upgrade; most of them were better documented, so I could resolve them. But this infrastructure change is a sign of unprofessional risk management, and I hope they will do much better next time.


Kubernetes Service names in HELM templates

Building on my previous post, here is a snippet that produces the full DNS name of a service in the same namespace of the cluster.

{{/*
Builds a full in-cluster hostname from the given string unless it already is one or is an IP address.
Appends ".<namespace>.svc.cluster.local" and prefixes the release name if the release is not named after the chart.
Please note that you need to call this template with (dict "Context" . "Value" "your-value")
*/}}
{{- define "prefix.serviceName" -}}
{{- if include "prefix.isIpAddress" .Value }}
    {{- print .Value }}
{{- else -}}
    {{- $parts := splitList "." .Value -}}
    {{- if gt (len $parts) 1 -}}
        {{- print .Value }}
    {{- else -}}
        {{- if eq .Context.Chart.Name .Context.Release.Name -}}
            {{- printf "%s.%s.svc.cluster.local" .Value .Context.Release.Namespace }}
        {{- else -}}
            {{- printf "%s-%s.%s.svc.cluster.local" .Context.Release.Name .Value .Context.Release.Namespace }}
        {{- end -}}
    {{- end -}}
{{- end -}}
{{- end -}}

Please note that using the template is a bit more cumbersome, because include only takes a single argument – so the root context and the value have to be wrapped into a dict:

serviceName-anIpAddress:  {{ include "prefix.serviceName" (dict "Context" . "Value" "1.2.3.4") }}
serviceName-anIpAddress2: {{ include "prefix.serviceName" (dict "Context" . "Value" "1.0.3.4") }}
serviceName-NoIpAddress:  {{ include "prefix.serviceName" (dict "Context" . "Value" "1.2.3.4.5") }}
serviceName-NoIpAddress2: {{ include "prefix.serviceName" (dict "Context" . "Value" "hello") }}
serviceName-NoIpAddress3: {{ include "prefix.serviceName" (dict "Context" . "Value" "hello.svc") }}
serviceName-NoIpAddress4: {{ include "prefix.serviceName" (dict "Context" . "Value" "hello.svc.cluster.local") }}
serviceName-NoIpAddress5: {{ include "prefix.serviceName" (dict "Context" . "Value" "1") }}
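
For illustration – assuming a chart and a release both named myservice, deployed into the default namespace – these lines should render roughly as follows (derived from the template logic above, not copied from a real run):

serviceName-anIpAddress:  1.2.3.4
serviceName-anIpAddress2: 1.0.3.4
serviceName-NoIpAddress:  1.2.3.4.5
serviceName-NoIpAddress2: hello.default.svc.cluster.local
serviceName-NoIpAddress3: hello.svc
serviceName-NoIpAddress4: hello.svc.cluster.local
serviceName-NoIpAddress5: 1.default.svc.cluster.local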

The template needs access to the root context, so the dict function is used to pass both the context and the plain service name.

Feel free to adjust the template if you need to pass another namespace as an argument.


HELM template to detect IP address

I needed to detect whether the content of a variable is an IP address or not. The function is probably not perfect, but it fulfills the basic need:

{{/*
Test if the given value is an IPv4 address.
Returns the value itself if it looks like one, or an empty string otherwise.
*/}}
{{- define "prefix.isIpAddress" -}}
{{- $rc := . -}}
{{- $parts := splitList "." . -}}
{{- if eq (len $parts) 4 -}}
    {{- range $parts -}}
        {{- if and (not (atoi .)) (ne . "0") -}}
            {{- $rc = "" -}}
        {{- end -}}
    {{- end -}}
{{- else -}}
    {{- $rc = "" -}}
{{- end -}}
{{- print $rc }}
{{- end -}}

The function at least detects these values correctly:

{{ include "prefix.isIpAddress" "1.2.3.4" }}
{{ include "prefix.isIpAddress" "1.0.3.4" }}
{{ include "prefix.isIpAddress" "1.2.3.4.5" }}
{{ include "prefix.isIpAddress" "hello" }}
{{ include "prefix.isIpAddress" "hello.svc" }}
{{ include "prefix.isIpAddress" "hello.svc.tld.com" }}
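
The first two values render as themselves (a non-empty, truthy result), while the remaining four render as an empty string – which is what makes the template usable directly inside an if, as in the serviceName helper above. Roughly:

1.2.3.4           -> 1.2.3.4
1.0.3.4           -> 1.0.3.4
1.2.3.4.5         -> ""   (five parts)
hello             -> ""   (not numeric)
hello.svc         -> ""   (only two parts)
hello.svc.tld.com -> ""   (four parts, but not numeric)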

IPv6 with Kubernetes

Awwww – so much work I had put into setting up a Kubernetes cluster (this blog will run there in a few days). I set up the pods and containers, cron jobs, services, and so on. Then I started renewing my SSL certificates from LetsEncrypt. The renewal failed hilariously, with nothing but a weird error message:

Timeout

What? I can reach my websites. Did I miss something? I checked connectivity: the IP addresses were right, the ACME challenge directory was available as required by LetsEncrypt, and DNS was working properly. So why couldn’t the LetsEncrypt servers reach my cluster? I soon found out that they prefer IPv6 over IPv4, and I had both enabled. But the IPv6 connection failed – from everywhere. ping6, though, succeeded.
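
A quick way to reproduce what the LetsEncrypt servers see is to force curl onto each protocol separately (hostname and challenge file are placeholders here – use your own domain and an existing file):

# IPv4 path to the ACME challenge directory – worked
curl -4 -v http://example.com/.well-known/acme-challenge/test
# the same request over IPv6 – timed out
curl -6 -v http://example.com/.well-known/acme-challenge/test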

Further analysis revealed that Kubernetes is not able to expose IPv6 services at all (or at least not yet, as far as my research went). What should I do now? All my work was based on the assumption that both IPv4 and IPv6 would be available – but with Kubernetes they are not. Of course I could move my reverse proxy out of Kubernetes and put it in front of the cluster, but that would require more work, as all the automation scripts for LetsEncrypt would need to be rebuilt and tested again and again – let alone the disadvantage of no longer having everything self-contained in containers. There had to be another solution.

Luckily there was an easy solution: socat. It is a small Linux tool that copies network traffic from one socket to another. So it was set up easily with a systemd unit (sock_80.service):

[Unit]
Description=socat Service 80
After=network.target

[Service]
Type=simple
User=root
ExecStart=/usr/bin/socat -lf /var/log/socat80.log TCP6-LISTEN:80,reuseaddr,fork,bind=[ip6-address-goes-here] TCP4:ip4-address-goes-here:80
Restart=on-abort

[Install]
WantedBy=multi-user.target

That’s it. I enabled the unit (systemctl enable sock_80.service), reloaded systemd (systemctl daemon-reload), and started the service (systemctl start sock_80). Voilà! IPv6 traffic is now routed to IPv4. I repeated the whole thing for port 443 and the setup was done. The LetsEncrypt servers are happy too 🙂
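
Spelled out, the commands look roughly like this (sock_443.service being an assumed copy of the unit above with both ports changed to 443):

# pick up the new unit files, then enable and start them
systemctl daemon-reload
systemctl enable sock_80.service
systemctl start sock_80.service
systemctl enable sock_443.service
systemctl start sock_443.service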