# Understanding and Resolving Kubernetes Certificate Expiration in Kubespray
## Introduction: The Role of Certificates in Kubernetes
A Kubernetes cluster relies heavily on TLS certificates to secure communication between its various components. The API server, controller manager, scheduler, etcd, and kubelets all use certificates to authenticate and encrypt traffic. These certificates are issued with a specific validity period (usually one year) for security reasons. When they expire, components can no longer trust each other, leading to a cluster-wide failure.
The error message you encountered is a classic symptom of this problem:
```
E1116 13:47:01.271977 ... failed to verify certificate: x509: certificate has expired or is not yet valid
```
This indicates that `kubectl` (and other components) could not validate the certificate presented by the Kubernetes API server because the current date was past the certificate's expiration date.
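
If you want to confirm the diagnosis directly on the control plane node, both `kubeadm` and `openssl` can report the expiry dates. A minimal check, assuming the default kubeadm certificate layout under `/etc/kubernetes/pki` (older kubeadm releases use `kubeadm alpha certs check-expiration` instead):
```bash
# List expiry dates for every kubeadm-managed certificate.
sudo kubeadm certs check-expiration

# Or inspect the API server's serving certificate directly.
sudo openssl x509 -noout -enddate -in /etc/kubernetes/pki/apiserver.crt
```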
## The Core Problem: A "Chicken-and-Egg" Deadlock
When the certificates expired, the initial and correct instinct was to use Kubespray's provided automation to fix it. In your version of Kubespray, the `upgrade-cluster.yml` playbook is the designated tool for this job, as it includes tasks to regenerate certificates.
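
For reference, the attempted run looked roughly like the following; the inventory path is an assumption and must match how the Kubespray inventory is laid out in this repository:
```bash
# Illustrative invocation; adjust the inventory path and extra flags to your setup.
ansible-playbook -i inventory/mycluster/hosts.yaml \
  --become --become-user=root \
  upgrade-cluster.yml
```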
However, this approach led to a deadlock, manifesting as a timeout during the "Create kubeadm token for joining nodes" task. Here's a breakdown of why this happened:
1. **API Server is Down:** The primary certificate for the Kubernetes API server (`apiserver.crt`) had expired. This prevented the API server from starting correctly and serving traffic on its secure port (6443).
2. **Playbook Needs the API Server:** The `upgrade-cluster.yml` playbook, specifically the `kubeadm` tasks within it, needs to communicate with a healthy Kubernetes API server to perform its functions. To create a join token for other nodes, `kubeadm` must make a request to the API server.
3. **The Deadlock:** The playbook was trying to connect to the API server to fix the certificates, but it couldn't connect precisely *because* the certificates were already expired and the API server was unhealthy. This created a "chicken-and-egg" scenario where the automated solution couldn't run because the problem it was meant to fix was preventing it from running.
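
You can see this dependency by hand: the join-token step boils down to the same kind of call you could run yourself, and that call has nothing to talk to while the control plane is unhealthy. A rough illustration (the hostname is your master node; `curl -k` only skips certificate verification, it cannot help if nothing is listening):
```bash
# The kind of call the playbook's token task ultimately relies on; it reads
# /etc/kubernetes/admin.conf and needs a healthy, trusted API server.
sudo kubeadm token create

# A blunt reachability check against the API server's secure port. If nothing
# answers here, the control plane is down regardless of certificates.
curl -k https://haumdaucher:6443/healthz
```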
## The Solution: Manual Intervention on the Control Plane
To break this deadlock, we had to manually restore the core health of the control plane on the master node (`haumdaucher`) *before* letting the automation take over again. The process involved SSHing into the master node and using the `kubeadm` command-line tool to regenerate the essential certificates and configuration files.
Here is a detailed look at the commands executed and why they were necessary; a consolidated sketch of the whole sequence follows the list:
1. **`sudo -i`**
    * **What it does:** Switches to the `root` user.
    * **Why:** Modifying files in `/etc/kubernetes/` requires root privileges.
2. **`mv /etc/kubernetes/pki /etc/kubernetes/pki.backup-...`**
    * **What it does:** Backs up the directory containing all the existing (and expired) cluster certificates.
    * **Why:** This is a critical safety measure. If the manual renewal process failed, we could restore the original state to diagnose the problem further.
3. **`kubeadm init phase certs all`**
    * **What it does:** This is the core of the manual fix. It tells `kubeadm` to execute *only* the certificate generation phase of the cluster initialization process. It creates a new Certificate Authority (CA) and uses it to sign a fresh set of certificates for all control plane components (API server, controller-manager, scheduler, etcd).
    * **Why:** This directly replaces the expired certificates with new, valid ones, allowing the API server and other components to trust each other again.
4. **`mv /etc/kubernetes/*.conf /etc/kubernetes/*.conf.backup-...`**
    * **What it does:** Backs up the kubeconfig files used by the administrator (`admin.conf`) and the control plane components.
    * **Why:** These files contain embedded client certificates and keys for authentication. Since we just created new certificates, we also need to generate new kubeconfig files that use them. Backing them up is a standard precaution.
5. **`kubeadm init phase kubeconfig all`**
    * **What it does:** This command generates new kubeconfig files (`admin.conf`, `kubelet.conf`, etc.) that use the newly created certificates from the previous step.
    * **Why:** This ensures that all components, as well as the administrator using `kubectl` on the node, can successfully authenticate to the now-healthy API server.
6. **`systemctl restart kubelet`**
    * **What it does:** Restarts the kubelet service on the master node.
    * **Why:** The kubelet is the agent that runs on each node and is responsible for managing pods. It needs to be restarted to load its new configuration (`kubelet.conf`) and re-establish a secure connection to the API server using the new certificates.
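
Put together, the recovery is roughly the script below. It is a sketch rather than a transcript of the exact session: the timestamp suffix is illustrative, and the paths assume the default kubeadm layout under `/etc/kubernetes`. If your cluster uses non-default SANs or service addresses, `kubeadm` may additionally need its `--config` file here; the steps above worked without it.
```bash
#!/usr/bin/env bash
# Sketch of the manual recovery on the master node; run as root.
set -euo pipefail

ts=$(date +%Y%m%d-%H%M%S)   # illustrative backup suffix

# 1. Back up the expired PKI material.
mv /etc/kubernetes/pki "/etc/kubernetes/pki.backup-${ts}"

# 2. Regenerate the CA and a fresh set of control plane certificates.
kubeadm init phase certs all

# 3. Back up the old kubeconfig files (they embed the old client certificates).
for f in /etc/kubernetes/*.conf; do
  mv "$f" "${f}.backup-${ts}"
done

# 4. Regenerate admin.conf, kubelet.conf, controller-manager.conf and scheduler.conf.
kubeadm init phase kubeconfig all

# 5. Restart the kubelet so it picks up the new kubelet.conf.
systemctl restart kubelet

# 6. Give the control plane a moment, then verify it answers with the new certificates.
sleep 30
kubectl --kubeconfig /etc/kubernetes/admin.conf get nodes
```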
## Why the Solution Works
By performing these manual steps, we effectively gave the control plane a "jump-start." We manually created the valid certificates and configuration files needed for the API server to start up successfully.
Once the API server was healthy and listening, the `upgrade-cluster.yml` playbook could be re-run. This time, when the `kubeadm` tasks within the playbook tried to connect to the API server to create join tokens, the connection succeeded. The playbook was then able to complete its remaining tasks, ensuring all nodes in the cluster were properly configured and joined.
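
A quick way to confirm the end state, assuming the freshly generated `admin.conf` has been copied into the kubeconfig you use on your workstation (the old one embeds a client certificate signed by the old CA and will no longer work):
```bash
# All nodes should report Ready, and the control plane pods should be Running.
kubectl get nodes -o wide
kubectl get pods -n kube-system
```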
## Future Prevention and Best Practices
1. **Monitor Certificate Expiration:** Use tools like `kubeadm certs check-expiration` or monitoring solutions (e.g., Prometheus with a certificate-expiry exporter) to track certificate expiry dates proactively. A minimal cron-style check is sketched after this list.
2. **Consider Upgrading Kubespray:** Newer versions of Kubespray may include a dedicated `renew-certs.yml` playbook. This playbook is designed for certificate rotation specifically and is less disruptive than a full `upgrade-cluster.yml`, as it typically avoids rotating service account keys.
3. **Understand the Manual Process:** Keeping this guide handy will allow you to quickly resolve similar deadlocks in the future without extensive troubleshooting. The key is recognizing that automation sometimes needs a manual boost when the system it's trying to fix is too broken to respond.
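
As a starting point for item 1, a small cron-style check on each control plane node can warn before anything expires. This is a minimal sketch using `openssl x509 -checkend`; the 30-day threshold, the certificate glob, and the alerting mechanism (here just a line on stderr) are placeholders to adapt:
```bash
#!/usr/bin/env bash
# Warn if any certificate under /etc/kubernetes/pki expires within 30 days.
# Extend the glob if certificates live elsewhere (e.g. a separate etcd PKI directory).
set -u
shopt -s nullglob
threshold_seconds=$((30 * 24 * 3600))

for cert in /etc/kubernetes/pki/*.crt /etc/kubernetes/pki/etcd/*.crt; do
  if ! openssl x509 -checkend "${threshold_seconds}" -noout -in "${cert}" >/dev/null; then
    echo "WARNING: ${cert} expires within 30 days" >&2
  fi
done
```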