In this blog, I will explain how to perform Crash Loop Detection in Kubernetes and ensure that your deployment is stable before you start routing live traffic to your new service version.
During the last few months, I have been working on a dockerized Python library that orchestrates the deployment of microservices to a Kubernetes/Istio service mesh. The deployment process is driven by a single manifest file which contains the details of the deployment along with the various metrics and thresholds. We deploy the container with the new service version to the Kubernetes cluster. After that, we run crash loop detection to ensure that all the pods are in a running state. Once we determine that the service is stable, we apply routing policies and send some live traffic to the new service version (Canary Release). At this point, we start running error and performance validations against the new service version using Prometheus.
If the numbers exceed the defined thresholds, we abort the deployment. If instead the validations run successfully for a period of time without any issues, we promote the deployment to the next stage, where we send more live traffic to the new service version. After all stages have run, we make the new version the live version of the service and clean up the old one. This automated deployment strategy provides faster feedback to development teams, reduces manual testing effort, and makes telemetry and monitoring part of the CI/CD pipeline itself.
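The promote-or-abort decision at the end of each validation window can be sketched as a small pure function. The name next_action and the single error-rate metric are illustrative assumptions for this sketch, not the library's actual API:

```python
def next_action(error_rate, threshold, stages_remaining):
    """Decide the canary's fate after one validation window.

    Abort if the observed error rate (e.g. queried from Prometheus)
    exceeds the threshold; otherwise promote to the next traffic stage,
    or go live once no stages remain.
    """
    if error_rate > threshold:
        return "abort"
    return "promote" if stages_remaining > 0 else "go-live"

print(next_action(0.02, 0.01, 2))   # threshold exceeded: abort
print(next_action(0.005, 0.01, 3))  # healthy, stages left: promote
print(next_action(0.005, 0.01, 0))  # healthy, last stage: go-live
```

In the real pipeline this decision would run once per stage, with the thresholds coming from the manifest file described above.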
For the purpose of this blog, I will focus on how you can validate your K8s deployment by doing a Crash Loop detection.
In my earlier blog, I showed you how to iterate through all the pods in a namespace and verify that they are in a 'Running' status:
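With the Kubernetes Python client, that iteration boils down to a check like the one below. The helper name is mine, and stand-in objects are used in place of real pods so the sketch runs anywhere; in a live cluster the pods would come from kubernetes.client.CoreV1Api().list_namespaced_pod(namespace).items:

```python
from types import SimpleNamespace as NS

def all_pods_running(pods):
    """True only if every pod reports the 'Running' phase.

    Each pod mirrors the client's V1Pod shape: pod.status.phase holds
    the lifecycle phase string.
    """
    return all(pod.status.phase == "Running" for pod in pods)

# Stand-in pods: one healthy, one still coming up.
pods = [NS(status=NS(phase="Running")),
        NS(status=NS(phase="Pending"))]
print(all_pods_running(pods))  # False: one pod is still Pending
```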
Consider a scenario where your service crashes and Kubernetes (the Replication Controller) automatically restarts it to maintain the desired number of replicas. If your service keeps crashing continuously, it is in a crash loop. Kubernetes surfaces this with a pod status of 'CrashLoopBackOff'.
If you see that the pods are consistently crashing after a deployment, then you have an issue on your hands that needs to be investigated and resolved.
There are multiple reasons why this can happen —
- There might be an issue with the K8s Deployment.
- You might have passed in incorrect parameters while creating your deployment.
- Kubernetes is not able to pull the application image from the repository.
- There might be issues at the node/kubelet level, such as OutOfDisk, MemoryPressure, DiskPressure, or insufficient CPU.
Look at the image below, which shows pods in the 'ErrImagePull' and 'ImagePullBackOff' states. You can easily recreate this scenario by providing an incorrect image path while deploying. Obviously, you do not want to continue with your deployment process if the pods are not in a good state.
To troubleshoot such issues, you can run a 'kubectl describe pod' command and look at the error events. In the above case, the events stated:
Failed to pull image. Image does not exist or no pull access.
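The same waiting reasons that 'kubectl describe pod' surfaces are also available programmatically from the pod's container statuses. A sketch (the helper name is mine; the stand-in object mirrors the client's V1Pod shape):

```python
from types import SimpleNamespace as NS

def waiting_reasons(pod):
    """Collect container waiting reasons such as 'ErrImagePull' or
    'ImagePullBackOff' from a pod's container statuses."""
    reasons = []
    for cs in (pod.status.container_statuses or []):
        if cs.state.waiting is not None:
            reasons.append(cs.state.waiting.reason)
    return reasons

# A pod stuck pulling a bad image path:
bad_pod = NS(status=NS(container_statuses=[
    NS(state=NS(waiting=NS(reason="ErrImagePull")))]))
print(waiting_reasons(bad_pod))  # ['ErrImagePull']
```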
I have written a Python module that validates your deployment by looking at the pod status and determining whether it is in a crash loop:
- Takes the service name, version, namespace, delay, and waitperiod as arguments.
- Uses the Python API Client for Kubernetes to interact with the K8s cluster and determine the pod status. It fetches the pods within the namespace using the list_namespaced_pod method and filters them by the application-name label.
- Loops over the pods and fetches each pod's status using the read_namespaced_pod_status method.
- Uses multiple signals to detect a crash loop:
  - If the pod restart count is greater than 0, it immediately sets the crashloopdetected flag to True and exits, indicating that a crash loop has been detected.
  - Next, it looks at the pod status and verifies that it is in the expected Running state.
  - A pod might take some time to reach the Running state. If the pod is still Initializing, we want to wait for it to come up.
- To determine the root cause of the crash loop, it captures the waiting reason of the container running inside the pod. When I ran into issues with the K8s cluster earlier, this field gave the most accurate indication of the actual cause of the pod failure. You can add more diagnostics as needed.
- A WaitPeriod of, say, 60 seconds means the module waits up to 60 seconds for the pod to reach a running state; if the pod gets there sooner, it exits immediately.
- A Delay of, say, 10 seconds means the module checks the pod status every 10 seconds to verify whether it is in the Running state.
- Returns a boolean value indicating whether a crash loop was detected.
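Putting the pieces together, the polling logic can be sketched roughly as below. This is not the module itself: the cluster call is injected as get_pods (in the real module this is where list_namespaced_pod / read_namespaced_pod_status would be called), so the loop can be exercised without a cluster.

```python
import time
from types import SimpleNamespace as NS

def detect_crash_loop(get_pods, delay=10, wait_period=60):
    """Poll until every pod is Running, a restart is seen, or
    wait_period elapses. Returns True if a crash loop was detected."""
    deadline = time.time() + wait_period
    while time.time() < deadline:
        all_running = True
        for pod in get_pods():
            for cs in (pod.status.container_statuses or []):
                if cs.restart_count > 0:
                    return True  # any restart: treat as a crash loop
                if cs.state.waiting is not None:
                    # Capture the waiting reason for diagnostics.
                    print("waiting:", cs.state.waiting.reason)
            if pod.status.phase != "Running":
                all_running = False  # still coming up; keep waiting
        if all_running:
            return False  # stable: exit immediately
        time.sleep(delay)
    return True  # never reached Running within wait_period

# Stand-in pod mirroring V1Pod fields; a healthy run returns False.
healthy = NS(status=NS(phase="Running", container_statuses=[
    NS(restart_count=0, state=NS(waiting=None))]))
print(detect_crash_loop(lambda: [healthy], delay=1, wait_period=5))  # False
```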
Please let me know if you have any questions; I would be happy to discuss them.
If you are interested in learning more about Kubernetes, do not forget to check out my other articles in the Kubernetes series —