
Kubernetes Troubleshooting 2026: Fixing CrashLoopBackOff and OOMKilled Errors



The 2026 Kubernetes Survival Guide: Debugging "Silent" Pod Failures in an AI-Driven World

In 2026, your pods aren't just crashing because of bad code; they are often crashing because an AI-orchestrator misconfigured a resource limit or an Agent pushed a breaking schema change.

If you see a pod stuck in CrashLoopBackOff or OOMKilled, don't panic. Here is the professional 2026 workflow to fix it.

1. The "Describe" Command (Your First Line of Defense)

Before you check logs, check the Events. Most "silent" failures (like mounting a missing secret) won't even show up in application logs.

kubectl describe pod <pod-name>

What to look for: Scroll to the bottom under Events. Look for "FailedMount," "FailedScheduling," or "Back-off restarting failed container."

2. Hunting the "Exit Code 137" (The Memory Killer)

If your pod was running and suddenly vanished, check the Last State.

If you see Exit Code 137 (128 + signal 9, SIGKILL), it means the process was force-killed, almost always by the Linux OOM (Out of Memory) killer enforcing the container's memory limit.

  • The 2026 Fix: Check if your AI Agent set the limits too low. In 2026, we recommend using a Vertical Pod Autoscaler (VPA) to let Kubernetes "right-size" the memory for you automatically.
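A VPA is defined by its own custom resource, so the VPA components must already be installed in your cluster. A minimal sketch (the Deployment name my-app is a placeholder):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app          # the workload whose pods VPA should right-size
  updatePolicy:
    updateMode: "Auto"    # VPA evicts pods and recreates them with updated requests
```

Start with updateMode: "Off" if you only want recommendations without automatic evictions.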

3. The "Previous" Log Trick

When a pod is in a crash loop, kubectl logs often shows nothing because the container is currently dead. You need to see why the last one died.

kubectl logs <pod-name> --previous

This is the single most forgotten command by junior DevOps engineers, but it's the only way to see the stack trace of a crashed container.

4. Debugging "ImagePullBackOff" in 2026

Since we are using more private registries and AI-generated images, this error is common.

  • The Quick Fix: Run kubectl get events -n <namespace> --sort-by='.lastTimestamp'.

  • Often, the issue is a typo in the imagePullSecrets or the AI agent tried to pull a v2.0 tag that hasn't finished its CI/CD build yet.
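If the secret is the suspect, verify it is referenced at the pod spec level (not inside the container block). A minimal sketch, where regcred and the registry path are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  imagePullSecrets:
    - name: regcred       # must match an existing docker-registry Secret in this namespace
  containers:
    - name: app
      image: registry.example.com/team/my-app:v2.0
```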

Reduce AWS Bills by 60%: Automate EC2 Stop/Start with Python & Lambda



In 2026, cloud bills have become a top expense for most tech companies. One of the biggest "money-wasters" is leaving Development and Staging servers running over the weekend or at night when no one is using them.

If you have a t3.large instance running 24/7, you're paying for 168 hours a week. By shutting it down outside of 9-to-5 weekday hours, you pay for only about 40 of those hours, roughly a 75% saving on that instance's compute bill.
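The arithmetic is worth sanity-checking before you build anything. A quick back-of-the-envelope sketch; the $0.0832/hour figure is an illustrative us-east-1 on-demand price for t3.large, not a quote, so plug in your own rate:

```python
HOURLY_RATE = 0.0832               # illustrative t3.large on-demand price (USD/hr)
HOURS_PER_WEEK = 24 * 7            # 168 hours if left running 24/7
OFFICE_HOURS_PER_WEEK = 8 * 5      # 9-to-5, Monday through Friday

always_on = HOURLY_RATE * HOURS_PER_WEEK
office_only = HOURLY_RATE * OFFICE_HOURS_PER_WEEK
savings_pct = 100 * (1 - office_only / always_on)

print(f"Always-on:   ${always_on:.2f}/week")
print(f"Office-only: ${office_only:.2f}/week")
print(f"Savings:     {savings_pct:.0f}%")
```

The ratio (about 76%) is independent of the hourly rate, which is why the savings hold for any instance size.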

The Solution: "The DevSecOps Auto-Stop"

We will use a small Python script running on AWS Lambda that looks for a specific tag (AutoStop: True, which we set up next) and shuts down any instance that shouldn't be running.

Step 1: Tag Your Instances

First, go to your AWS Console and add a tag to the instances you want to automate:

  • Key: AutoStop

  • Value: True

Step 2: The Python Script (Boto3)

Create a new AWS Lambda function and paste this code. It uses the boto3 library to talk to your EC2 instances.

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')  # Change to your region

def lambda_handler(event, context):
    # Search for all running instances with the 'AutoStop' tag
    filters = [
        {'Name': 'tag:AutoStop', 'Values': ['True']},
        {'Name': 'instance-state-name', 'Values': ['running']}
    ]
    instances = ec2.describe_instances(Filters=filters)
    instance_ids = []

    for reservation in instances['Reservations']:
        for instance in reservation['Instances']:
            instance_ids.append(instance['InstanceId'])

    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
        print(f"Successfully stopped instances: {instance_ids}")
    else:
        print("No running instances found with AutoStop=True tag.")


Step 3: Set the Schedule (EventBridge)

You don't want to run this manually.

  1. Go to Amazon EventBridge.

  2. Create a "Schedule."

  3. Use a Cron expression to trigger the Lambda at 7:00 PM (in UTC, unless you pick a time zone) every weekday evening:

    • cron(0 19 ? * MON-FRI *)
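To complete the stop/start pair from the title, you can deploy a second Lambda that restarts the tagged instances in the morning, scheduled with something like cron(0 7 ? * MON-FRI *). A sketch mirroring the stop script; the build_filters helper is introduced here for clarity and is not part of the original code:

```python
def build_filters(state):
    # Same tag convention as the stop script (AutoStop=True),
    # but filtered on a caller-supplied instance state.
    return [
        {'Name': 'tag:AutoStop', 'Values': ['True']},
        {'Name': 'instance-state-name', 'Values': [state]},
    ]

def lambda_handler(event, context):
    import boto3  # imported inside the handler so build_filters stays testable offline
    ec2 = boto3.client('ec2', region_name='us-east-1')  # Change to your region
    reservations = ec2.describe_instances(
        Filters=build_filters('stopped'))['Reservations']
    instance_ids = [inst['InstanceId']
                    for res in reservations
                    for inst in res['Instances']]
    if instance_ids:
        ec2.start_instances(InstanceIds=instance_ids)
        print(f"Successfully started instances: {instance_ids}")
```

Both functions need an IAM role allowing ec2:DescribeInstances plus ec2:StopInstances or ec2:StartInstances respectively.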

How to Fix Kubernetes CrashLoopBackOff: A Practical Guide



It’s the most famous (and frustrating) status in the Kubernetes world. You run kubectl get pods, and there it is: 0/1 CrashLoopBackOff.

Despite the scary name, CrashLoopBackOff isn’t actually the error—it’s Kubernetes telling you: "I tried to start your app, it died, I waited, and I’m about to try again."

Here is the "Triple Post" finale to get your cluster healthy before the weekend.


1. The "First 3" Commands

Before you start guessing, run these three commands in order. They tell you 90% of what you need to know.

  • kubectl describe pod <name>: Look at the Events section at the bottom. It often says why it failed (e.g., OOMKilled).

  • kubectl logs <name> --previous: Crucial. This shows the logs from the failed instance before it restarted.

  • kubectl get events --sort-by=.metadata.creationTimestamp: Shows a timeline of cluster-wide issues (like Node pressure).

2. The Usual Suspects

If the logs are empty (a common headache!), the issue is likely happening before the app even starts.

  • OOMKilled: Your container exceeded its memory limit.

    • Fix: Increase resources.limits.memory.

  • Config Errors: You referenced a Secret or ConfigMap that doesn't exist, or has a typo.

    • Fix: Check the describe pod output for "MountVolume.SetUp failed".

  • Permissions: Your app is trying to write to a directory it doesn't own (standard in hardened images).

    • Fix: Check your securityContext or Dockerfile USER permissions.

  • Liveness Probe Failure: Your app is actually running fine, but the probe is checking the wrong port.

    • Fix: Double-check livenessProbe.httpGet.port.
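All four fixes live in the pod spec. A sketch showing where each knob goes; the names, port, and memory size are placeholders to adapt:

```yaml
containers:
  - name: app
    image: my-app:1.0
    resources:
      limits:
        memory: "512Mi"        # raise this if you keep seeing OOMKilled
    securityContext:
      runAsUser: 1000          # match the user your hardened image expects
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080             # must be the port the app actually listens on
```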


3. The Pro-Tip: The "Sleeper" Debug

If you still can't find the bug because the container crashes too fast to inspect, override the entrypoint.

Update your deployment YAML to just run a sleep loop:

command: ["/bin/sh", "-c", "while true; do sleep 30; done;"]

Now the pod will stay "Running," and you can kubectl exec -it <pod> -- /bin/sh to poke around the environment manually!
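In context, the override sits alongside the image in the container spec (a sketch; the my-app names are placeholders):

```yaml
spec:
  containers:
    - name: my-app
      image: my-app:1.0
      # Entrypoint override: keeps the container alive for manual debugging.
      # Remember to remove it once you've found the bug.
      command: ["/bin/sh", "-c", "while true; do sleep 30; done;"]
```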



Automate PII Redaction: Building Privacy-First Data Pipelines in Python

The "Privacy-First" Pipeline: How to Auto-Redact PII with Python

As a Data Engineer, you are the gatekeeper. If an email address...