It’s the most famous (and frustrating) status in the Kubernetes world. You run kubectl get pods, and there it is: 0/1 CrashLoopBackOff.
Despite the scary name, CrashLoopBackOff isn’t actually the error—it’s Kubernetes telling you: "I tried to start your app, it died, I waited, and I’m about to try again."
Here is the "Triple Post" finale to get your cluster healthy before the weekend.
1. The "First 3" Commands
Before you start guessing, run these three commands in order. They tell you 90% of what you need to know.
| Command | Why run it? |
|---|---|
| `kubectl describe pod <name>` | Look at the Events section at the bottom. It often tells you why the container failed (e.g., OOMKilled). |
| `kubectl logs <name> --previous` | Crucial. This shows the logs from the failed instance, before it restarted. |
| `kubectl get events --sort-by=.metadata.creationTimestamp` | Shows a timeline of cluster-wide issues (like node pressure). |
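If you want these ready to paste, here is the same sequence with placeholder names (swap in your own pod and namespace):

```bash
# "my-app-xyz" and "my-namespace" are placeholders; use your own pod name and namespace.
kubectl describe pod my-app-xyz -n my-namespace
kubectl logs my-app-xyz -n my-namespace --previous
kubectl get events -n my-namespace --sort-by=.metadata.creationTimestamp
```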
2. The Usual Suspects
If the logs are empty (a common headache!), the issue is likely happening before the app even starts.
- **OOMKilled:** Your container exceeded its memory limit.
  - Fix: Increase `resources.limits.memory`.
- **Config errors:** You referenced a Secret or ConfigMap that doesn't exist, or has a typo.
  - Fix: Check the `kubectl describe pod` output for "MountVolume.SetUp failed".
- **Permissions:** Your app is trying to write to a directory it doesn't own (standard in hardened images).
  - Fix: Check your `securityContext` or the Dockerfile `USER` permissions.
- **Liveness probe failure:** Your app is actually running fine, but the probe is checking the wrong port.
  - Fix: Double-check `livenessProbe.httpGet.port` (see the manifest sketch after this list).
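Most of these fixes live in the pod spec. Here is a minimal, hypothetical Deployment sketch (the name, image, port, and values are placeholders, not taken from any real app) showing where each field goes:

```yaml
# Hypothetical manifest; names, image, port, and values are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-registry/my-app:1.0   # placeholder image
          resources:
            requests:
              memory: "256Mi"
            limits:
              memory: "512Mi"             # raise this if you keep seeing OOMKilled
          securityContext:
            runAsNonRoot: true
            runAsUser: 1000               # must be a user that can write where the app writes
          livenessProbe:
            httpGet:
              path: /healthz              # placeholder path
              port: 8080                  # must match the port the app actually listens on
            initialDelaySeconds: 10
```

Apply it with `kubectl apply -f deployment.yaml` and re-run `kubectl describe pod` to confirm the new limits and probe settings took effect.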
3. The Pro-Tip: The "Sleeper" Debug
If you still can't find the bug because the container crashes too fast to inspect, override the entrypoint.
Update your deployment YAML to just run a sleep loop:
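A minimal sketch of that override, assuming a standard Deployment (the container name and image are placeholders):

```yaml
# Hypothetical override; container name and image are placeholders.
# Replaces the image's entrypoint so the container idles instead of crashing.
spec:
  template:
    spec:
      containers:
        - name: my-app
          image: my-registry/my-app:1.0
          command: ["/bin/sh", "-c"]
          args: ["sleep infinity"]   # or "while true; do sleep 3600; done" if your shell's sleep lacks infinity
```

If the Deployment also defines a liveness probe, remove or relax it while debugging, or the probe will keep restarting your now-sleeping container. Remember to revert the override when you're done.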
Now the pod will stay "Running," and you can kubectl exec -it <pod> -- /bin/sh to poke around the environment manually!


