Prevent Pipeline Crashes: Real-Time Data Validation with Pydantic



Stop Broken Pipelines: Real-Time Data Validation with Pydantic

In modern data engineering, "Garbage In, Garbage Out" is no longer just a saying—it's a financial risk. If your Python ETL script expects a price as a float but receives a null or a string, your pipeline crashes, and your downstream stakeholders lose trust.

The solution? Contract-driven development using Pydantic.

1. What is Pydantic?

Pydantic is a data validation library for Python that enforces type hints at runtime. Instead of writing 50 if/else statements to check your data, you define a Schema (a Class), and Pydantic does the heavy lifting.

2. The Problem: The "Silent Fail"

Look at this standard dictionary from an API. If price is missing or id is a string instead of an int, your SQL database might reject it.

raw_data = {"id": "101", "name": "Sensor_A", "price": "None"}
# This will break your DB!

3. The Solution: Defining a Data Contract

With Pydantic, we create a "Gatekeeper" for our data.

from pydantic import BaseModel, field_validator
from typing import Optional

class UserData(BaseModel):
id: int
name: str
price: float
status: Optional[str] = "active"

# We can even add custom logic!
@field_validator('price')
def price_must_be_positive(cls, v):
if v < 0:
raise ValueError('Price cannot be negative')
return v

# Now, let's validate the "dirty" data
try:
clean_data = UserData(**raw_data)
print(clean_data.model_dump())
except Exception as e:
print(f"Data Validation Failed: {e}")

4. Integrating with Airflow or Kafka

In 2026, the best practice is to put this validation at the very start of your pipeline (the Ingestion layer).

  • If validation fails: Route the "dirty" data to a Dead Letter Queue (DLQ) for manual review.

  • If validation passes: Load the data into your Warehouse (Snowflake/BigQuery).

No comments:

Post a Comment

Terraform for Data Engineers: How to Automate Your Database Setup

  Stop Manual Setup: Deploy a PostgreSQL Database with Terraform If you are still manually creating databases in the AWS or Azure console, y...