DataTipss | Data Engineering, DevOps & Cybersecurity Blog: Agentic AI

Traditional Airflow DAGs are "brittle." If the source data changes from a comma (,) to a pipe (|) delimiter, the task fails, the pipeline stops, and you have to fix it manually.

In this guide, we’ll build a "Try-Heal-Retry" loop. We will use a Python agent that intercepts failures, asks an LLM (like GPT-4o or Claude 3.5) for a fix, and automatically retries the task with the new logic.

1. The Architecture: The "Healer" Loop

Instead of letting a failed PythonOperator task simply rerun the same code, we add custom logic so that the "Retry" phase is actually an "AI Repair" phase.
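
To make the loop concrete, here is a minimal plain-Python sketch of "Try-Heal-Retry" with no Airflow involved. The names `try_heal_retry`, `parse_header`, and `fake_healer` are hypothetical, and the healer is a stand-in for the LLM call, so treat this as an illustration of the control flow rather than the pipeline itself.

```python
from typing import Any, Callable, Dict


def try_heal_retry(
    task: Callable[[Dict[str, Any]], Any],
    healer: Callable[[Exception, Dict[str, Any]], Dict[str, Any]],
    params: Dict[str, Any],
    max_heals: int = 1,
) -> Any:
    """Run task(params); on failure, ask the healer for new params and retry."""
    heals_used = 0
    while True:
        try:
            return task(params)
        except Exception as exc:
            if heals_used >= max_heals:
                raise  # out of repair attempts: surface the last error
            # The healer returns only the parameters it wants to change.
            params = {**params, **healer(exc, params)}
            heals_used += 1


# Hypothetical task: splits a header line and expects three columns.
def parse_header(params: Dict[str, Any]) -> list:
    columns = "id|name|amount".split(params["delimiter"])
    if len(columns) != 3:
        raise ValueError("Delimiter mismatch detected!")
    return columns


# Stand-in for the LLM (an assumption): always proposes a pipe delimiter.
def fake_healer(exc: Exception, params: Dict[str, Any]) -> Dict[str, Any]:
    return {"delimiter": "|"}


result = try_heal_retry(parse_header, fake_healer, {"delimiter": ","})
print(result)
```

The `max_heals` cap matters: without it, a healer that keeps proposing bad fixes would loop forever, which is the same reason the DAG later sets a small `retries` value.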

2. The Secret Sauce: The `on_failure_callback`

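Airflow allows you to run a callback function when a task fails. This is where our AI Agent lives. One detail matters for a retry loop: `on_failure_callback` fires only when the task reaches its final failed state, after all retries are used up. A failed attempt that still has retries left triggers `on_retry_callback` instead, so an agent that must act before the next attempt belongs there.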
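
As a rough sketch of what such a callback receives: Airflow hands it a context dictionary that includes the task instance under `ti` and, for failure and retry callbacks, the raised exception under `exception`. The helper name `summarize_failure` and the fake context below are assumptions made so the example runs without an Airflow installation.

```python
from types import SimpleNamespace


def summarize_failure(context: dict) -> str:
    """Build a one-line description of a failed attempt from the callback context."""
    ti = context["ti"]
    exc = context.get("exception")  # may be missing, so use .get()
    return f"{ti.dag_id}.{ti.task_id} attempt {ti.try_number} failed: {exc!r}"


# Fake context standing in for what Airflow would pass to the callback.
fake_context = {
    "ti": SimpleNamespace(
        dag_id="self_healing_pipeline", task_id="run_etl", try_number=1
    ),
    "exception": ValueError("Delimiter mismatch detected!"),
}

summary = summarize_failure(fake_context)
print(summary)
```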

The Agent Logic:

Capture: Grab the last 50 lines of the task log and the failing code.
Consult: Send that "context" to an LLM with a strict prompt: "Find the error and return only the corrected Python parameters."
Execute: Store the fix in an Airflow Variable so that the task's next retry can read and apply it.
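
The Capture and Consult steps above can be sketched in plain Python. The helper names (`tail`, `build_prompt`, `consult`) are hypothetical, and `ask_llm` is an injected callable standing in for whichever LLM client you use, so nothing here depends on a specific API.

```python
import json
from typing import Callable


def tail(text: str, n: int = 50) -> str:
    """Capture: keep only the last n lines of a log."""
    return "\n".join(text.splitlines()[-n:])


def build_prompt(log_text: str, source_code: str) -> str:
    """Combine the log tail and the failing code into one strict prompt."""
    return (
        "An Airflow task failed. Find the error and return only the "
        "corrected Python parameters as a JSON object.\n\n"
        f"--- last log lines ---\n{tail(log_text)}\n\n"
        f"--- task code ---\n{source_code}"
    )


def consult(prompt: str, ask_llm: Callable[[str], str]) -> dict:
    """Consult: send the prompt and insist on a JSON object in the reply."""
    fix = json.loads(ask_llm(prompt))
    if not isinstance(fix, dict):
        raise ValueError("LLM reply was valid JSON but not an object")
    return fix


# Demo with a 60-line log and a canned reply instead of a real model.
demo_log = "\n".join(f"line {i}" for i in range(1, 61))
demo_prompt = build_prompt(demo_log, "pd.read_csv(path, sep=',')")
demo_fix = consult(demo_prompt, lambda _prompt: '{"delimiter": "|"}')
print(demo_fix)
```

Trimming the log keeps the prompt small and cheap, and passing the model in as a function makes the logic easy to test without network access.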

3. Step-by-Step Implementation

Step A: The "Healer" Function

This function acts as your 24/7 on-call engineer.

from airflow.models import Variable
from openai import OpenAI

def ai_healer_agent(context):
    """Airflow callback: ask an LLM for a fix and store it for the next attempt."""
    task_instance = context['ti']
    # Airflow passes the raised exception to failure and retry callbacks
    error_log = repr(context.get('exception'))

    prompt = (
        f"The Airflow task '{task_instance.task_id}' failed: {error_log}. "
        "Suggest a fix in JSON format."
    )

    # AI identifies if it's a schema change, connection issue, or syntax error
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # request parseable JSON only
    )

    # Store the 'fix' in an Airflow Variable for the next retry
    Variable.set("last_ai_fix", response.choices[0].message.content)
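
Because the stored value comes straight from a model, it is worth checking it before the next attempt trusts it. The following is a minimal sketch of such a check, assuming the fix is a flat JSON object of parameters. The allowlist `ALLOWED_PARAMS` and the function `validate_fix` are hypothetical names, and the three parameters are only examples.

```python
import json
from typing import Optional

# Hypothetical allowlist: the only parameters the agent may change, with types.
ALLOWED_PARAMS = {"delimiter": str, "encoding": str, "skip_rows": int}


def validate_fix(raw: Optional[str]) -> Optional[dict]:
    """Return the allowlisted part of an LLM reply, or None if nothing is usable."""
    try:
        candidate = json.loads(raw)
    except (TypeError, json.JSONDecodeError):
        return None  # not JSON at all (or no reply)
    if not isinstance(candidate, dict):
        return None
    fix = {
        key: value
        for key, value in candidate.items()
        if key in ALLOWED_PARAMS
        and isinstance(value, ALLOWED_PARAMS[key])
        and not isinstance(value, bool)  # bool is a subclass of int in Python
    }
    return fix or None


print(validate_fix('{"delimiter": "|", "command": "rm -rf /"}'))
print(validate_fix("Sure! Try a pipe delimiter."))
```

In the healer, the idea would be to call `Variable.set` only when `validate_fix` returns a dictionary, so a chatty or malformed reply never reaches the task.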

Step B: The Self-Healing DAG

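We use Airflow's native `retries` setting to run the task again after the agent has stored a fix. (A library such as tenacity could play the same role inside the Python callable, but it is not needed here.)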

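import json
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python import PythonOperator

# ai_healer_agent is the callback defined in Step A

def my_data_task(**kwargs):
    # Check if the AI Agent left a 'fix' for us
    fix = Variable.get("last_ai_fix", default_var=None)
    params = json.loads(fix) if fix else {}
    delimiter = params.get("delimiter", ",")
    # ... use the fix to run the code (e.g., read the file with `delimiter`) ...
    if delimiter != "|":  # Example failure: the source file switched to pipes
        raise ValueError("Delimiter mismatch detected!")

with DAG('self_healing_pipeline', start_date=datetime(2026, 1, 1), schedule='@daily') as dag:

    run_etl = PythonOperator(
        task_id='run_etl',
        python_callable=my_data_task,
        # A failed attempt with retries left triggers on_retry_callback;
        # on_failure_callback only runs after the final attempt fails.
        on_retry_callback=ai_healer_agent,  # The Agent kicks in here!
        retries=1,
    )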
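
The "use the fix" placeholder in the task could look something like the sketch below, which reads CSV text with whatever delimiter the stored fix names. The function `read_rows`, the sample data, and the three-column expectation are all assumptions for illustration; in the DAG the `fix_json` argument would come from the `last_ai_fix` Variable.

```python
import csv
import io
import json
from typing import List, Optional


def read_rows(
    raw_text: str, fix_json: Optional[str] = None, expected_columns: int = 3
) -> List[List[str]]:
    """Parse CSV text, applying a stored fix (if any) on top of the defaults."""
    params = {"delimiter": ","}
    if fix_json:
        params.update(json.loads(fix_json))
    reader = csv.reader(io.StringIO(raw_text), delimiter=params["delimiter"])
    rows = list(reader)
    # A wrong delimiter shows up as rows with the wrong number of columns.
    if any(len(row) != expected_columns for row in rows):
        raise ValueError("Delimiter mismatch detected!")
    return rows


sample = "id|name|amount\n1|Ada|9.50\n"

try:
    read_rows(sample)  # first attempt: default comma delimiter
except ValueError as err:
    print("first attempt failed:", err)

healed_rows = read_rows(sample, '{"delimiter": "|"}')  # retry with the fix
print(healed_rows)
```

Checking the column count is what turns a silent mis-parse into an explicit failure, and only an explicit failure gives the agent something to react to.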

4. Why this is the "Future" of DataTipss

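MTTR (Mean Time To Recovery): For failures the agent can diagnose, such as a changed delimiter, recovery can drop from hours of manual work to roughly the time of one LLM call plus one retry.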
Cost: You only pay for the LLM API call when a failure actually happens.
Human-in-the-loop: You can set the agent to "Suggest" a fix via Slack for you to approve with one click, rather than fully auto-fixing.
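
One way to picture the "Suggest" mode is a small router that either stores the fix or sends it to a human. This is only a sketch: `handle_fix`, the `mode` values, and the injected `store` and `notify` callables are hypothetical. The payload shape with a `text` field matches what Slack incoming webhooks accept, but the one-click approval button itself is not shown here.

```python
import json
from typing import Callable


def handle_fix(
    fix: dict,
    mode: str,
    store: Callable[[str], None],
    notify: Callable[[dict], None],
) -> str:
    """Apply the fix directly in 'auto' mode, otherwise ask a human first."""
    serialized = json.dumps(fix)
    if mode == "auto":
        store(serialized)  # e.g. write to the Airflow Variable
        return "applied"
    # Suggest mode: nothing is stored until someone approves.
    notify({"text": f"Proposed fix for run_etl: {serialized}. Approve to apply."})
    return "pending_approval"


# Demo with lists standing in for the Variable store and the Slack channel.
stored, sent = [], []
print(handle_fix({"delimiter": "|"}, "suggest", stored.append, sent.append))
print(sent[0]["text"])
```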

Build a Self-Healing Airflow Pipeline: Using AI Agents to Auto-Fix Errors
