
What is Apache Airflow?


Apache Airflow is an open-source platform for developing, scheduling, and monitoring batch-oriented workflows. Airflow's extensible Python framework allows you to write code and create DAGs for your workflows.

What is a DAG?


A DAG is a directed acyclic graph, which means that while creating or designing a DAG, no cycle should be formed. For example, if task A executes task B, and task B executes task C, then task C should not call task A again, otherwise a cycle is formed. In other words, a DAG must not contain a circular dependency.
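
To make the no-cycle rule concrete, here is a minimal sketch with made-up task names (it assumes a recent Airflow 2.x release where EmptyOperator is available):

from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Three placeholder tasks wired as A -> B -> C, which is a valid DAG
with DAG(dag_id="cycle_demo", start_date=datetime(2022, 1, 1), schedule=None) as dag:
    task_a = EmptyOperator(task_id="task_a")
    task_b = EmptyOperator(task_id="task_b")
    task_c = EmptyOperator(task_id="task_c")

    task_a >> task_b >> task_c
    # task_c >> task_a  # this edge would close a cycle, and Airflow would refuse to load the DAG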

A DAG represents a workflow, which is a collection of tasks. Below is how a DAG looks:




Airflow has a nice GUI through which you can manage variables and other admin tasks. You can also view your DAGs in the GUI, trigger them, or cancel them; almost everything can be managed from the GUI.

Use Case for Airflow


Suppose you have a Unix job that processes a text file at a specific time and then calls another job that does its part. Sometimes the file is missing, yet the first job still runs and fails, and the dependent job also runs and processes no data. To overcome this scenario we can use Airflow and create a DAG in which all tasks depend on each other: the first job is triggered only when the file arrives, and the next job runs only once the first job has succeeded.
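
As a rough sketch of that idea (the DAG id, file path, schedule, and echo commands below are invented for illustration, and the FileSensor check relies on Airflow's default fs_default filesystem connection), such a flow could look like this:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.filesystem import FileSensor

with DAG(dag_id="file_feed", start_date=datetime(2022, 1, 1), schedule="0 6 * * *") as dag:
    # Wait for the input file instead of blindly running at a fixed time
    wait_for_file = FileSensor(
        task_id="wait_for_file",
        filepath="/data/incoming/feed.txt",  # hypothetical landing path
        poke_interval=60,  # re-check every minute until the file shows up
    )

    # First job: runs only once the file is actually there
    process_file = BashOperator(
        task_id="process_file",
        bash_command="echo 'processing feed.txt'",  # placeholder for the real processing script
    )

    # Dependent job: runs only after processing succeeded
    load_results = BashOperator(
        task_id="load_results",
        bash_command="echo 'loading results'",  # placeholder for the real downstream job
    )

    wait_for_file >> process_file >> load_results

With this wiring, a missing file simply keeps the sensor waiting instead of letting the downstream jobs run against nothing.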

Let's write your first DAG


from datetime import datetime
from airflow import DAG
from airflow.decorators import task
from airflow.operators.bash import BashOperator

# A DAG represents a workflow, a collection of tasks
with DAG(dag_id="demo", start_date=datetime(2022, 1, 1), schedule="0 0 * * *") as dag:

    # Tasks are represented as operators
    hello = BashOperator(task_id="hello", bash_command="echo hello")

    @task()
    def airflow():
        print("airflow")

    # Set dependencies between tasks
    hello >> airflow()

In the above code you need to import DAG from airflow, along with BashOperator and the task decorator.

When you have a requirement to execute Linux commands in your workflow, use the BashOperator. Similarly, when you have a requirement to write a Python function and execute it, use the PythonOperator (or the @task decorator shown above, which wraps a plain Python function as a task). In summary, the BashOperator is used to execute bash commands and the PythonOperator is used to execute Python callables.
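
For example, here is a minimal sketch showing both operators side by side (the DAG id, task ids, and commands are made up; PythonOperator is imported from airflow.operators.python in Airflow 2.x):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def say_hello():
    # plain Python function executed by the PythonOperator
    print("hello from python")

with DAG(dag_id="operator_demo", start_date=datetime(2022, 1, 1), schedule=None) as dag:
    bash_task = BashOperator(task_id="bash_task", bash_command="echo hello from bash")
    python_task = PythonOperator(task_id="python_task", python_callable=say_hello)

    bash_task >> python_task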

Now, once you have imported everything required, the next step is to define your DAG.
When the Airflow scheduler parses your file, it looks for the DAG object, created here with the "with DAG(...)" block, and registers a DAG based on the config you passed.

After that, a BashOperator is used to create a task named "hello" that executes a shell command. Remember that you need to give a unique task_id to each task. And on the last line you plan your flow, i.e. which task has to run first and which runs after it, as illustrated in the sketch below.
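
Here is a small sketch of the different ways the flow can be arranged (the DAG id, task names, and echo commands are made up for illustration):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="flow_demo", start_date=datetime(2022, 1, 1), schedule=None) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    load = BashOperator(task_id="load", bash_command="echo load")

    extract >> transform >> load       # extract runs first, then transform, then load
    # load << transform                # << is the same edge written in the other direction
    # extract >> [transform, load]     # a list fans out: both tasks would run after extract, in parallel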

Once you save the file, the Airflow scheduler will automatically pick up the new file and create the DAG. If any error occurs, it will be displayed at the top of the Airflow UI.




Conclusion


Airflow is an awesome framework to use. It can be very useful for orchestrating legacy Unix jobs and flows. We just need to choose Airflow for the right use case, and it can solve those problems easily.

