Mastering Task Dependencies in Apache Airflow
Hello everyone!
Welcome back to Day 5 of our 15-day journey into mastering Apache Airflow! Today's topic builds on the foundations of DAGs and task definitions to introduce a critical concept: Defining Task Dependencies. Task dependencies allow us to create complex, real-world data workflows that execute seamlessly, even when branching, conditional logic, and shared data come into play.
In previous chapters, we explored simple DAGs with linear dependencies. However, real-world workflows require a more nuanced approach to defining task dependencies. Letโs dive into how Airflow manages these dependencies and look at examples of how to implement them.
Basic Dependencies
1- Linear Dependencies
This is the most straightforward form of dependency, where tasks are executed one after another in sequence.
Example:
extract >> transform >> load
In this example, the task extract runs first, followed by transform, and finally load. Each task must complete successfully before the next one begins.
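To make this concrete, here is a minimal sketch of a linear DAG. It assumes Airflow 2.4+ (for the schedule argument) and uses EmptyOperator placeholders standing in for the real extract/transform/load logic; the DAG ID and dates are illustrative.
```python
# Minimal linear DAG sketch; EmptyOperator placeholders stand in for
# real extract/transform/load tasks (assumes Airflow 2.4+).
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="linear_etl",
    start_date=datetime(2024, 1, 1),
    schedule=None,
) as dag:
    extract = EmptyOperator(task_id="extract")
    transform = EmptyOperator(task_id="transform")
    load = EmptyOperator(task_id="load")

    # Each task starts only after its upstream task has succeeded.
    extract >> transform >> load
```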
2- Fan-out/Fan-in Patterns
In more complex workflows, a single task might trigger multiple tasks, or multiple tasks might converge into one task.
Example:
download_launches >> [get_pictures, download_metadata]
[get_pictures, download_metadata] >> notify
This structure allows multiple tasks to run in parallel, then converge back to a single task once all tasks have completed.
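Here is a sketch of the same pattern as a full DAG, again assuming Airflow 2.4+ and using EmptyOperator placeholders for the four tasks named above:
```python
# Fan-out/fan-in sketch; the four tasks are placeholders (assumes Airflow 2.4+).
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(dag_id="launch_pipeline", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    download_launches = EmptyOperator(task_id="download_launches")
    get_pictures = EmptyOperator(task_id="get_pictures")
    download_metadata = EmptyOperator(task_id="download_metadata")
    notify = EmptyOperator(task_id="notify")

    # Fan out: both downstream tasks start once download_launches succeeds.
    download_launches >> [get_pictures, download_metadata]
    # Fan in: notify waits for both parallel tasks to complete.
    [get_pictures, download_metadata] >> notify
```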
Branching Dependencies
Branching Within Tasks
Airflow supports conditional logic through branching. With branching, we can direct our workflow down different paths based on the outcome of a task.
Example:
from airflow.operators.python import BranchPythonOperator

def choose_path():
    # `condition` is a placeholder for whatever logic drives the branch.
    return "task_A" if condition else "task_B"

branch = BranchPythonOperator(
    task_id='branch_task',
    python_callable=choose_path,
)
In this case, the task branch_task will determine which path to follow (task_A or task_B) based on a condition.
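To complete the picture, here is a sketch of how the branch could be wired to its two paths. task_A and task_B are hypothetical EmptyOperator placeholders, and the snippet is assumed to live inside the same DAG as branch_task. The task whose ID choose_path returns is executed; the other path is marked as skipped.
```python
# Hypothetical downstream tasks for the branch defined above
# (assumed to sit inside the same DAG as branch_task).
from airflow.operators.empty import EmptyOperator

task_A = EmptyOperator(task_id="task_A")
task_B = EmptyOperator(task_id="task_B")

# The branch runs first; only the path whose task_id is returned executes.
branch >> [task_A, task_B]
```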
Branching Within DAGs
Branching can be used not only within tasks but also to split the entire DAG into different paths.
Conditional Tasks
Airflow also allows us to make tasks conditional based on task outcomes, enabling workflows to adapt dynamically to different situations.
Using Built-in Operators
Conditional tasks can be implemented using built-in Airflow operators like BranchPythonOperator, which allows tasks to run based on custom conditions.
Example:
def skip_task():
    # Returns the task_id of the downstream task that should run;
    # `some_condition` is a placeholder for the real check.
    return 'skip_task' if some_condition else 'run_task'

branching_task = BranchPythonOperator(
    task_id='branching_task',
    python_callable=skip_task,
)
Here, the task branching_task will determine whether to skip a task or execute it based on a condition.
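On Airflow 2.3 or newer, the same conditional logic can also be expressed with the @task.branch decorator instead of instantiating BranchPythonOperator directly. A minimal sketch, where 'run_task' and 'skip_task' are hypothetical downstream task IDs and some_condition is a placeholder:
```python
# Decorator form of the same branching logic (Airflow 2.3+);
# 'run_task' and 'skip_task' are hypothetical downstream task IDs.
from airflow.decorators import task

@task.branch
def decide_path():
    some_condition = True  # placeholder for real branching logic
    return "run_task" if some_condition else "skip_task"
```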
Trigger Rules
Airflow supports a variety of trigger rules to control when downstream tasks are executed. Some common trigger rules include:
| Trigger Rule | Behavior | Example Use Case |
|---|---|---|
| all_success | (default) Triggers when all parent tasks have completed successfully | The default trigger rule for a normal workflow |
| all_failed | Triggers when all parent tasks have failed | Trigger error-handling code when every task in a group fails |
| all_done | Triggers when all parents have finished executing, regardless of state | Execute cleanup code (e.g., shutting down a machine after all tasks finish) |
| one_failed | Triggers as soon as at least one parent has failed | Quickly trigger error handling (e.g., notifications or rollbacks) |
| one_success | Triggers as soon as one parent succeeds | Trigger downstream computations/notifications as soon as one result is ready |
| none_failed | Triggers if no parents have failed, i.e., all have either succeeded or been skipped | Join conditional branches back together in branching workflows |
| none_skipped | Triggers if no parents were skipped, regardless of success or failure | Trigger tasks if all upstream tasks were actually executed, regardless of their results |
| dummy | Triggers regardless of the state of any upstream tasks (renamed to always in newer Airflow versions) | Useful for testing |
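As an example of how a trigger rule is applied in practice, here is a sketch of a join task that merges the conditional branches, reusing the hypothetical task_A and task_B from the branching sketch earlier:
```python
# Sketch: joining two conditional branches with the none_failed trigger rule
# (task_A / task_B are the hypothetical branch tasks from above).
from airflow.operators.empty import EmptyOperator
from airflow.utils.trigger_rule import TriggerRule

join = EmptyOperator(task_id="join", trigger_rule=TriggerRule.NONE_FAILED)

# With the default all_success rule, the join would be skipped whenever one
# branch is skipped; none_failed only requires that no upstream task failed.
[task_A, task_B] >> join
```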
Sharing Data Between Tasks
Airflow uses XComs (short for “cross-communications”) to share data between tasks. This feature allows one task to pass information to another, ensuring smooth communication throughout the DAG.
Using XComs
Inside a task's callable, data can be pushed to and pulled from XComs through the task instance (ti), like in this example:
ti.xcom_push(key='data', value=my_data)
data = ti.xcom_pull(task_ids='task_A', key='data')
Here, the first call runs inside task_A's callable and the second inside task_B's callable, so task_B can access data produced by task_A.
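For a fuller picture, here is a minimal sketch of two PythonOperator tasks exchanging a small value via XCom; the task names and payload are illustrative, and the snippet is assumed to sit inside a DAG definition.
```python
# Sketch of XCom push/pull between two tasks (illustrative names and payload;
# assumed to live inside a DAG definition, Airflow 2.x).
from airflow.operators.python import PythonOperator

def produce(ti):
    # Push a small value (e.g., an ID) for downstream tasks to read.
    ti.xcom_push(key="data", value={"record_id": 42})

def consume(ti):
    # Pull the value pushed by task_A.
    payload = ti.xcom_pull(task_ids="task_A", key="data")
    print(f"Received {payload}")

task_A = PythonOperator(task_id="task_A", python_callable=produce)
task_B = PythonOperator(task_id="task_B", python_callable=consume)

task_A >> task_B
```
Note that a PythonOperator callable's return value is also pushed automatically under the return_value key, so an explicit xcom_push is only needed when you want a custom key.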
When Not to Use XComs?
Although XComs are useful, they should not be used to pass large datasets. XComs are stored in the metadata database, so they are better suited for small pieces of data like IDs or messages.
TaskFlow API: Chaining Python Tasks
Airflow's TaskFlow API simplifies the process of chaining Python tasks together. It provides decorators that automatically convert Python functions into tasks, reducing the boilerplate code needed.
Example:
from airflow.decorators import task

@task
def extract():
    data = {"rows": [1, 2, 3]}  # placeholder payload
    return data

@task
def transform(data):
    transformed_data = [row * 10 for row in data["rows"]]  # placeholder logic
    return transformed_data

@task
def load(transformed_data):
    print("Loading data:", transformed_data)
In this case, the @task decorator converts each Python function into an Airflow task, and the workflow can be easily built by chaining them together.
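Below is a sketch of how these decorated functions could be wired into a DAG; the DAG ID and dates are placeholders, and Airflow 2.4+ is assumed for the schedule argument.
```python
# Chaining the decorated tasks above inside a DAG (placeholder IDs and dates;
# assumes Airflow 2.4+ and the extract/transform/load functions defined above).
from datetime import datetime

from airflow import DAG

with DAG(dag_id="etl_taskflow", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    # Calling the decorated functions wires both the data passing (via XCom)
    # and the dependencies: extract >> transform >> load.
    load(transform(extract()))
```
Because the intermediate values travel between tasks as XComs, the same size caveat from the previous section applies here too.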
Pro Tip for Learners:
Use trigger rules wisely when designing complex DAGs to handle various failure conditions. These rules allow you to create robust workflows that adapt to different scenarios, preventing a single failure from derailing the entire pipeline. Also, avoid overusing XComs for large datasets; opt for efficient external storage instead!
Refer to Chapter 5 of “Data Pipelines with Apache Airflow” for a more in-depth treatment of each concept mentioned in this blog.
Cheers, Aditya