🚀 Introduction to Data Pipelines with Apache Airflow (with a Setup Guide)
Hello everyone! 👋
Welcome to the first post of my 15-day series on learning Apache Airflow! Every day, I’ll break down one chapter from the book “Data Pipelines with Apache Airflow” by Bas Harenslak & Julian Rutger de Ruiter and give you a brief, to-the-point summary to help you get started with data pipelines. Let’s dive in!
📘 What is Apache Airflow?
Apache Airflow is an open-source platform that allows you to orchestrate complex data workflows. It was developed to automate the scheduling and monitoring of tasks involved in data pipelines. These pipelines consist of multiple processes that need to run in a specific order (often with dependencies).
If you’ve ever needed to run data extraction, transformation, and loading (ETL) processes on a daily or hourly basis, Airflow is designed to help!
✨ Key Concepts Introduced in Chapter 1:
Workflows & DAGs (Directed Acyclic Graphs)
- Airflow manages workflows by representing them as DAGs.
- A DAG is a collection of tasks that are executed based on dependencies, with a directed flow from start to finish.
- Example: A data pipeline where raw data is first extracted, then transformed, and finally loaded into a database (sketched just below).
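To make this concrete, here’s a minimal sketch of such a pipeline as an Airflow DAG. This is my own illustration rather than code from the book, and the extract/transform/load functions are just placeholders:
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

# Placeholder callables standing in for real ETL logic
def extract():
    print("extracting raw data")

def transform():
    print("transforming data")

def load():
    print("loading data into the database")

with DAG('etl_sketch', start_date=datetime(2024, 10, 1), schedule_interval='@daily') as dag:
    extract_task = PythonOperator(task_id='extract', python_callable=extract)
    transform_task = PythonOperator(task_id='transform', python_callable=transform)
    load_task = PythonOperator(task_id='load', python_callable=load)

    # The directed edges of the DAG: extract first, then transform, then load
    extract_task >> transform_task >> load_task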
Tasks in Airflow
- Each node in a DAG represents a task.
- Tasks can be anything from running a Python script to querying a database.
- Tasks run in parallel or sequentially based on the DAG’s dependencies; see the sketch below.
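Here’s a small sketch (again my own example, using EmptyOperator as a stand-in for real work) of how parallel and sequential execution falls out of the dependency structure: clean_sales and clean_users both depend only on extract, so Airflow is free to run them at the same time, while report waits for both.
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from datetime import datetime

with DAG('fan_out_sketch', start_date=datetime(2024, 10, 1), schedule_interval='@daily') as dag:
    extract = EmptyOperator(task_id='extract')
    clean_sales = EmptyOperator(task_id='clean_sales')
    clean_users = EmptyOperator(task_id='clean_users')
    report = EmptyOperator(task_id='report')

    # Fan out after extract, then fan back in before report
    extract >> [clean_sales, clean_users] >> report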
Schedulers
- Airflow has a scheduler that ensures tasks run at the right time and in the correct order.
- You can schedule tasks to run at specific intervals (e.g., daily, hourly).
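Schedules can be written either as presets like '@daily' or as standard cron expressions. A quick sketch with two hypothetical DAGs showing both styles:
from airflow import DAG
from datetime import datetime

# Preset: run once a day (at midnight, UTC by default)
daily_dag = DAG('daily_sketch', start_date=datetime(2024, 10, 1), schedule_interval='@daily')

# Cron expression: run at 06:30 every morning
cron_dag = DAG('cron_sketch', start_date=datetime(2024, 10, 1), schedule_interval='30 6 * * *')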
Monitoring
- Monitoring pipelines in Airflow is a breeze! With its web UI, you can track the status of each task and troubleshoot failures.
🛠️ Setting Up Apache Airflow
Before diving into creating DAGs, let’s install and set up Apache Airflow on your local machine. Here’s a step-by-step guide:
1. Set Up a Python Environment
It’s good practice to create a virtual environment so Airflow’s dependencies don’t conflict with other packages. To create and activate one, run:
python3 -m venv airflow_venv
source airflow_venv/bin/activate
2. Install Apache Airflow
Airflow should be installed against its official constraints file so that its many dependencies resolve to versions that are known to work together. Use the following commands:
export AIRFLOW_VERSION=2.6.3
export PYTHON_VERSION="$(python --version | cut -d " " -f 2 | cut -d "." -f 1-2)"
export CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
3. Initialize the Airflow Database
Airflow uses a database to store metadata about your DAGs and task runs; by default this is a local SQLite database, which is fine for learning. To initialize it, run:
airflow db init
4. Create a User for the Web UI
You’ll need a user account to log in to the Airflow UI. Run this command to create an admin user (you’ll be prompted to set a password):
airflow users create \
--username admin \
--firstname FIRST_NAME \
--lastname LAST_NAME \
--role Admin \
--email admin@example.com
5. Start the Airflow Web Server and Scheduler
Now that everything is set up, start the web server and the scheduler in separate terminals:
Start the web server:
airflow webserver --port 8080
And, start the scheduler:
airflow scheduler
You can now access the web UI:
Go to http://localhost:8080 in your browser, log in with the credentials you created earlier, and you’re all set!
👨‍💻 My Takeaways
Reading the first chapter, I learned that the core strength of Airflow lies in its ability to manage workflows efficiently. By automating dependencies and scheduling tasks, it saves time and reduces manual effort in handling data pipelines. For someone new to data pipelines, Airflow helps visualize, monitor, and control these processes easily.
📝 My First DAG: a hands-on example I started with. Here’s a quick look at a basic DAG that runs a Python function daily:
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

# Define the DAG: runs daily, starting from 1 October 2024
dag = DAG('my_first_dag', start_date=datetime(2024, 10, 1), schedule_interval='@daily')

# Define the task callable
def my_first_task():
    print("Hello, this is my first Airflow task!")

# Wrap the callable in a PythonOperator and attach it to the DAG
task = PythonOperator(
    task_id='print_hello',
    python_callable=my_first_task,
    dag=dag,
)
This simple DAG runs the my_first_task() Python function once a day, printing “Hello, this is my first Airflow task!” to the logs.
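If you don’t want to wait for the scheduled run, you can execute the task once on its own (assuming the file is saved in your DAGs folder):
airflow tasks test my_first_dag print_hello 2024-10-01
This runs the single task locally without recording anything in the metadata database and prints its output straight to your terminal.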
💡 Pro Tip for New Learners
Don’t get overwhelmed by the jargon. Focus on understanding the flow of tasks (DAGs) and how to structure your pipeline, and the rest will follow!
🔔 Stay tuned for Day 2 where I’ll cover Airflow’s Architecture in detail. If you have any questions or thoughts about today’s post, feel free to drop them in the comments below. Let’s learn together!
Cheers, Aditya