Infrastructure February 7, 2026 · 10 min

Apache Airflow: the orchestrator, not the pipeline

Airflow schedules, monitors, and retries your data workflows. It does not process data. Confusing these roles is the most common Airflow mistake.


A client came to us with a data pipeline that had grown organically over two years: 40+ cron jobs, some calling Python scripts, some running SQL files, some triggering other cron jobs via touch files. The dependencies existed in a wiki page that was last updated 8 months ago. When the daily ETL failed, debugging meant SSH-ing into three servers and reading log files.

We replaced the entire system with Airflow in 4 weeks. Not because Airflow is magic, but because it solves the right problem: orchestration. Scheduling, dependency management, retries, alerting, and visibility. The actual data processing still happens in PostgreSQL, pandas, and dbt. Airflow just makes sure it runs in the right order, at the right time, and that someone knows when it doesn't.

The problem

Data pipelines are not individual scripts. They're dependency graphs. Extract depends on the source API being available. Transform depends on extract completing. Load depends on transform succeeding. Report generation depends on load. Alert dispatch depends on report generation detecting anomalies.

Cron doesn't understand dependencies. It runs jobs at times, not in sequences. When Job B depends on Job A, you schedule Job B 30 minutes after Job A and hope A finishes in time. When it doesn't -- because the source API was slow, or the data volume doubled -- everything downstream fails silently.

How we think about orchestration

Three principles:

  1. The orchestrator schedules and monitors. It does not process data. Airflow triggers a dbt run, a Spark job, or a Python script. It doesn't run the ETL logic itself. Stuffing data processing into Airflow operators is the most common Airflow anti-pattern.
  2. Every task must be idempotent. Running the same task twice with the same inputs must produce the same result. This enables backfills and retries without corruption.
  3. DAGs are code, not config. DAG files are Python. They live in Git. They're reviewed, tested, and versioned like application code.
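Principle 2 in practice: the simplest idempotent load is delete-then-insert on a date partition, inside one transaction. A minimal sketch using SQLite as a stand-in for the warehouse (the table name and columns are illustrative, not from the client system):

```python
import sqlite3

def load_deliveries(conn, ds, rows):
    """Idempotently load one day's partition: running this twice
    with the same inputs leaves the table in the same state."""
    with conn:  # one transaction: the delete and insert commit together
        conn.execute("DELETE FROM deliveries WHERE ds = ?", (ds,))
        conn.executemany(
            "INSERT INTO deliveries (ds, truck_id, km) VALUES (?, ?, ?)",
            [(ds, r["truck_id"], r["km"]) for r in rows],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE deliveries (ds TEXT, truck_id TEXT, km REAL)")
rows = [{"truck_id": "T1", "km": 120.5}, {"truck_id": "T2", "km": 98.0}]
load_deliveries(conn, "2025-06-01", rows)
load_deliveries(conn, "2025-06-01", rows)  # a retry: no duplicate rows
count = conn.execute("SELECT COUNT(*) FROM deliveries").fetchone()[0]
print(count)  # 2, not 4
```

Because the second run produces the same state as the first, Airflow can retry or backfill the task freely without corrupting the partition.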

A real DAG from production

This DAG runs daily for a logistics client. It extracts delivery data from an API, transforms it, loads it into a warehouse, and generates a fleet utilization report.

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta

# The ETL callables live in a separate module; this import path is illustrative
from fleet_etl.tasks import extract_from_fleet_api, validate_delivery_records

default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,
    "email": ["ops@commitx.dev"],
}

with DAG(
    "daily_fleet_etl",
    default_args=default_args,
    schedule="0 6 * * *",
    start_date=datetime(2025, 6, 1),
    catchup=False,
    tags=["logistics", "etl"],
) as dag:

    extract = PythonOperator(
        task_id="extract_delivery_data",
        python_callable=extract_from_fleet_api,
        op_kwargs={"date": "{{ ds }}"},
    )

    validate = PythonOperator(
        task_id="validate_records",
        python_callable=validate_delivery_records,
    )

    transform = BashOperator(
        task_id="run_dbt_transform",
        bash_command="cd /opt/dbt && dbt run --select tag:fleet_daily",
    )

    load_report = PostgresOperator(
        task_id="generate_utilization_report",
        postgres_conn_id="warehouse",
        sql="sql/fleet_utilization.sql",
    )

    extract >> validate >> transform >> load_report

Note what Airflow does here: it calls external tools. extract_from_fleet_api is a function in a separate Python module. dbt run is an external process. The SQL runs in PostgreSQL. Airflow's job is to run them in order, retry on failure, and alert on sustained failure.
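For a sense of what such a callable might look like, here is a hypothetical sketch of `extract_from_fleet_api` (the function name comes from the DAG above; the fetch hook, output directory, and file layout are our illustrative assumptions). The key properties are that it is keyed by the execution date, so reruns overwrite cleanly, and that it returns a path rather than the data:

```python
import json
from pathlib import Path

def extract_from_fleet_api(date, fetch=None, out_dir="/tmp/fleet"):
    """Hypothetical sketch: pull one day of deliveries and write them
    to a date-keyed file, so a rerun for the same date is idempotent."""
    fetch = fetch or (lambda d: [])  # real code would call the fleet API here
    records = fetch(date)
    out = Path(out_dir) / f"deliveries_{date}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(records))
    return str(out)  # a small string, safe to hand to downstream tasks
```

In the DAG, Airflow passes the templated `{{ ds }}` as `date`, so each daily run writes its own partition file.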

Executor choices

The executor determines how Airflow runs tasks:

| Executor | When we use it |
| --- | --- |
| LocalExecutor | Development, single-server deployments under 20 DAGs |
| CeleryExecutor | Medium scale, 20-100 DAGs, shared infrastructure |
| KubernetesExecutor | Production at scale; each task runs in an isolated pod |

We deploy with KubernetesExecutor for client projects. Each task gets its own pod with its own resource limits and dependencies. A pandas task gets a 2GB memory pod. A lightweight API call gets a 256MB pod. No resource contention between tasks.
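Per-task sizing is expressed through the operator's `executor_config`, using a Kubernetes pod override. A sketch of what the 2GB pandas pod might look like (the structure follows the Kubernetes client models; the resource values mirror the numbers above):

```python
from kubernetes.client import models as k8s

heavy_pandas_pod = {
    "pod_override": k8s.V1Pod(
        spec=k8s.V1PodSpec(
            containers=[
                k8s.V1Container(
                    name="base",  # Airflow's task container is named "base"
                    resources=k8s.V1ResourceRequirements(
                        requests={"memory": "2Gi"},
                        limits={"memory": "2Gi"},
                    ),
                )
            ]
        )
    )
}

# Attached per task, e.g.:
# PythonOperator(task_id="transform", ..., executor_config=heavy_pandas_pod)
```

A lightweight API-call task would carry a similar override with a 256Mi limit, so the two never contend for the same resources.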

What we learned

  • XCom is for metadata, not data. Airflow's XCom system passes small values between tasks. We've seen teams use it to pass dataframes. This breaks at scale and stores data in the Airflow metadata database. Pass file paths or S3 keys instead.
  • TaskGroups replace SubDAGs. SubDAGs are deprecated and were always problematic (deadlocks, resource pool exhaustion). TaskGroups give you visual grouping without the operational pain.
  • The scheduler is a single point of failure. Airflow's scheduler needs to be highly available. We run 2 scheduler replicas in active-passive with shared metadata in PostgreSQL. A scheduler restart delays all DAGs by 30-60 seconds.
  • Backfill is Airflow's superpower. When we fixed a bug in the transform logic, we backfilled 90 days of data with one command: airflow dags backfill --start-date 2025-10-01 --end-date 2025-12-30 daily_fleet_etl. Every day re-ran in order, with retries and logging. This alone justifies Airflow over cron.
  • DAG file parsing overhead is real. Airflow parses every DAG file on a configurable interval (default: 30 seconds). If DAG files import heavy libraries at the top level, parsing slows the scheduler. Keep DAG files light -- defer imports inside callables.
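The XCom and parsing-overhead lessons combine naturally in one callable: keep heavy imports inside the function body so DAG parsing stays cheap, and return a path rather than the data so XCom only ever carries metadata. A hypothetical transform callable (the pandas usage, column names, and paths are illustrative):

```python
from pathlib import Path

def summarize_deliveries(ds, in_path, out_dir="/tmp/fleet"):
    """Aggregate one day of deliveries; returns a file path, not data."""
    import pandas as pd  # deferred: the scheduler never pays this import
                         # cost when it re-parses the DAG file

    df = pd.read_json(in_path)
    out = Path(out_dir) / f"utilization_{ds}.csv"
    df.groupby("truck_id")["km"].sum().to_csv(out)
    return str(out)  # a short string: safe to push through XCom
```

A downstream task then pulls the path from XCom and reads the file itself; the dataframe never touches the Airflow metadata database.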

The tradeoffs

  • Airflow is not lightweight. The minimum deployment is a scheduler, a webserver, a metadata database, and a message broker (for Celery) or Kubernetes access (for KubernetesExecutor). Budget a day for initial setup.
  • The web UI is functional, not beautiful. It shows DAG structure, task status, logs, and run history. It does not have a modern UX. It's a monitoring tool, not a product.
  • Python-only DAGs. DAGs are Python files. If your data team works in R or Scala, Airflow orchestrates their scripts as BashOperators, but the orchestration layer itself is Python.
  • Not for streaming. Airflow is batch-oriented. For real-time event processing, use Kafka + Flink or Kafka + a streaming consumer. Airflow handles the scheduled, batch, dependency-driven workflows.

Our recommendation

If you have more than 5 scheduled data jobs with dependencies between them, replace cron with Airflow. The dependency management, retry logic, backfill capability, and web-based monitoring are worth the setup cost.

Keep Airflow thin. It's the orchestrator, not the processing engine. Use dbt for SQL transformations, pandas or Spark for data processing, and external tools for everything else. Airflow's job is to call them in order and tell you when something breaks.

CommitX Technology (OPC) Pvt Ltd