
Apache Airflow

Airflow is a widely used workflow automation and scheduling platform for authoring and managing data pipelines. Airflow represents workflows as directed acyclic graphs (DAGs) of tasks. To learn more about Airflow, check out the Airflow documentation.

How does Airflow work with OpenLineage?

Understanding complex inter-DAG dependencies and providing up-to-date runtime visibility into DAG execution can be challenging. OpenLineage integrates with Airflow to collect DAG lineage metadata so that inter-DAG dependencies are easily maintained and viewable via a lineage graph, while also keeping a catalog of historical runs of DAGs.


The DAG metadata collected can answer questions like:

  • Why has a DAG failed?
  • Why has the DAG runtime increased after a code change?
  • What are the upstream dependencies of a DAG?

How can I use this integration?

PREREQUISITES

To use the OpenLineage Airflow integration, you'll need a running Airflow instance. You'll also need an OpenLineage-compatible backend.

INSTALLATION

To download and install the latest openlineage-airflow library, update the requirements.txt file of your running Airflow instance with:

openlineage-airflow

CONFIGURATION

Next, we'll need to specify where we want OpenLineage to send events. There are a few options; the simplest is to set the OPENLINEAGE_URL environment variable. For example, to send OpenLineage events to a local instance of Marquez, use:

OPENLINEAGE_URL=http://localhost:5000
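In practice, these variables need to be set in the environment of the Airflow scheduler and workers. A minimal sketch (the namespace value here is illustrative):

```shell
# Send OpenLineage events to a local Marquez instance
export OPENLINEAGE_URL=http://localhost:5000

# Optionally group this pipeline's lineage under a custom namespace
# (defaults to "default" if unset)
export OPENLINEAGE_NAMESPACE=my_pipeline_namespace
```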

To set up additional configuration, or to send events to targets other than an HTTP server (such as a Kafka topic), take a look at client configuration.

If you use a version of Airflow older than 2.3.0, additional configuration is required.
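For those older versions, the usual approach is to register OpenLineage as Airflow's lineage backend. The exact setting below is stated as an assumption for illustration; confirm it against the documentation for your Airflow and openlineage-airflow versions:

```shell
# Airflow < 2.3 only: register the OpenLineage lineage backend
# (equivalent to setting "backend" under [lineage] in airflow.cfg)
export AIRFLOW__LINEAGE__BACKEND=openlineage.lineage_backend.OpenLineageBackend
```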

Environment Variables

The following environment variables are available specifically for the Airflow integration.

| Name | Description | Since |
| --- | --- | --- |
| `OPENLINEAGE_AIRFLOW_DISABLE_SOURCE_CODE` | Set to `False` if you want source code of callables provided in `PythonOperator` or `BashOperator` NOT to be included in OpenLineage events | |
| `OPENLINEAGE_EXTRACTORS` | An optional semicolon-separated list of extractor classes, in case you need to use custom extractors. Example: `OPENLINEAGE_EXTRACTORS=full.path.to.ExtractorClass;full.path.to.AnotherExtractorClass` | |
| `OPENLINEAGE_NAMESPACE` | The optional namespace that the lineage data belongs to. If not specified, defaults to `default` | |
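To illustrate the `OPENLINEAGE_EXTRACTORS` format, here is a small self-contained sketch of how a semicolon-separated value splits into individual class paths. The `parse_extractor_paths` helper is hypothetical, written for illustration only, and is not part of openlineage-airflow:

```python
import os

def parse_extractor_paths(value):
    # Hypothetical helper: split the semicolon-separated list and
    # drop empty entries and surrounding whitespace.
    return [p for p in (part.strip() for part in value.split(";")) if p]

os.environ["OPENLINEAGE_EXTRACTORS"] = (
    "full.path.to.ExtractorClass;full.path.to.AnotherExtractorClass"
)

paths = parse_extractor_paths(os.environ["OPENLINEAGE_EXTRACTORS"])
print(paths)
# ['full.path.to.ExtractorClass', 'full.path.to.AnotherExtractorClass']
```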

USAGE

When enabled, the integration will:

  • On TaskInstance start, collect metadata for each task
  • Collect task input/output metadata (source, schema, etc.)
  • Collect task run-level metadata (execution time, state, parameters, etc.)
  • On TaskInstance complete, also mark the task as complete in Marquez
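Conceptually, each of these steps maps to an OpenLineage run event: one emitted at task start and one at completion, sharing the same run ID. A minimal sketch of that event shape (field values are illustrative; real events carry additional facets such as inputs, outputs, and source code, as defined by the OpenLineage spec):

```python
import json
import uuid
from datetime import datetime, timezone

def make_run_event(event_type, run_id, job_name):
    # Minimal OpenLineage-style run event; real events attach
    # facets for inputs/outputs, source code, parameters, etc.
    return {
        "eventType": event_type,  # "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": "default", "name": job_name},
        "inputs": [],
        "outputs": [],
    }

run_id = str(uuid.uuid4())
start = make_run_event("START", run_id, "my_dag.my_task")
complete = make_run_event("COMPLETE", run_id, "my_dag.my_task")
print(json.dumps(start, indent=2))
```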

Where can I learn more?

Feedback

You can reach out to us on Slack and leave us feedback!