Getting Started with Airflow and OpenLineage+Marquez
In this example, we'll walk you through how to enable Airflow DAGs to send lineage metadata to Marquez using OpenLineage.
You’ll Learn How To:
- configure Airflow to send OpenLineage events to Marquez
- write OpenLineage-enabled DAGs
- troubleshoot a failing DAG using Marquez
Table of Contents
- Step 1: Configure Your Astro Project
- Step 2: Add Marquez and Database Services Using Docker Compose
- Step 3: Start Airflow with Marquez
- Step 4: Write Airflow DAGs
- Step 5: View Collected Metadata
- Step 6: Troubleshoot a Failing DAG with Marquez
Prerequisites
Before you begin, make sure you have installed Docker, the Astro CLI, and curl, all of which are used in the steps below.
Note: We recommend allocating at least 2 CPUs and 8 GB of memory to Docker.
Configure Your Astro Project
Use the Astro CLI to create and run an Airflow project locally that will integrate with Marquez.
- In your project directory, create a new Astro project:

      $ mkdir astro-marquez-tutorial && cd astro-marquez-tutorial
      $ astro dev init

- Create a new directory named docker, change into it, and use curl to download the scripts required by the Marquez services:

      $ mkdir docker && cd docker
      $ curl -O "https://raw.githubusercontent.com/MarquezProject/marquez/main/docker/{entrypoint.sh,wait-for-it.sh}"
      $ cd ..

  After executing the above, your project directory should look like this:

      $ ls -a
      .              Dockerfile             packages.txt
      ..             README.md              plugins
      .astro         airflow_settings.yaml  requirements.txt
      .dockerignore  dags                   tests
      .env           docker
      .gitignore     include

- Add the OpenLineage Airflow Provider and the Common SQL Provider to the requirements.txt file:

      apache-airflow-providers-common-sql==1.7.2
      apache-airflow-providers-openlineage==1.1.0

  For details about the Provider and its minimum requirements, see the Airflow docs.

- To configure Astro to send lineage metadata to Marquez, add the following environment variables to your Astro project's .env file:

      OPENLINEAGE_URL=http://host.docker.internal:5000
      OPENLINEAGE_NAMESPACE=example
      AIRFLOW_CONN_EXAMPLE_DB=postgres://example:example@host.docker.internal:7654/example

  These variables allow Airflow to connect with the OpenLineage API and send events to Marquez.

- It is a good idea to have Airflow use a different port for Postgres than the default 5432, so run the following command to use port 5678 instead:

      astro config set postgres.port 5678

- Check the Dockerfile to verify that your installed version of the Astro Runtime is 9.0.0+ (to ensure that you will be using Airflow 2.7.0+). For example:

      FROM quay.io/astronomer/astro-runtime:9.1.0

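
The configuration in this step can be sanity-checked before any containers are started. What follows is a minimal Python sketch, not part of the Astro CLI or Airflow: the helper names (parse_env, describe_connection, runtime_version) are hypothetical, and the sample text is copied from the .env and Dockerfile examples above. It relies on Airflow's convention of reading a connection from an environment variable named AIRFLOW_CONN_<CONN_ID>, so AIRFLOW_CONN_EXAMPLE_DB corresponds to the connection ID example_db.

```python
import re
from urllib.parse import urlsplit

# Sample content copied from the tutorial; in a real project you would
# read the .env file and the Dockerfile from disk instead.
ENV_TEXT = """\
OPENLINEAGE_URL=http://host.docker.internal:5000
OPENLINEAGE_NAMESPACE=example
AIRFLOW_CONN_EXAMPLE_DB=postgres://example:example@host.docker.internal:7654/example
"""
DOCKERFILE_TEXT = "FROM quay.io/astronomer/astro-runtime:9.1.0\n"


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and comments (hypothetical helper)."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def describe_connection(var_name, uri):
    """Split an AIRFLOW_CONN_<CONN_ID> variable into its URI parts (hypothetical helper)."""
    parts = urlsplit(uri)
    return {
        "conn_id": var_name[len("AIRFLOW_CONN_"):].lower(),
        "conn_type": parts.scheme,
        "login": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "schema": parts.path.lstrip("/"),
    }


def runtime_version(dockerfile_text):
    """Return the Astro Runtime version from the FROM line as a tuple of ints."""
    match = re.search(
        r"^FROM\s+\S*astro-runtime:(\d+)\.(\d+)\.(\d+)",
        dockerfile_text,
        re.MULTILINE,
    )
    if match is None:
        raise ValueError("no astro-runtime FROM line found")
    return tuple(int(group) for group in match.groups())


env = parse_env(ENV_TEXT)
conn = describe_connection("AIRFLOW_CONN_EXAMPLE_DB", env["AIRFLOW_CONN_EXAMPLE_DB"])
print(conn)
print("Runtime is 9.0.0+:", runtime_version(DOCKERFILE_TEXT) >= (9, 0, 0))
```

Comparing version tuples rather than strings avoids mistakes such as treating 10.0.0 as older than 9.0.0.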
Add Marquez and Database Services Using Docker Compose
Astro supports manual configuration of services via Docker Compose using YAML.
Create a new file named docker-compose.override.yml in your project and paste the following into it:

    version: "3.1"
    services:
      web:
        image: marquezproject/marquez-web:latest
        container_name: marquez-web
        environment:
          - MARQUEZ_HOST=api
          - MARQUEZ_PORT=5000
        ports:
          - "3000:3000"
        depends_on:
          - api

      db:
        image: postgres:14.9
        container_name: marquez-db
        ports:
          - "6543:6543"
        environment:
          - POSTGRES_USER=marquez
          - POSTGRES_PASSWORD=marquez
          - POSTGRES_DB=marquez

      example-db:
        image: postgres:14.9
        container_name: example-db
        ports:
          - "7654:5432"
        environment:
          - POSTGRES_USER=example
          - POSTGRES_PASSWORD=example
          - POSTGRES_DB=example

      api:
        image: marquezproject/marquez:latest
        container_name: marquez-api
        environment:
          - MARQUEZ_PORT=5000
          - MARQUEZ_ADMIN_PORT=5001
        ports:
          - "5000:5000"
          - "5001:5001"
        volumes:
          - ./docker/wait-for-it.sh:/usr/src/app/wait-for-it.sh
        links:
          - "db:postgres"
        depends_on:
          - db
        entrypoint: ["/bin/bash", "./wait-for-it.sh", "db:6543", "--", "./entrypoint.sh"]

      redis:
        image: bitnami/redis:6.0.6
        environment:
          - ALLOW_EMPTY_PASSWORD=yes

The above adds the Marquez API, database, and Web UI, along with an additional Postgres database for the DAGs used in this example, to Astro's Docker container and configures them to use the scripts in the docker directory you previously downloaded from Marquez.
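
Because several services now publish ports on the host, it can help to confirm that no two of them claim the same one. The sketch below is illustrative only: the port mappings are copied by hand from the docker-compose.override.yml above, the 5678 entry reflects the astro config set postgres.port command from the previous step, and the function names host_ports and conflicts are hypothetical. It does not inspect running containers or ports used by other software on your machine.

```python
from collections import defaultdict

# "host:container" mappings copied from docker-compose.override.yml.
COMPOSE_PORTS = {
    "web": ["3000:3000"],
    "db": ["6543:6543"],
    "example-db": ["7654:5432"],
    "api": ["5000:5000", "5001:5001"],
}

# Host port given to Airflow's own Postgres by
# `astro config set postgres.port 5678`.
EXTRA_HOST_PORTS = {"airflow-postgres": [5678]}


def host_ports(compose_ports, extra=None):
    """Map each published host port to the services that claim it."""
    claimed = defaultdict(list)
    for service, mappings in compose_ports.items():
        for mapping in mappings:
            # The part before the colon is the port opened on the host.
            host, _, _container = mapping.partition(":")
            claimed[int(host)].append(service)
    for service, ports in (extra or {}).items():
        for port in ports:
            claimed[port].append(service)
    return dict(claimed)


def conflicts(claimed):
    """Return only the host ports claimed by more than one service."""
    return {port: names for port, names in claimed.items() if len(names) > 1}


claimed = host_ports(COMPOSE_PORTS, EXTRA_HOST_PORTS)
print("Host ports in use:", sorted(claimed))
print(conflicts(claimed) or "No host port conflicts")
```

With the values used in this tutorial, every host port is claimed by exactly one service, so the conflict map is empty.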