What is dbt?

dbt (data build tool) is a powerful transformation engine. It operates on data already within a warehouse, making it easy for data engineers to build complex pipelines from the comfort of their laptops. While it doesn’t perform extraction and loading of data, it’s extremely powerful at transformations.

To learn more about dbt, visit the documentation site or run through the getting started tutorial.

How does dbt work with OpenLineage?

Fortunately, dbt already collects a lot of the data required to create and emit OpenLineage events. When it runs, it creates a target/manifest.json file containing information about jobs and the datasets they affect, and a target/run_results.json file containing information about the run-cycle. These files can be used to trace lineage and job performance. In addition, by using the create catalog command, a user can instruct dbt to create a target/catalog.json file containing information about dataset schemas.

These files contain everything needed to trace lineage. However, the target/manifest.json and target/run_results.json files are only populated with comprehensive metadata after completion of a run-cycle.

This integration is implemented as a wrapper script, dbt-ol, that calls dbt and, after the run has completed, collects information from the three json files and calls the OpenLineage API accordingly. For most users, enabling OpenLineage metadata collection can be accomplished by simply substituting dbt-ol for dbt when performing a run.

Preparing a dbt project for OpenLineage

First, we need to install the integration:

pip3 install openlineage-dbt

Next, we specify where we want dbt to send OpenLineage events by setting the OPENLINEAGE_URL environment variable. For example, to send OpenLineage events to a local instance of Marquez, use:

OPENLINEAGE_URL=http://localhost:5000

Finally, we need to specify a namespace where the lineage events will be stored. For example, to use the namespace "dev":

OPENLINEAGE_NAMESPACE=dev

Running dbt with OpenLineage

To run your dbt project with OpenLineage collection, simply replace dbt with dbt-ol:

dbt-ol run

The dbt-ol wrapper supports all of the standard dbt subcommands, and is safe to use as a substitutuon (i.e., in an alias). Once the run has completed, you will see output containing the number of events sent via the OpenLineage API:

Completed successfully

Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
Emitted 4 openlineage events

Where can I learn more?

Feedback

What did you think of this guide? You can reach out to us on slack and leave us feedback!