This guide covers how you can quickly get started collecting dataset, job, and run metadata using OpenLineage. We'll first introduce you to OpenLineage's core model, show how to collect run-level metadata as OpenLineage events using Marquez as the HTTP backend, then explore lineage metadata via the Marquez UI.

PREREQUISITES

Before you begin, make sure you have installed Docker, along with git and curl, which the commands in this guide use.

Note: In this guide, we'll be using Marquez as the OpenLineage HTTP backend and running the HTTP server via Docker.

RUN MARQUEZ WITH DOCKER

The easiest way to get up and running with Marquez is Docker. Check out the Marquez source code and run the ./docker/up.sh script:

$ git clone git@github.com:MarquezProject/marquez.git && cd marquez

$ ./docker/up.sh

Tip: Pass the --build flag to the script to build images from source, or --tag X.Y.Z to use a tagged image.

To view the Marquez UI and verify it's running, open http://localhost:3000. The UI enables you to discover dependencies between jobs and the datasets they produce and consume via the lineage graph, view run-level metadata of current and previous job runs, and much more.
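Besides opening the UI, it can help to confirm that the HTTP API itself is reachable before sending events. The following is a minimal sketch using only the Python standard library. It assumes the API listens on http://localhost:5000 (the port used by the curl requests in this guide) and that a GET on /api/v1/namespaces returns 200 on a healthy instance. The function name marquez_api_is_up is purely illustrative.

```python
import urllib.request


def marquez_api_is_up(base_url="http://localhost:5000", timeout=3.0):
    """Return True if a GET on the (assumed) namespaces endpoint answers with 200."""
    url = base_url.rstrip("/") + "/api/v1/namespaces"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError and HTTPError both derive from OSError, so this covers
        # refused connections, timeouts, and non-2xx responses.
        return False
```

Calling marquez_api_is_up() right after ./docker/up.sh may return False for a short while as the containers start, so retrying after a few seconds is reasonable.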

OpenLineage Core Model

Below, we illustrate the challenges of collecting lineage metadata from multiple sources, schedulers and/or data processing frameworks. We then outline the design benefits of defining an Open Standard for lineage metadata collection.

BEFORE:

  • Each project has to instrument its own custom metadata collection integration, therefore duplicating efforts.
  • Integrations are external and can break with new versions of the underlying scheduler and/or data processing framework, requiring projects to ensure backwards compatibility.

WITH OPENLINEAGE:

  • Integration efforts are shared across projects.
  • Integrations can be pushed to the underlying scheduler and/or data processing framework; no need to play catch up and ensure compatibility!

DESIGN:

OpenLineage is an Open Standard for lineage metadata collection designed to record metadata for a job in execution. The standard defines a generic model of dataset, job, and run entities uniquely identified using consistent naming strategies. The core model is highly extensible via facets. A facet is user-defined metadata and enables entity enrichment. We encourage you to familiarize yourself with the core model below:
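To make the entities concrete, here is a small sketch of the core model as plain Python data classes. The class and function names (Dataset, Job, Run, describe_run) are illustrative only and are not the classes of the OpenLineage client library. The sketch shows that jobs and datasets are identified by a namespace and a name, a run by its run ID, and that facets are an open-ended dictionary attached to an entity.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Dataset:
    namespace: str
    name: str
    # Facets are optional metadata keyed by facet name.
    facets: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        body: Dict[str, Any] = {"namespace": self.namespace, "name": self.name}
        if self.facets:
            body["facets"] = self.facets
        return body


@dataclass
class Job:
    namespace: str
    name: str


@dataclass
class Run:
    run_id: str


def describe_run(run: Run, job: Job, inputs: List[Dataset], outputs: List[Dataset]) -> Dict[str, Any]:
    """Combine the three entities into the JSON shape used by the events in this guide."""
    return {
        "run": {"runId": run.run_id},
        "job": {"namespace": job.namespace, "name": job.name},
        "inputs": [d.to_dict() for d in inputs],
        "outputs": [d.to_dict() for d in outputs],
    }
```

Keeping facets as a plain dictionary mirrors the extensibility idea: new metadata can be attached without changing the entity definitions.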

BENEFITS:

  • An open standard with a specification for collecting lineage metadata.
  • Focuses on job-level execution.
    • Runs
    • Datasets
  • Event-based metadata collection.
  • Extensible model via facets.

Collect Run-Level Metadata using Marquez

Marquez is an LF AI & DATA incubation project to collect, aggregate, and visualize a data ecosystem’s metadata. Marquez is the reference implementation of the OpenLineage standard.

In this example, we show how you can collect dataset and job metadata using Marquez. Using the LineageAPI, metadata will be collected as OpenLineage events using the run ID d46e465b-d358-4d32-83d4-df660ff614dd. The run ID will enable the tracking of run-level metadata over time for the job my-job. So, let's get started!

Note: The example shows how to collect metadata via direct HTTP API calls using curl. However, you can also get started using our client library for Java or Python.
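If you would rather script the calls than type curl commands, the same HTTP requests can be made from Python. The sketch below deliberately uses only the standard library, not the OpenLineage Python client, so none of these helper names (new_run_id, build_lineage_request, send_event) come from that library. It assumes the lineage endpoint used throughout this guide, http://localhost:5000/api/v1/lineage.

```python
import json
import urllib.request
import uuid

LINEAGE_URL = "http://localhost:5000/api/v1/lineage"


def new_run_id():
    """Generate a fresh UUID string to identify a run."""
    return str(uuid.uuid4())


def build_lineage_request(event, url=LINEAGE_URL):
    """Wrap an event dictionary in a JSON POST request without sending it."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_event(event, url=LINEAGE_URL):
    """POST the event and return the HTTP status code (raises on HTTP errors)."""
    with urllib.request.urlopen(build_lineage_request(event, url), timeout=10) as response:
        return response.status
```

Separating request construction from sending makes it easy to inspect the payload before anything reaches the backend.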

STEP 1: START A RUN

Use d46e465b-d358-4d32-83d4-df660ff614dd to start the run for my-job with my-input as the input dataset:

REQUEST
$ curl -X POST http://localhost:5000/api/v1/lineage \
  -H 'Content-Type: application/json' \
  -d '{
        "eventType": "START",
        "eventTime": "2020-12-28T19:52:00.001+10:00",
        "run": {
          "runId": "d46e465b-d358-4d32-83d4-df660ff614dd"
        },
        "job": {
          "namespace": "my-namespace",
          "name": "my-job"
        },
        "inputs": [{
          "namespace": "my-namespace",
          "name": "my-input"
        }],  
        "producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client"
      }'
RESPONSE

201 CREATED
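The START payload above can also be assembled programmatically. The following sketch builds the same JSON structure in Python and formats eventTime as an ISO 8601 timestamp with millisecond precision and a UTC offset, matching the example. The helper names are hypothetical, and the sketch assumes every input dataset lives in the same namespace as the job, as it does in this guide.

```python
from datetime import datetime, timezone

PRODUCER = "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client"


def format_event_time(moment=None):
    """Format a timezone-aware datetime like 2020-12-28T19:52:00.001+10:00."""
    if moment is None:
        moment = datetime.now(timezone.utc)
    if moment.tzinfo is None:
        raise ValueError("eventTime needs a timezone-aware datetime")
    return moment.isoformat(timespec="milliseconds")


def start_event(run_id, namespace, job_name, input_names, moment=None):
    """Build a START event with the same fields as the curl request above."""
    return {
        "eventType": "START",
        "eventTime": format_event_time(moment),
        "run": {"runId": run_id},
        "job": {"namespace": namespace, "name": job_name},
        # Assumption: inputs share the job's namespace.
        "inputs": [{"namespace": namespace, "name": n} for n in input_names],
        "producer": PRODUCER,
    }
```

Serializing the result with json.dumps yields a body that can be posted to the lineage endpoint in place of the inline JSON.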

STEP 2: COMPLETE A RUN

Use d46e465b-d358-4d32-83d4-df660ff614dd to complete the run for my-job with my-output as the output dataset. We also specify the schema facet to collect the schema for my-output before marking the run as completed. Note that you don't have to specify the input dataset my-input again, since it has already been associated with the run ID:

REQUEST
$ curl -X POST http://localhost:5000/api/v1/lineage \
  -H 'Content-Type: application/json' \
  -d '{
        "eventType": "COMPLETE",
        "eventTime": "2020-12-28T20:52:00.001+10:00",
        "run": {
          "runId": "d46e465b-d358-4d32-83d4-df660ff614dd"
        },
        "job": {
          "namespace": "my-namespace",
          "name": "my-job"
        },
        "outputs": [{
          "namespace": "my-namespace",
          "name": "my-output",
          "facets": {
            "schema": {
              "_producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client",
              "_schemaURL": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/spec/OpenLineage.json#/definitions/SchemaDatasetFacet",
              "fields": [
                { "name": "a", "type": "VARCHAR"},
                { "name": "b", "type": "VARCHAR"}
              ]
            }
          }
        }],     
        "producer": "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client"
      }'
RESPONSE

201 CREATED
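The schema facet is the part of this request most likely to vary between datasets, so it is worth isolating. The sketch below builds the facet from a list of (name, type) pairs and places it in a COMPLETE event with the same shape as the request above. The function names are illustrative, and the _producer and _schemaURL values are simply copied from this guide.

```python
PRODUCER = "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/client"
SCHEMA_FACET_URL = (
    "https://github.com/OpenLineage/OpenLineage/blob/v1-0-0/"
    "spec/OpenLineage.json#/definitions/SchemaDatasetFacet"
)


def schema_facet(columns):
    """Turn [(name, type), ...] into a schema facet dictionary."""
    return {
        "_producer": PRODUCER,
        "_schemaURL": SCHEMA_FACET_URL,
        "fields": [{"name": name, "type": col_type} for name, col_type in columns],
    }


def complete_event(run_id, namespace, job_name, output_name, columns, event_time):
    """Build a COMPLETE event whose single output carries a schema facet."""
    return {
        "eventType": "COMPLETE",
        "eventTime": event_time,
        "run": {"runId": run_id},
        "job": {"namespace": namespace, "name": job_name},
        "outputs": [{
            "namespace": namespace,
            "name": output_name,
            "facets": {"schema": schema_facet(columns)},
        }],
        "producer": PRODUCER,
    }
```

For the example in this step, columns would be [("a", "VARCHAR"), ("b", "VARCHAR")] and event_time the string "2020-12-28T20:52:00.001+10:00".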

View Collected Lineage Metadata via Marquez UI

SEARCH JOB METADATA

To view lineage metadata collected by Marquez, browse to the UI by visiting http://localhost:3000. Then, use the search bar in the upper-right corner of the page to search for the job my-job. To view lineage metadata for my-job, click on the job in the drop-down list:

image

VIEW JOB METADATA

You should see the job namespace and name, my-input as an input dataset and my-output as an output dataset in the lineage graph, and the job run marked as COMPLETED:

image

VIEW OUTPUT DATASET METADATA

Finally, click on the output dataset my-output for my-job. You should see the dataset name, schema, and description:

image
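The same job and dataset metadata can likely be retrieved without the UI. The sketch below assumes the Marquez REST API exposes GET /api/v1/namespaces/{namespace}/jobs/{job} and GET /api/v1/namespaces/{namespace}/datasets/{dataset} on port 5000; verify these paths against the API reference for your Marquez version. All function names are illustrative.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "http://localhost:5000"


def job_url(namespace, job, base_url=BASE_URL):
    """Build the (assumed) REST path for a job, percent-encoding each segment."""
    return f"{base_url}/api/v1/namespaces/{quote(namespace, safe='')}/jobs/{quote(job, safe='')}"


def dataset_url(namespace, dataset, base_url=BASE_URL):
    """Build the (assumed) REST path for a dataset, percent-encoding each segment."""
    return f"{base_url}/api/v1/namespaces/{quote(namespace, safe='')}/datasets/{quote(dataset, safe='')}"


def fetch_json(url, timeout=10):
    """GET a URL and decode the JSON body (raises on HTTP or connection errors)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, fetch_json(dataset_url("my-namespace", "my-output")) should return the dataset's metadata as a dictionary if the endpoint behaves as assumed. Percent-encoding matters because namespaces and dataset names can contain characters such as slashes.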

Summary

In this simple example, we showed you how to use Marquez to collect dataset and job metadata with OpenLineage. We also walked you through the set of HTTP API calls to successfully mark a run as complete and view the lineage metadata collected with Marquez.

Next Steps

Feedback

What did you think of this guide? We would love to hear feedback, and we can be found on the OpenLineage Slack.