About OpenLineage

OpenLineage is an open framework for data lineage collection and analysis. At its core is an extensible specification that systems can use to interoperate with lineage metadata.

Design

OpenLineage is an open standard for lineage metadata collection, designed to record metadata about a job as it executes.

The standard defines a generic model of dataset, job, and run entities, each uniquely identified using consistent naming strategies. The core model is highly extensible via facets. A facet is a piece of user-defined metadata that enriches an entity. We encourage you to familiarize yourself with the core model below:

image
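
As a rough illustration of the model described above, the sketch below represents the three entities as Python dataclasses. The class names, field names, and example values are hypothetical and chosen for readability; this is not the OpenLineage client API, and the spec remains the authority on the actual field names.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict

# A facet map: facet name -> facet payload (hypothetical shape for this sketch).
Facets = Dict[str, Dict[str, Any]]


@dataclass
class Dataset:
    """A dataset, identified by its namespace plus its name."""
    namespace: str
    name: str
    facets: Facets = field(default_factory=dict)


@dataclass
class Job:
    """A job definition, identified by its namespace plus its name."""
    namespace: str
    name: str
    facets: Facets = field(default_factory=dict)


@dataclass
class Run:
    """A single execution of a job, identified by a unique run ID."""
    run_id: str
    facets: Facets = field(default_factory=dict)


# Made-up example values: one job, one run of it, and one dataset
# enriched with a schema-like facet.
job = Job(namespace="my-scheduler", name="daily_orders_etl")
run = Run(run_id=str(uuid.uuid4()))
orders = Dataset(
    namespace="postgres://db.example.com:5432",
    name="public.orders",
    facets={"schema": {"fields": [{"name": "id", "type": "INTEGER"}]}},
)
```

Each entity carries its own facet map, so metadata can be added to a dataset, a job, or a run without changing the core fields.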

How OpenLineage Benefits the Ecosystem

Below, we illustrate the challenges of collecting lineage metadata from multiple sources, schedulers and/or data processing frameworks. We then outline the design benefits of defining an Open Standard for lineage metadata collection.

BEFORE:

image

  • Each project has to build its own custom metadata collection integration, duplicating effort.
  • Integrations are external and can break with new versions of the underlying scheduler and/or data processing framework, requiring projects to ensure backwards compatibility.

WITH OPENLINEAGE:

image

  • Integration efforts are shared across projects.
  • Integrations can be pushed into the underlying scheduler and/or data processing framework, so projects no longer need to play catch-up to stay compatible.

Scope

OpenLineage defines the metadata for running jobs and their corresponding events. A configurable backend allows the user to choose which protocol is used to send the events.
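
To make the idea of a configurable backend concrete, here is a minimal sketch in which the code that builds events is decoupled from the code that delivers them. The `Transport` protocol, both transport classes, and the helper functions are hypothetical names invented for this example. The event dictionary is a simplified approximation of a run event (the real spec requires additional fields), and the producer URL is a placeholder.

```python
import io
import json
import uuid
from datetime import datetime, timezone
from typing import Any, Dict, List, Protocol, TextIO


class Transport(Protocol):
    """Anything that can deliver an event; hypothetical interface for this sketch."""
    def emit(self, event: Dict[str, Any]) -> None: ...


class InMemoryTransport:
    """Collects events in a list, e.g. for tests."""
    def __init__(self) -> None:
        self.events: List[Dict[str, Any]] = []

    def emit(self, event: Dict[str, Any]) -> None:
        self.events.append(event)


class JsonLinesTransport:
    """Writes each event as one JSON line to a text stream (file, stdout, ...)."""
    def __init__(self, stream: TextIO) -> None:
        self.stream = stream

    def emit(self, event: Dict[str, Any]) -> None:
        self.stream.write(json.dumps(event) + "\n")


def make_run_event(event_type: str, run_id: str,
                   job_namespace: str, job_name: str) -> Dict[str, Any]:
    """Build a simplified run event; field names approximate the spec."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [],
        "outputs": [],
        "producer": "https://example.com/hypothetical-producer",  # placeholder
    }


def emit_lifecycle(transport: Transport, job_namespace: str, job_name: str) -> str:
    """Emit a START and a COMPLETE event for one run; return the run ID."""
    run_id = str(uuid.uuid4())
    for event_type in ("START", "COMPLETE"):
        transport.emit(make_run_event(event_type, run_id, job_namespace, job_name))
    return run_id


# The same emitting code works with either backend.
memory = InMemoryTransport()
memory_run_id = emit_lifecycle(memory, "my-scheduler", "daily_orders_etl")

buffer = io.StringIO()
emit_lifecycle(JsonLinesTransport(buffer), "my-scheduler", "daily_orders_etl")
```

Switching the delivery protocol then means passing in a different transport object; the code that describes the job and its run does not change.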

Core model

A facet is an atomic piece of metadata attached to one of the core entities. See the spec for more details.

Spec

The specification is defined using OpenAPI and allows extension through custom facets.
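
As a sketch of what extending an entity with a custom facet might look like, the helper below attaches a facet payload to an entity represented as a plain dictionary. The function name, the `acme_ownership` facet key, and both URLs are hypothetical. The `_producer` and `_schemaURL` fields reflect our understanding of the metadata that facets carry and should be checked against the spec.

```python
import copy
from typing import Any, Dict


def with_custom_facet(entity: Dict[str, Any], key: str, payload: Dict[str, Any],
                      producer: str, schema_url: str) -> Dict[str, Any]:
    """Return a copy of `entity` with one extra facet stored under `key`."""
    enriched = copy.deepcopy(entity)  # leave the caller's entity untouched
    facet = {
        "_producer": producer,      # assumed facet metadata field; verify in the spec
        "_schemaURL": schema_url,   # assumed facet metadata field; verify in the spec
        **payload,
    }
    enriched.setdefault("facets", {})[key] = facet
    return enriched


# Made-up job and a made-up, vendor-prefixed facet describing ownership.
job = {"namespace": "my-scheduler", "name": "daily_orders_etl"}
enriched_job = with_custom_facet(
    job,
    key="acme_ownership",
    payload={"team": "data-platform"},
    producer="https://example.com/acme-lineage",
    schema_url="https://example.com/schemas/AcmeOwnershipJobFacet.json",
)
```

Prefixing the facet key with a vendor-style identifier, as in `acme_ownership`, is one way to avoid collisions between facets defined by different parties.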

Integrations

The OpenLineage repository contains integrations with several systems.