OpenLineage is an open framework for data lineage collection and analysis. At its core is an extensible specification that systems can use to interoperate with lineage metadata.
OpenLineage is an Open Standard for lineage metadata collection designed to record metadata for a job in execution.
The standard defines a generic model of dataset, job, and run entities, each uniquely identified using a consistent naming strategy. The core model is highly extensible via facets: a facet is user-defined metadata that enriches an entity. We encourage you to familiarize yourself with the core model.
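To make the core model concrete, here is a minimal sketch of a single run event built from plain Python dictionaries. The top-level field names (`eventType`, `eventTime`, `run`, `job`, `inputs`, `outputs`, `producer`, `schemaURL`) follow the OpenLineage spec; the producer URI, the spec version in `SCHEMA_URL`, the namespaces, and the job and dataset names are assumptions made up for illustration, and the helpers `dataset` and `run_event` are hypothetical, not part of any OpenLineage client.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import List

# Hypothetical URI identifying the code that produces the event.
PRODUCER = "https://example.com/my-scheduler"
# Assumed spec version; check the published spec for the current one.
SCHEMA_URL = "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent"


def dataset(namespace: str, name: str) -> dict:
    """A dataset is identified by its (namespace, name) pair."""
    return {"namespace": namespace, "name": name, "facets": {}}


def run_event(event_type: str, run_id: str, job_namespace: str,
              job_name: str, inputs: List[dict], outputs: List[dict]) -> dict:
    """Assemble one event tying a run to its job and datasets."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        # A run is one execution of a job, identified by a UUID.
        "run": {"runId": run_id, "facets": {}},
        # A job is identified by its (namespace, name) pair.
        "job": {"namespace": job_namespace, "name": job_name, "facets": {}},
        "inputs": inputs,
        "outputs": outputs,
        "producer": PRODUCER,
        "schemaURL": SCHEMA_URL,
    }


event = run_event(
    "COMPLETE",
    str(uuid.uuid4()),
    "my_scheduler",
    "daily_orders",
    inputs=[dataset("postgres://db.example.com:5432", "shop.public.orders")],
    outputs=[dataset("s3://warehouse", "reports/daily_orders")],
)
print(json.dumps(event, indent=2))
```

The `facets` maps are left empty here; they are where additional metadata would be attached to each entity.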
How OpenLineage Benefits the Ecosystem
Below, we illustrate the challenges of collecting lineage metadata from multiple sources, schedulers and/or data processing frameworks. We then outline the design benefits of defining an Open Standard for lineage metadata collection.
Without an open standard:
- Each project has to instrument its own custom metadata collection integration, therefore duplicating efforts.
- Integrations are external and can break with new versions of the underlying scheduler and/or data processing framework, requiring projects to ensure backwards compatibility.
With OpenLineage:
- Integration efforts are shared across projects.
- Integrations can be pushed to the underlying scheduler and/or data processing framework; no longer does one need to play catch up and ensure compatibility!
OpenLineage defines the metadata for running jobs and their corresponding events. A configurable backend lets the user choose which protocol is used to send the events.
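The idea of a configurable backend can be sketched as follows. This is an illustration only, not the API of any OpenLineage client: the classes `ConsoleTransport` and `InMemoryTransport`, the `TRANSPORTS` registry, the `transport_from_config` function, and the `{"type": ...}` configuration shape are all hypothetical. The sketch emits a `START` and a `COMPLETE` event that share one `runId`, which is how events for the same run are correlated.

```python
import json
import uuid
from datetime import datetime, timezone


class ConsoleTransport:
    """Hypothetical backend that writes each event to stdout as JSON."""

    def emit(self, event: dict) -> None:
        print(json.dumps(event))


class InMemoryTransport:
    """Hypothetical backend that keeps events in a list (useful in tests)."""

    def __init__(self) -> None:
        self.events = []

    def emit(self, event: dict) -> None:
        self.events.append(event)


# Registry mapping a configured name to a transport class.
TRANSPORTS = {"console": ConsoleTransport, "memory": InMemoryTransport}


def transport_from_config(config: dict):
    """Pick a backend from configuration; defaults to the console."""
    kind = config.get("type", "console")
    if kind not in TRANSPORTS:
        raise ValueError(f"unknown transport type: {kind!r}")
    return TRANSPORTS[kind]()


def lifecycle_event(event_type: str, run_id: str) -> dict:
    """Build a pared-down event; a real one also carries producer and schemaURL."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": "my_scheduler", "name": "daily_orders"},
    }


transport = transport_from_config({"type": "memory"})
run_id = str(uuid.uuid4())
transport.emit(lifecycle_event("START", run_id))     # the run begins
transport.emit(lifecycle_event("COMPLETE", run_id))  # the same run finishes
```

Because the emitting code only depends on an `emit` method, switching the protocol is a configuration change rather than a code change.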
A facet is an atomic piece of metadata attached to one of the core entities. See the spec for more details.
The specification is defined using OpenAPI and allows extension through custom facets.
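As a rough sketch of what extending an entity with a custom facet might look like, the snippet below attaches one to a job. The `_producer` and `_schemaURL` fields are the base fields the spec gives every facet; everything else is an assumption for illustration: the `acme` prefix, the `acme_ownership` key, the `team` and `oncall` fields, both URLs, and the `custom_facet` helper are hypothetical. Consult the spec for the exact naming rules for custom facets.

```python
import json


def custom_facet(producer: str, schema_url: str, **fields) -> dict:
    """Wrap user-defined fields with the base fields every facet carries."""
    return {"_producer": producer, "_schemaURL": schema_url, **fields}


# A job entity with no facets yet.
job = {"namespace": "my_scheduler", "name": "daily_orders", "facets": {}}

# Attach a custom facet under a prefixed key so it cannot collide with
# facets defined by the standard. Prefix, key, and fields are hypothetical.
job["facets"]["acme_ownership"] = custom_facet(
    producer="https://example.com/my-scheduler",
    schema_url="https://example.com/schemas/AcmeOwnershipJobFacet.json",
    team="analytics",
    oncall="data-oncall@example.com",
)

print(json.dumps(job, indent=2))
```

Each facet is self-describing: `_schemaURL` points a consumer at the schema for the facet, so backends that do not recognize it can ignore it safely.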
The OpenLineage repository contains integrations with several systems.