Expanding the Horizon of OpenLineage: Extracting Lineage from Code with Foundational

· 6 min read
Barak Forgoun
Guest Blogger & Founder of Foundational

Data lineage is the cornerstone of modern data governance, providing transparency, traceability, and accountability across the data lifecycle. It ensures that organizations can track how data flows through systems, transforms between processes, and ultimately impacts downstream analytics and decision making. For data leaders, lineage is critical to maintaining trust in data by ensuring its accuracy, managing risks, and complying with regulatory requirements.

Beyond its role in compliance and security, lineage is essential for operational efficiency. It allows teams to perform impact analyses before making changes, reducing the risk of breaking pipelines or disrupting critical workflows. When combined with metadata and integrated into governance tools, lineage offers a powerful way to visualize data dependencies, troubleshoot issues, and ensure stewardship across the organization.

OpenLineage, the leading standard for capturing and sharing lineage, has transformed how organizations manage runtime-based lineage emitted from tools like Airflow and Spark. Yet, this focus on runtime lineage leaves a gap when it comes to static code, which governs rarely executed and ad hoc pipelines. Addressing this gap is essential to achieving a comprehensive view of lineage across the data ecosystem.

This blog explores why lineage extracted from code is indispensable, how Foundational extracts lineage directly from code, the challenges of integrating it into OpenLineage, and how a community-driven approach can address these challenges to provide a holistic view of data lineage.

The Evolution of Lineage: From Runtime to Code

OpenLineage has excelled at capturing lineage during pipeline execution. Whether tracking Spark transformations or Airflow DAGs, its runtime-centric approach has provided unmatched visibility into active pipelines. However, relying solely on runtime lineage creates blind spots, particularly for rarely executed or complex code paths.

Examples of Missing Runtime Lineage:

  • Rarely Executed Pipelines: For example, restore jobs triggered only during incidents or annual computations might not emit runtime lineage frequently enough to provide actionable insights.
  • Rarely Triggered Code Paths: Some pipelines run frequently, but certain code paths are triggered only under rare conditions. For example, dbt incremental models typically run as incremental updates, and full refreshes are seldom executed, leaving their lineage underrepresented.
  • Rarely Viewed Dashboards: Dashboards that are infrequently accessed might not emit queries often enough to appear in runtime lineage. For instance, renaming a column might not show any immediate impact, but it could break an obscure dashboard used only once a quarter or year.

In these scenarios, runtime lineage cannot provide the full picture. To fill this gap, lineage extracted directly from code becomes critical. Together, runtime and code-based lineage create a comprehensive view, ensuring organizations are fully informed regardless of pipeline execution frequency.

Why Code-Based Lineage Matters

Extracting lineage from code is a non-trivial process because it requires static analysis to understand the data flow and dependencies within the codebase. This involves examining how data is transformed and moved across various parts of the code. Foundational provides a solution for extracting lineage from code for platforms such as dbt and Spark, as well as ORMs such as SQLAlchemy, by analyzing the code to identify how data is transformed and moved between different components.
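To make the idea of static analysis concrete, here is a deliberately tiny sketch: a regex-based extractor that pulls input and output tables from a simple `INSERT INTO ... SELECT` statement. Real extractors (including Foundational's) parse full ASTs and handle far more constructs; the table names here are purely illustrative.

```python
import re

def extract_sql_lineage(sql: str) -> dict:
    """Toy static analysis: find the target of an INSERT and its source tables.

    Only handles simple INSERT INTO ... SELECT ... FROM/JOIN statements;
    a production extractor would parse the SQL into an AST instead.
    """
    target = re.search(r"INSERT\s+INTO\s+([\w.]+)", sql, re.IGNORECASE)
    sources = re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, re.IGNORECASE)
    return {
        "inputs": sorted(set(sources)),
        "outputs": [target.group(1)] if target else [],
    }

lineage = extract_sql_lineage(
    "INSERT INTO analytics.daily_orders "
    "SELECT o.id, c.region FROM raw.orders o "
    "JOIN raw.customers c ON o.cid = c.id"
)
print(lineage)
# {'inputs': ['raw.customers', 'raw.orders'], 'outputs': ['analytics.daily_orders']}
```

The same principle, inspecting code without running it, is what allows lineage to be reported for pipelines that rarely or never execute.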

Code-based lineage, like that implemented by Foundational, complements runtime lineage by providing coverage for potential lineage—a view of what could happen when the code is executed. This distinction is invaluable for several high-stakes use cases:

  1. Regulatory Compliance: In industries like finance and healthcare, compliance requires proof that sensitive data (e.g., PII) is handled appropriately, even in rarely executed pipelines. Lineage extracted from code ensures there's no pipeline, even an obscure or rarely run one, that inadvertently violates data regulations.
  2. Data Security: Protecting sensitive customer data means verifying that no pipeline code inadvertently writes private information to publicly accessible locations. Code-based lineage identifies potential security risks, even in edge-case scenarios like a backup restore job copying sensitive data to public storage.
  3. Refactoring with Confidence: When refactoring tables, columns, or schemas, understanding all dependencies, including those in rarely executed code, prevents unexpected breakages. Code-based lineage provides the safety net engineers need to proceed with confidence.

Modeling Code-Based Lineage in OpenLineage

OpenLineage’s current model revolves around the concept of a Job, representing a runtime activity that transforms one Entity (e.g., a table or file) into another. But what happens when there is no true runtime job, when lineage comes from static code analysis instead? In July 2023, OpenLineage introduced the concept of Static Lineage, which makes it possible to model lineage that is not emitted at runtime but rather statically, as “design” lineage. We leverage this to model code-based lineage: we use the Job object to represent the code location from which the lineage was extracted. This aligns well with the OpenLineage model, since a Job defines a transformation while a Run is an instance of that job, so in code-based lineage we can use the Job object without a Run object. We can also use Job facets suited to code-based lineage, such as SourceCodeLocationJobFacet, to represent additional information such as the specific code version identifier (e.g., a commit hash) and the repository. So, for example, for a piece of code that copies data from Table1 to Table2, the lineage would be:

Table1 → <Job (points to `source/foo.py`)> → Table2
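A static (runless) event along these lines could be sketched as the following hand-built JSON. The namespaces, repository URLs, and commit hash are illustrative assumptions, not values from the spec; the key point is that the event carries a job and datasets but no run.

```python
import json

# Sketch of a static-lineage event: a Job (the code location) linking an
# input dataset to an output dataset, with no Run attached. All names,
# URLs, and the commit hash below are hypothetical.
event = {
    "job": {
        "namespace": "my-repo",
        "name": "source/foo.py",
        "facets": {
            "sourceCodeLocation": {
                "type": "git",
                "url": "https://example.com/my-repo/blob/abc123/source/foo.py",
                "version": "abc123",  # commit hash identifying the code version
            }
        },
    },
    "inputs": [{"namespace": "warehouse", "name": "Table1"}],
    "outputs": [{"namespace": "warehouse", "name": "Table2"}],
    "producer": "https://example.com/code-lineage-extractor",
}

print(json.dumps(event, indent=2))
```

Because the Job points at a file path and a commit, a consumer can trace any edge in the graph back to the exact code revision that produced it.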

This approach maintains compatibility with the existing OpenLineage model while providing a path forward for integrating code-based lineage.

Collaborating with the OpenLineage Community

While modeling via the Job object is a functional starting point, it leaves room for improvement. For example:

  • Creating a new facet for code-based lineage that captures the specific code lines responsible for a given transformation, pointing users to the exact place in the code rather than just to a source file.
  • Defining code annotations that provide hints about lineage, allowing authors of legacy code to benefit from code-based lineage by embedding hints in the code. The rationale is that while popular frameworks like dbt, Airflow, and Spark will likely be supported, legacy or home-grown systems probably won’t be.
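To illustrate the second idea, here is one possible shape such hints could take. No annotation syntax like this is standardized yet; the `# lineage-input:`/`# lineage-output:` comments and the table names are hypothetical, and the scanner simply harvests them without executing the legacy code.

```python
import re

# Hypothetical legacy script whose lineage cannot be inferred automatically,
# annotated with (equally hypothetical) comment-based hints.
LEGACY_SCRIPT = """
# lineage-input: warehouse.raw_events
# lineage-output: warehouse.daily_summary
run_legacy_aggregation()
"""

def harvest_hints(source: str) -> dict:
    """Collect lineage hints from annotation comments, without running the code."""
    hints = {"inputs": [], "outputs": []}
    for direction, table in re.findall(
        r"#\s*lineage-(input|output):\s*([\w.]+)", source
    ):
        hints[direction + "s"].append(table)
    return hints

print(harvest_hints(LEGACY_SCRIPT))
# {'inputs': ['warehouse.raw_events'], 'outputs': ['warehouse.daily_summary']}
```

An agreed-upon annotation convention would let any extractor, not just one vendor's, pick up lineage from systems that no parser understands natively.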

Foundational is excited to collaborate with the OpenLineage community to refine this model and develop a standard that unites code-based and runtime lineage into a cohesive framework.

Conclusion: A Holistic View of Data Lineage

Data lineage is no longer just a nice-to-have; it’s a requirement for ensuring trust, compliance, and security in the modern data stack. OpenLineage has laid the foundation for open, runtime-based lineage, but it’s time to expand the scope.

By integrating code-based lineage, organizations can achieve full coverage of their data pipelines, capturing both what is happening and what could happen. This comprehensive approach unlocks new possibilities for compliance, security, and data engineering efficiency.

Foundational is already helping organizations extract lineage directly from their codebases, and we look forward to collaborating with the OpenLineage community to ensure this new frontier of lineage is modeled effectively. Together, we can build a lineage ecosystem that leaves no pipeline—or dependency—untracked.

Capturing dataset statistics in Apache Spark

· 4 min read
Paweł Leszczyński
OpenLineage Committer

OpenLineage events enable the creation of a consistent lineage graph, where dataset vertices are connected through the jobs that read from and write to them. This graph becomes even more valuable when its nodes and edges are enriched with additional metadata for practical use cases. One important aspect to capture is the amount of data processed, as it facilitates applications such as cost estimation and data quality monitoring, among others. In this post, we introduce recent developments in Spark dataset statistics collection and reporting within OpenLineage events. We outline the basic statistics included, as well as the detailed scan and commit reports generated when processing Iceberg datasets.
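As a rough sketch of the kind of enrichment described above, an output dataset in an event can carry a statistics facet with row counts and bytes written; a downstream consumer can then aggregate these for cost estimation. The dataset name and numbers below are illustrative, and the facet shape is a simplified assumption rather than the full schema.

```python
# Illustrative output dataset carrying simple write statistics
# (rowCount / size in bytes). Names and values are made up.
output_dataset = {
    "namespace": "warehouse",
    "name": "daily_orders",
    "outputFacets": {
        "outputStatistics": {
            "rowCount": 150_000,
            "size": 12_582_912,  # bytes written by this job run
        }
    },
}

# A cost-estimation consumer might sum bytes written across runs:
bytes_written = output_dataset["outputFacets"]["outputStatistics"]["size"]
print(f"{bytes_written / 1024 / 1024:.0f} MiB written")
# 12 MiB written
```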

Data lineage for Apache Flink

· 4 min read
Paweł Leszczyński
OpenLineage Committer

Apache Flink is a powerful stream processing engine widely adopted in the industry. Flink jobs have been capable of emitting OpenLineage events for two years already. However, a recent joint effort of the Flink and OpenLineage communities will bring this integration into a new era.

Simplify OpenLineage Configuration with Dynamic Environment Variables

· 4 min read
Jakub Dardziński
OpenLineage Committer

The OpenLineage community has frequently highlighted the need for a more flexible and scalable approach to configuration management, particularly through environment variables. Many users reported challenges in maintaining separate configuration files across development, testing, and production environments. In response, the 1.23.0 release introduces dynamic environment variables—addressing these concerns head-on by simplifying the configuration process, reducing code changes, and improving security across different environments.

In this guide, we'll explore how to leverage dynamic environment variables to simplify OpenLineage configuration, enhance flexibility, and improve security.
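The core mechanism is a naming convention: variables prefixed with `OPENLINEAGE__`, with double underscores separating nesting levels, map onto the nested configuration that would otherwise live in a config file. The sketch below shows that mapping in miniature; the transport values are illustrative, and the parsing logic is a simplified reconstruction of the convention, not the library's actual implementation.

```python
# Illustrative environment, as it might be set in a shell or container spec.
env = {
    "OPENLINEAGE__TRANSPORT__TYPE": "http",
    "OPENLINEAGE__TRANSPORT__URL": "https://lineage.example.com",
}

def env_to_config(env: dict) -> dict:
    """Fold OPENLINEAGE__-prefixed variables into a nested config dict.

    Double underscores delimit nesting levels; keys are lowercased.
    Simplified sketch of the convention, not the real parser.
    """
    config: dict = {}
    for key, value in env.items():
        if not key.startswith("OPENLINEAGE__"):
            continue
        *parents, leaf = key[len("OPENLINEAGE__"):].lower().split("__")
        node = config
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return config

print(env_to_config(env))
# {'transport': {'type': 'http', 'url': 'https://lineage.example.com'}}
```

Because the same variables can be injected per environment, the dev, test, and prod deployments can share identical code and differ only in their environment.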

Meet up with us in San Francisco & Zoom on September 12th

· 2 min read
Michael Robinson
OpenLineage Community Manager

Note: this event is now hybrid.

Join us on Thursday, September 12th, 2024, from 6:00-9:00 pm PT at the Astronomer offices in San Francisco or on Zoom to learn more about the present and future of OpenLineage. Meet other members of the ecosystem, learn about the project’s goals and fundamental design, and participate in a discussion about the future of the project. Bring your ideas and vision for OpenLineage!

Agenda:

  • Unlocking Data Products with OpenLineage at Astronomer: Julian LaNeve (CTO, Astronomer) and Jason Ma (VP of Product, Astronomer)
  • OpenLineage: From Operators to Hooks by Maciej Obuchowski, Astronomer+GetInData/Xebia
  • Activating Operational Metadata with Airflow, Atlan and OpenLineage by Kacper Muda, GetInData/Xebia
  • Hamilton, a Scaffold for all Your Python Platform Concerns (and a New OpenLineage Producer) by Stefan Krawczyk, CEO of DAGWorks
  • Lightning Talk on New Marquez Features and the Marquez Project Roadmap by Willy Lulciuc, Marquez Lead, and Peter Hicks, Marquez Committer

Thank you to our sponsors:

Astronomer
GetInData/Xebia
LFAI & Data

Join us in Boston on March 19th

· One min read
Michael Robinson
OpenLineage Community Manager

Join us on Tuesday, March 19th, 2024, from 5:30-8:00 pm at the Microsoft New England Conference Center in Boston to learn more about the current state of lineage in general and static lineage support in data catalogs in particular. Bring your ideas and vision for data lineage!

OpenLineage Support for Streaming to Feature at Kafka Summit

· One min read
Michael Robinson
OpenLineage Community Manager

At this year's Kafka Summit in London, two project committers, Paweł Leszczyński and Maciej Obuchowski, will give a talk entitled OpenLineage for Stream Processing on March 19th at 2:00 PM GMT.

As the abstract available on the summit website says, the talk will cover some of the 'many useful features completed or begun' recently related to stream processing, including:

  • a seamless OpenLineage & Apache Flink integration,
  • support for streaming jobs in Marquez,
  • progress on a built-in lineage API within the Flink codebase.

As the abstract goes on to say,

Cross-platform lineage allows for a holistic overview of data flow and its dependencies within organizations, including stream processing. This talk will provide an overview of the most recent developments in the OpenLineage Flink integration and share what’s in store for this important collaboration. This talk is a must-attend for those wishing to stay up-to-date on lineage developments in the stream processing world.

Register and attend this interesting talk if you can. And keep an eye out for an announcement about a recording if and when one becomes available.

Thanks, Maciej and Paweł, for spreading the word about these exciting developments in the project.

Join us in London on January 31st

· One min read
Michael Robinson
OpenLineage Community Manager

Join us on Wednesday, January 31st, 2024, from 6:00-8:00 pm at the Confluent offices in London to learn more about the current state of lineage in general and streaming support in particular. Bring your ideas and vision for OpenLineage!