
Capturing dataset statistics in Apache Spark

· 4 min read
Paweł Leszczyński
OpenLineage Committer

OpenLineage events enable the creation of a consistent lineage graph, where dataset vertices are connected through the jobs that read from and write to them. This graph becomes even more valuable when its nodes and edges are enriched with additional metadata for practical use cases. One important aspect to capture is the amount of data processed, as it facilitates applications such as cost estimation and data quality monitoring, among others. In this post, we introduce recent developments in Spark dataset statistics collection and reporting within OpenLineage events. We outline the basic statistics included, as well as the detailed scan and commit reports generated when processing Iceberg datasets.

Data lineage for Apache Flink

· 4 min read
Paweł Leszczyński
OpenLineage Committer

Apache Flink is a powerful stream processing engine widely adopted in the industry. Flink jobs have been able to emit OpenLineage events for two years now. However, a recent joint effort by the Flink and OpenLineage communities is set to bring this integration into a new era.

Simplify OpenLineage Configuration with Dynamic Environment Variables

· 4 min read
Jakub Dardziński
OpenLineage Committer

The OpenLineage community has frequently highlighted the need for a more flexible and scalable approach to configuration management, particularly through environment variables. Many users reported challenges in maintaining separate configuration files across development, testing, and production environments. In response, the 1.23.0 release introduces dynamic environment variables, which address these concerns by simplifying the configuration process, reducing code changes, and improving security across environments.

In this guide, we'll explore how to leverage dynamic environment variables to simplify OpenLineage configuration, enhance flexibility, and improve security.
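As a rough illustration of the idea, the sketch below maps environment variables onto a nested configuration structure. It is not the OpenLineage client's implementation: the function name `config_from_env`, the `OPENLINEAGE__` prefix, and the double-underscore nesting convention are assumptions made for this example and should be checked against the 1.23.0 release documentation.

```python
from typing import Any, Dict, Mapping

# Assumed prefix and separator; illustrative only, not taken from the post.
PREFIX = "OPENLINEAGE__"
SEPARATOR = "__"


def config_from_env(environ: Mapping[str, str], prefix: str = PREFIX) -> Dict[str, Any]:
    """Build a nested config dict from variables that start with `prefix`.

    Each separator-delimited segment after the prefix becomes one level of
    nesting, with keys lower-cased. Values are kept as plain strings.
    """
    config: Dict[str, Any] = {}
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue  # unrelated variable, e.g. PATH
        path = [part.lower() for part in name[len(prefix):].split(SEPARATOR) if part]
        if not path:
            continue  # the variable name was only the prefix
        node = config
        for key in path[:-1]:
            child = node.get(key)
            if not isinstance(child, dict):
                # Create the intermediate level, or replace a conflicting scalar.
                child = {}
                node[key] = child
            node = child
        node[path[-1]] = value
    return config


if __name__ == "__main__":
    example_env = {
        "OPENLINEAGE__TRANSPORT__TYPE": "http",
        "OPENLINEAGE__TRANSPORT__URL": "http://localhost:5000",
        "PATH": "/usr/bin",
    }
    print(config_from_env(example_env))
```

With a scheme like this, the same job code can run unchanged in development, testing, and production, because only the environment differs between deployments.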

Meet up with us in San Francisco & Zoom on September 12th

· 2 min read
Michael Robinson
OpenLineage Community Manager

Note: this event is now hybrid.

Join us on Thursday, September 12th, 2024, from 6:00-9:00 pm PT at the Astronomer offices in San Francisco or on Zoom to learn more about the present and future of OpenLineage. Meet other members of the ecosystem, learn about the project’s goals and fundamental design, and participate in a discussion about the future of the project. Bring your ideas and vision for OpenLineage!

Agenda:

  • Unlocking Data Products with OpenLineage at Astronomer: Julian LaNeve (CTO, Astronomer) and Jason Ma (VP of Product, Astronomer)
  • OpenLineage: From Operators to Hooks by Maciej Obuchowski, Astronomer+GetInData/Xebia
  • Activating Operational Metadata with Airflow, Atlan and OpenLineage by Kacper Muda, GetInData/Xebia
  • Hamilton, a Scaffold for all Your Python Platform Concerns (and a New OpenLineage Producer) by Stefan Krawczyk, CEO of DAGWorks
  • Lightning Talk on New Marquez Features and the Marquez Project Roadmap by Willy Lulciuc, Marquez Lead, and Peter Hicks, Marquez Committer

Thank you to our sponsors:

Astronomer
GetInData/Xebia
LFAI & Data

Join us in Boston on March 19th

· One min read
Michael Robinson
OpenLineage Community Manager

Join us on Tuesday, March 19th, 2024, from 5:30-8:00 pm at the Microsoft New England Conference Center in Boston to learn more about the current state of lineage in general and static lineage support in data catalogs in particular. Bring your ideas and vision for data lineage!

OpenLineage Support for Streaming to Feature at Kafka Summit

· One min read
Michael Robinson
OpenLineage Community Manager

At this year's Kafka Summit in London, two project committers, Paweł Leszczyński and Maciej Obuchowski, will give a talk entitled OpenLineage for Stream Processing on March 19th at 2:00 PM GMT.

As the abstract available on the summit website says, the talk will cover some of the 'many useful features completed or begun' recently related to stream processing, including:

  • a seamless OpenLineage & Apache Flink integration,
  • support for streaming jobs in Marquez,
  • progress on a built-in lineage API within the Flink codebase.

As the abstract goes on to say,

Cross-platform lineage allows for a holistic overview of data flow and its dependencies within organizations, including stream processing. This talk will provide an overview of the most recent developments in the OpenLineage Flink integration and share what’s in store for this important collaboration. This talk is a must-attend for those wishing to stay up-to-date on lineage developments in the stream processing world.

Register and attend this interesting talk if you can. And keep an eye out for an announcement about a recording if and when one becomes available.

Thanks, Maciej and Paweł, for spreading the word about these exciting developments in the project.

Join us in London on January 31st

· One min read
Michael Robinson
OpenLineage Community Manager

Join us on Wednesday, January 31st, 2024, from 6:00-8:00 pm at the Confluent offices in London to learn more about the current state of lineage in general and streaming support in particular. Bring your ideas and vision for OpenLineage!

Meet Us in Warsaw on November 29th!

· One min read
Michael Robinson
OpenLineage Community Manager

Join us on Wednesday, November 29th, 2023, from 17:30-20:30 CET in Warsaw, Poland, to contribute to a discussion of the future of OpenLineage. On the tentative agenda:

  1. Mary Idamkina on OpenLineage in GCP Dataplex
  2. Paweł Leszczyński on recent developments in the Spark Integration
  3. Jakub Dardziński on Extracting lineage from PythonOperator - how come this is possible?
  4. Paweł Leszczyński on How to Become a Spark-OpenLineage Contributor in 5 Steps