For anyone watching the space, the acceleration of the data revolution over the last few years has been very exciting. What started as experimental deployments of “big data” projects back in the early days of Hadoop has now morphed into full production, mission-critical deployments of a whole ecosystem of new data tools – not just in leading edge tech companies but also, increasingly, across every industry.

However, as data technologies conquer the world, the stakes keep getting higher.

In particular, it is becoming critically important that data always be available when needed, up-to-date, and correct. In other words, data needs to be trusted to power mission-critical activities.

Unfortunately, the growing importance of data technologies is also accompanied by a corresponding increase in overall complexity, associated with the following trends:

  • the number of tools and platforms in the ecosystem has exploded (see Matt Turck's 2020 data landscape)
  • data access across the enterprise is increasingly democratized, bringing in many more users
  • companies are evolving away from a centralized data team to distributed data teams better aligned with the various functions (sales & marketing, operations, finance, ...), spread throughout the enterprise
  • data usage has evolved from primarily analytical use cases to a mix of analytical and operational use cases where data and AI are an integral part of the product or operation of the business
  • data is increasingly funneled into AI models, with increased sensitivity around data quality, fairness and transparency

In conversations with practitioners who operate these data ecosystems on a daily basis, an obvious tension emerges between the growing importance of data technologies, on the one hand, and the tools available to manage them as the mission-critical systems they are becoming, on the other hand – resulting in many inefficiencies, an inability to provide strong guarantees, and thus a lack of trust in the data being used.

"Data lineage needs to follow a standard agreed upon by contributors to the open source community to guarantee the compatibility and consistency of the metadata produced by their respective solutions." – Julien Le Dem, CTO and Co-Founder of Datakin

As a result, there is an obvious, and still unmet, need for an end-to-end management layer for data that enables the smooth operation of such a complex ecosystem and collaboration between teams producing and consuming data.

The Importance of Data Lineage

In order to fulfill the true potential of this ongoing data revolution, our industry needs to make strong investments in such a management layer. It will need to include many capabilities, including:

  • data catalogs to help inventory and facilitate the discovery and usage of datasets
  • end-to-end operational tools to provide strong guarantees around data availability and quality
  • access control to support data privacy needs
  • governance and compliance solutions

But it is also becoming increasingly clear that a key foundation to all these capabilities is strong data lineage, i.e. understanding how data flows across the whole ecosystem: Who produces data? How does it get transformed? Who is using it? Data lineage is the backbone of DataOps, providing visibility into the interaction of systems and datasets across the journey of data within an organization.
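As an illustration, a lineage graph can be modeled as datasets connected by the jobs that read and write them, which turns questions like "who produces this dataset?" or "what does this report depend on?" into simple graph traversals. The sketch below uses hypothetical job and dataset names, not any particular tool's API:

```python
# Hypothetical lineage graph: each job reads input datasets and writes outputs.
jobs = {
    "ingest_orders":  {"inputs": [],                 "outputs": ["raw.orders"]},
    "clean_orders":   {"inputs": ["raw.orders"],     "outputs": ["staging.orders"]},
    "revenue_report": {"inputs": ["staging.orders"], "outputs": ["analytics.revenue"]},
}

# Index answering "who produces this dataset?"
producer = {ds: job for job, io in jobs.items() for ds in io["outputs"]}

def upstream(dataset):
    """Walk the graph backwards to find every dataset this one depends on."""
    job = producer.get(dataset)
    if job is None:          # source dataset with no known producer
        return set()
    deps = set()
    for parent in jobs[job]["inputs"]:
        deps.add(parent)
        deps |= upstream(parent)
    return deps

print(upstream("analytics.revenue"))  # {'staging.orders', 'raw.orders'}
```

The same structure, traversed forwards instead of backwards, answers the impact-analysis question: which downstream reports break if a given dataset is late or wrong.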

For data lineage to be truly useful, it needs to meet certain requirements:

  • it needs to capture not only dependencies between the datasets being produced but also the business logic producing and transforming them
  • each of these datasets and programs needs to have a form of unified naming so that they can be easily identified and uniformly accessed across different domains
  • all changes in these datasets and programs need to be tracked and versioned with a fine granularity and in an automatic fashion to better understand the evolution over time of the overall ecosystem
  • the metadata describing these datasets and programs needs to be flexible and extensible given the variety of use cases it needs to power
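To make these requirements concrete, here is one hypothetical shape such lineage metadata could take: a run event carrying unified dataset and job names, dataset versions, a pointer to the producing logic, and an open-ended "facets" section for extensibility. The field names are illustrative only, not the OpenLineage specification itself:

```python
import json
from datetime import datetime, timezone

# Illustrative lineage event covering the four requirements above:
# dependencies + producing logic, unified naming, versioning, extensible metadata.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime(2020, 12, 1, tzinfo=timezone.utc).isoformat(),
    "job": {
        "namespace": "warehouse",        # unified naming across domains
        "name": "clean_orders",
        # facets: open-ended, so tools can attach whatever metadata they need
        "facets": {"sourceCode": {"language": "sql", "revision": "a1b2c3"}},
    },
    "inputs":  [{"namespace": "warehouse", "name": "raw.orders",     "version": "42"}],
    "outputs": [{"namespace": "warehouse", "name": "staging.orders", "version": "43"}],
}

print(json.dumps(event, indent=2))
```

Because every event names its inputs and outputs with the same namespace-qualified identifiers, events emitted by different tools can be stitched together into a single cross-platform lineage graph.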

Launching OpenLineage

As industry practitioners, our belief is that the first step to building the right foundation for data lineage is through an industry-wide approach around standard definition. Given the diversity of technical solutions and industry players, data lineage needs to follow a standard agreed upon by these various players to guarantee the compatibility and consistency of the metadata produced by their respective solutions.

Today, we're excited to announce the launch of OpenLineage, a new effort to define such a flexible industry standard for data lineage. While initiated by us (Datakin, the builders of the open source metadata project Marquez), this is by nature a cross-industry effort involving a number of carefully selected participants. We are honored to have been joined by a great group of creators and contributors of major open source projects including (in alphabetical order):

  • Airflow
  • Amundsen
  • Datahub
  • dbt
  • Egeria
  • Great Expectations
  • Iceberg
  • Marquez
  • Pandas
  • Parquet
  • Prefect
  • Spark
  • Superset

The key goals of OpenLineage are to reduce fragmentation and duplication of effort across industry players, and to enable the development of tools and solutions for data operations, governance, and compliance.

This initiative is just starting, and we hope to welcome many other participants. If you are interested in being part of this initiative, you can join our community on Twitter, Slack, Google Groups or GitHub.