1.5.0 - 2023-11-02
Added
- Flink: add Flink lineage for Cassandra Connectors
#2175
@HuangZhenQiu
Adds Flink Cassandra source and sink visitors and Flink Cassandra Integration test.
- Spark: support `rdd` and `toDF` operations available in Spark Scala API
#2188
@pawel-big-lebowski
Includes the first Scala integration test, fixes `ExternalRddVisitor` and adds support for extracting inputs from `MapPartitionsRDD` and `ParallelCollectionRDD` plan nodes.
- Spark: support Databricks Runtime 13.3
#2185
@pawel-big-lebowski
Modifies the Spark integration to support the latest Databricks Runtime version.
Changed
- Airflow: loosen attrs and requests versions
#2107
@JDarDagran
Lowers the version requirements for attrs and requests and removes an unnecessary dependency.
- dbt: render yaml configs lazily
#2221
@JDarDagran
Avoids rendering every entry in the yaml files at startup; entries are rendered lazily instead.
Fixed
- Airflow/Athena: change dataset name to its location
#2167
@sophiely
Replaces the dataset and namespace with the data's physical location for more complete lineage across integrations.
- Python client: skip redaction in column lineage facet
#2177
@JDarDagran
Redacted fields in `ColumnLineageDatasetFacetFieldsAdditionalInputFields` are now skipped.
- Spark: unify dataset naming for RDD jobs and Spark SQL
#2181
@pawel-big-lebowski
Uses the same mechanism to extract the dataset identifier for RDD jobs as is used for Spark SQL.
- Spark: ensure a single `START` and a single `COMPLETE` event are sent
#2103
@pawel-big-lebowski
For Spark SQL, at least four events are sent, triggered by different SparkListener methods. Each of them is required and used to collect facets unavailable elsewhere. However, only one `START` and one `COMPLETE` event should be emitted; all other events should be sent as `RUNNING`. Please keep in mind that the Spark integration remains stateless to limit the memory footprint, and it is the backend's responsibility to merge several OpenLineage events into a meaningful snapshot of metadata changes.
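To illustrate what merging on the backend side could look like for the last entry, here is a minimal Python sketch. It is not how any particular backend implements this: the function name `merge_run_events`, the snapshot layout, and the sample facet names are all hypothetical, and events are assumed to arrive in emission order as plain dictionaries with `eventType` and `run` fields.

```python
from typing import Any, Dict, Iterable


def merge_run_events(events: Iterable[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Fold events into one snapshot per run ID (hypothetical backend logic)."""
    snapshots: Dict[str, Dict[str, Any]] = {}
    for event in events:
        run = event["run"]
        snapshot = snapshots.setdefault(
            run["runId"], {"state": None, "facets": {}, "event_types": []}
        )
        # Remember every event type seen, in order, for inspection.
        snapshot["event_types"].append(event["eventType"])
        # Each event may contribute facets unavailable elsewhere;
        # a later facet with the same key replaces the earlier one.
        snapshot["facets"].update(run.get("facets", {}))
        # Once a run is COMPLETE, later events do not change its state.
        if snapshot["state"] != "COMPLETE":
            snapshot["state"] = event["eventType"]
    return snapshots


if __name__ == "__main__":
    # Hypothetical facet names, used only to show the merge.
    sample = [
        {"eventType": "START", "run": {"runId": "run-1", "facets": {"a": 1}}},
        {"eventType": "RUNNING", "run": {"runId": "run-1", "facets": {"b": 2}}},
        {"eventType": "RUNNING", "run": {"runId": "run-1", "facets": {"c": 3}}},
        {"eventType": "COMPLETE", "run": {"runId": "run-1", "facets": {}}},
    ]
    print(merge_run_events(sample))
```

Because the integration itself keeps no state, each event carries only the facets known at that moment; in this sketch the accumulated view exists only in the dictionary kept by the consumer.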