Apache Spark
Note: This integration is known to work with the latest Spark versions as well as Apache Spark 2.4. Refer to the list of supported versions for up-to-date information.
This integration uses the SparkListener interface, implemented by OpenLineageSparkListener, to
monitor Spark applications. It examines events emitted by the SparkContext and uses the RDD and
DataFrame dependency graphs to extract metadata about jobs and datasets. This approach gathers
information from a variety of data sources, including filesystem sources (e.g., S3 and GCS),
JDBC backends, and data warehouses such as Redshift and BigQuery.
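
Because the integration is a SparkListener, enabling it mostly comes down to registering the listener class through Spark configuration. The sketch below shows one way this might look. It only builds a `spark-submit` command line and does not start Spark. `spark.extraListeners` and `spark.jars.packages` are standard Spark properties. The fully qualified class name, the `spark.openlineage.*` keys, the package coordinate, and the namespace value are assumptions made for illustration and should be checked against the documentation for the version in use.

```python
import shlex

# Fully qualified listener class. Only the class name OpenLineageSparkListener
# comes from the text above; the package prefix is an assumption.
LISTENER_CLASS = "io.openlineage.spark.agent.OpenLineageSparkListener"


def openlineage_conf(package, namespace, transport_type="console"):
    """Return Spark properties that register the listener.

    The spark.openlineage.* key names are assumptions, not confirmed here.
    """
    return {
        "spark.jars.packages": package,          # where Spark fetches the listener jar
        "spark.extraListeners": LISTENER_CLASS,  # standard Spark hook for listeners
        "spark.openlineage.transport.type": transport_type,
        "spark.openlineage.namespace": namespace,
    }


def spark_submit_command(app, conf):
    """Render a spark-submit command line with one --conf flag per property."""
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return shlex.join(args)


conf = openlineage_conf(
    package="io.openlineage:openlineage-spark:<version>",  # placeholder, fill in a real version
    namespace="example_namespace",                         # hypothetical namespace
)
print(spark_submit_command("my_job.py", conf))
```

The same key/value pairs could instead be passed one at a time to `SparkSession.builder.config(key, value)` when the session is created in application code.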