Apache Spark

Note: This integration is known to work with the latest Spark versions as well as Apache Spark 2.4. Refer to the project's list of supported versions for up-to-date information.

This integration hooks into Spark through the SparkListener interface, implemented by OpenLineageSparkListener. The listener examines events emitted by the SparkContext and uses the RDD and DataFrame dependency graphs to extract metadata about jobs and datasets. This approach gathers information from a variety of data sources, including filesystem sources (e.g., S3 and GCS), JDBC backends, and data warehouses such as Redshift and BigQuery.
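
As a rough illustration of how such a listener is typically attached to a Spark application, the sketch below builds a set of Spark properties in plain Python. It is not taken from this page: the helper names `openlineage_spark_conf` and `to_spark_submit_args` are hypothetical, the fully qualified listener class name and the `spark.openlineage.*` property names are assumptions that should be checked against the release you use, and the Maven coordinates are left for the caller to supply. Only `spark.extraListeners` and `spark.jars.packages` are standard Spark properties.

```python
from typing import Dict, List, Optional

# Fully qualified listener class. The package path is an assumption and
# should be verified against the OpenLineage release in use.
LISTENER_CLASS = "io.openlineage.spark.agent.OpenLineageSparkListener"


def openlineage_spark_conf(
    transport_url: str,
    namespace: str,
    package: Optional[str] = None,
) -> Dict[str, str]:
    """Build Spark properties that register the listener (hypothetical helper)."""
    conf = {
        # Standard Spark property: listener classes created with the SparkContext.
        "spark.extraListeners": LISTENER_CLASS,
        # Assumed OpenLineage property names; verify them for your version.
        "spark.openlineage.transport.type": "http",
        "spark.openlineage.transport.url": transport_url,
        "spark.openlineage.namespace": namespace,
    }
    if package:
        # Maven coordinates of the integration jar, supplied by the caller.
        conf["spark.jars.packages"] = package
    return conf


def to_spark_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render the properties as `--conf key=value` arguments for spark-submit."""
    args: List[str] = []
    for key, value in sorted(conf.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args
```

The same dictionary could be applied to a `SparkSession.builder` by calling `.config(key, value)` for each entry, or passed to `spark-submit` using the rendered arguments. Keeping the properties in one place makes it easier to confirm that the listener is registered before the SparkContext starts, since listeners named in `spark.extraListeners` are instantiated at context creation.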