Installation
- Version 1.8.0 and earlier only supported Scala 2.12 variants of Apache Spark.
- Version 1.9.1 and later support both Scala 2.12 and 2.13 variants of Apache Spark.

This necessitates a change in the artifact identifier for io.openlineage:openlineage-spark. After version 1.8.0, the artifact identifier includes the Scala binary version. For later versions, use: io.openlineage:openlineage-spark_${SCALA_BINARY_VERSION}:1.25.0.
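As a quick sanity check, the full artifact coordinate can be assembled from the Scala binary version, which is the major.minor prefix of the full Scala version. A minimal sketch, assuming you already know your Spark build's full Scala version (it is printed by `spark-submit --version`):

```shell
#!/usr/bin/env bash
# Assumed input: the full Scala version of your Spark build (example value).
SCALA_VERSION='2.13.8'
# The binary version is the major.minor prefix; strip the patch component.
SCALA_BINARY_VERSION="${SCALA_VERSION%.*}"
OPENLINEAGE_SPARK_VERSION='1.25.0'
echo "io.openlineage:openlineage-spark_${SCALA_BINARY_VERSION}:${OPENLINEAGE_SPARK_VERSION}"
# → io.openlineage:openlineage-spark_2.13:1.25.0
```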
To integrate OpenLineage Spark with your application, you can:

- Bundle the package with your Apache Spark application project.
- Place the JAR in your ${SPARK_HOME}/jars directory.
- Use the --jars option with spark-submit / spark-shell / pyspark.
- Use the --packages option with spark-submit / spark-shell / pyspark.
Bundle the package with your Apache Spark application project
This approach does not demonstrate how to configure the OpenLineageSparkListener. Please refer to the Configuration section.
For Maven, add the following to your pom.xml:
- After 1.8.0
- 1.8.0 and earlier
<dependency>
    <groupId>io.openlineage</groupId>
    <artifactId>openlineage-spark_${SCALA_BINARY_VERSION}</artifactId>
    <version>1.25.0</version>
</dependency>
<dependency>
    <groupId>io.openlineage</groupId>
    <artifactId>openlineage-spark</artifactId>
    <version>${OPENLINEAGE_SPARK_VERSION}</version>
</dependency>
For Gradle, add this to your build.gradle:
- After 1.8.0
- 1.8.0 and earlier
implementation("io.openlineage:openlineage-spark_${SCALA_BINARY_VERSION}:1.25.0")
implementation("io.openlineage:openlineage-spark:1.8.0")
Place the JAR in your ${SPARK_HOME}/jars directory
This approach does not demonstrate how to configure the OpenLineageSparkListener. Please refer to the Configuration section.
- Download the JAR and its checksum from Maven Central.
- Verify the JAR's integrity using the checksum.
- Upon successful verification, move the JAR to ${SPARK_HOME}/jars.
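The verification step deserves a note: Maven Central's `.sha512` files contain only the hex digest, while `sha512sum -c` expects lines of the form `<digest>  <filename>` (two spaces). A small sketch of that step in isolation, exercised here against a locally created throwaway file (`demo.jar` is a placeholder name, not a real artifact):

```shell
#!/usr/bin/env bash
# Verify a file against a Maven Central-style .sha512 file (digest only, no filename).
verify_sha512() {
  local file="$1" checksum_file="$2"
  # sha512sum -c expects "<digest>  <filename>"; append the filename ourselves.
  echo "$(cat "${checksum_file}")  ${file}" | sha512sum -c -
}

# Demonstration against a locally created file:
echo 'hello' > demo.jar
sha512sum demo.jar | awk '{print $1}' > demo.jar.sha512
verify_sha512 demo.jar demo.jar.sha512   # prints "demo.jar: OK"
```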
This script automates the download and verification process:
- After 1.8.0
- 1.8.0 and earlier
#!/usr/bin/env bash
if [ -z "$SPARK_HOME" ]; then
    echo "SPARK_HOME is not set. Please define it as your Spark installation directory."
    exit 1
fi
OPENLINEAGE_SPARK_VERSION='1.25.0'
SCALA_BINARY_VERSION='2.13' # Example Scala version
ARTIFACT_ID="openlineage-spark_${SCALA_BINARY_VERSION}"
JAR_NAME="${ARTIFACT_ID}-${OPENLINEAGE_SPARK_VERSION}.jar"
CHECKSUM_NAME="${JAR_NAME}.sha512"
BASE_URL="https://repo1.maven.org/maven2/io/openlineage/${ARTIFACT_ID}/${OPENLINEAGE_SPARK_VERSION}"
curl --fail -O "${BASE_URL}/${JAR_NAME}"
curl --fail -O "${BASE_URL}/${CHECKSUM_NAME}"
# The .sha512 file contains only the digest; sha512sum -c expects "<digest>  <filename>".
echo "$(cat "${CHECKSUM_NAME}")  ${JAR_NAME}" | sha512sum -c -
if [ $? -eq 0 ]; then
    mv "${JAR_NAME}" "${SPARK_HOME}/jars"
else
    echo "Checksum verification failed."
    exit 1
fi
#!/usr/bin/env bash
if [ -z "$SPARK_HOME" ]; then
    echo "SPARK_HOME is not set. Please define it as your Spark installation directory."
    exit 1
fi
OPENLINEAGE_SPARK_VERSION='1.8.0' # Example version
ARTIFACT_ID="openlineage-spark"
JAR_NAME="${ARTIFACT_ID}-${OPENLINEAGE_SPARK_VERSION}.jar"
CHECKSUM_NAME="${JAR_NAME}.sha512"
BASE_URL="https://repo1.maven.org/maven2/io/openlineage/${ARTIFACT_ID}/${OPENLINEAGE_SPARK_VERSION}"
curl --fail -O "${BASE_URL}/${JAR_NAME}"
curl --fail -O "${BASE_URL}/${CHECKSUM_NAME}"
# The .sha512 file contains only the digest; sha512sum -c expects "<digest>  <filename>".
echo "$(cat "${CHECKSUM_NAME}")  ${JAR_NAME}" | sha512sum -c -
if [ $? -eq 0 ]; then
    mv "${JAR_NAME}" "${SPARK_HOME}/jars"
else
    echo "Checksum verification failed."
    exit 1
fi
Use the --jars option with spark-submit / spark-shell / pyspark
This approach does not demonstrate how to configure the OpenLineageSparkListener. Please refer to the Configuration section.
- Download the JAR and its checksum from Maven Central.
- Verify the JAR's integrity using the checksum.
- Upon successful verification, submit a Spark application with the JAR using the --jars option.
This script demonstrates the process:
- After 1.8.0
- 1.8.0 and earlier
#!/usr/bin/env bash
OPENLINEAGE_SPARK_VERSION='1.25.0'
SCALA_BINARY_VERSION='2.13' # Example Scala version
ARTIFACT_ID="openlineage-spark_${SCALA_BINARY_VERSION}"
JAR_NAME="${ARTIFACT_ID}-${OPENLINEAGE_SPARK_VERSION}.jar"
CHECKSUM_NAME="${JAR_NAME}.sha512"
BASE_URL="https://repo1.maven.org/maven2/io/openlineage/${ARTIFACT_ID}/${OPENLINEAGE_SPARK_VERSION}"
curl --fail -O "${BASE_URL}/${JAR_NAME}"
curl --fail -O "${BASE_URL}/${CHECKSUM_NAME}"
# The .sha512 file contains only the digest; sha512sum -c expects "<digest>  <filename>".
echo "$(cat "${CHECKSUM_NAME}")  ${JAR_NAME}" | sha512sum -c -
if [ $? -eq 0 ]; then
    spark-submit --jars "path/to/${JAR_NAME}" \
        # ... other options
else
    echo "Checksum verification failed."
    exit 1
fi
#!/usr/bin/env bash
OPENLINEAGE_SPARK_VERSION='1.8.0' # Example version
ARTIFACT_ID="openlineage-spark"
JAR_NAME="${ARTIFACT_ID}-${OPENLINEAGE_SPARK_VERSION}.jar"
CHECKSUM_NAME="${JAR_NAME}.sha512"
BASE_URL="https://repo1.maven.org/maven2/io/openlineage/${ARTIFACT_ID}/${OPENLINEAGE_SPARK_VERSION}"
curl --fail -O "${BASE_URL}/${JAR_NAME}"
curl --fail -O "${BASE_URL}/${CHECKSUM_NAME}"
# The .sha512 file contains only the digest; sha512sum -c expects "<digest>  <filename>".
echo "$(cat "${CHECKSUM_NAME}")  ${JAR_NAME}" | sha512sum -c -
if [ $? -eq 0 ]; then
    spark-submit --jars "path/to/${JAR_NAME}" \
        # ... other options
else
    echo "Checksum verification failed."
    exit 1
fi
Use the --packages option with spark-submit / spark-shell / pyspark
This approach does not demonstrate how to configure the OpenLineageSparkListener. Please refer to the Configuration section.
Spark allows you to add packages at runtime using the --packages option with spark-submit. This option automatically downloads the package from Maven Central (or other configured repositories) at runtime and adds it to the classpath of your Spark application.
- After 1.8.0
- 1.8.0 and earlier
OPENLINEAGE_SPARK_VERSION='1.25.0'
SCALA_BINARY_VERSION='2.13' # Example Scala version
spark-submit --packages "io.openlineage:openlineage-spark_${SCALA_BINARY_VERSION}:${OPENLINEAGE_SPARK_VERSION}" \
    # ... other options
OPENLINEAGE_SPARK_VERSION='1.8.0' # Example version
spark-submit --packages "io.openlineage:openlineage-spark:${OPENLINEAGE_SPARK_VERSION}" \
    # ... other options
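Whichever installation method you choose, the listener itself still has to be registered and configured; the Configuration section covers the full set of options. As a rough sketch only, a spark-submit invocation that registers the listener and emits events to the console typically looks like the following (your_application.py is a placeholder, and the transport settings shown are assumptions for illustration):

```shell
OPENLINEAGE_SPARK_VERSION='1.25.0'
SCALA_BINARY_VERSION='2.13' # Example Scala version
spark-submit \
    --packages "io.openlineage:openlineage-spark_${SCALA_BINARY_VERSION}:${OPENLINEAGE_SPARK_VERSION}" \
    --conf "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener" \
    --conf "spark.openlineage.transport.type=console" \
    your_application.py
```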