Testing
Configurable Integration Test
Starting with version 1.17, the OpenLineage Spark integration provides command-line tooling to help
create custom integration tests. The configurable-test.sh script builds
openlineage-spark from the current directory; script arguments are used to pass the Spark
job. The emitted OpenLineage events are then validated against JSON files containing the expected event fields. Both the build and
the integration test run are performed within a Docker environment, which makes the command
independent of the local Java environment.
Quickstart: try running the following command from the OpenLineage project root directory:
./integration/spark/cli/configurable-test.sh --spark ./integration/spark/cli/spark-conf.yml --test ./integration/spark/cli/tests
This should run the four integration tests from ./integration/spark/cli/tests and store their output in
./integration/spark/cli/runs. Feel free to add extra test directories with custom tests.
What happens when running the configurable-test.sh command?
- First, a Docker container with Java 11 is created. It builds the Docker image openlineage-test:$OPENLINEAGE_VERSION. During the build, all internal dependencies (like openlineage-java) are added to the image, so they do not have to be rebuilt on each run, which speeds up a single command run. If a subproject changes, a new image has to be built.
- Once the Docker image is built, a Docker container is started and runs the Gradle configurableIntegrationTest task. The task depends on shadowJar to build the openlineage-spark jar. The built jar should also be available on the host machine.
- The Gradle test task spawns additional Spark containers, which run the Spark job and emit OpenLineage events to a local file. The Gradle test code has access to the mounted event file location, fetches the emitted events, and verifies them against the expected JSON events. Matching is done through MockServer JSON body matching with the ONLY_MATCHING_FIELDS flag set, as in the other integration tests.
- Test output is written into ./integration/spark/cli/runs, with subdirectories containing the test definition and a file with the events that were emitted.
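The idea behind matching only the fields listed in the expected event can be illustrated with a minimal Python sketch. This is not MockServer's implementation; the function name matches_expected, the array handling (each expected item must match some emitted item, order ignored), and the trimmed sample event are all assumptions made for illustration.

```python
import json


def matches_expected(expected, actual):
    """Return True if every field listed in `expected` is present in `actual`
    with a matching value. Extra fields in `actual` are ignored."""
    if isinstance(expected, dict):
        return isinstance(actual, dict) and all(
            key in actual and matches_expected(value, actual[key])
            for key, value in expected.items()
        )
    if isinstance(expected, list):
        # Assumption: each expected item must match at least one actual item,
        # regardless of order. MockServer's exact array semantics may differ.
        return isinstance(actual, list) and all(
            any(matches_expected(item, candidate) for candidate in actual)
            for item in expected
        )
    return expected == actual


# Hypothetical, heavily trimmed event; real OpenLineage events carry many more fields.
emitted = json.loads("""
{
  "eventType": "COMPLETE",
  "job": {"namespace": "default", "name": "cli_test_application"},
  "inputs": [
    {"namespace": "file", "name": "/tmp/a"},
    {"namespace": "file", "name": "/tmp/b"}
  ]
}
""")

# The expected event lists only the fields the test cares about.
expected = {"eventType": "COMPLETE", "inputs": [{"name": "/tmp/b"}]}

print(matches_expected(expected, emitted))
print(matches_expected({"eventType": "START"}, emitted))
```

The practical consequence is that an expected JSON file can stay small: fields that vary between runs, such as timestamps or run identifiers, can simply be left out.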
Please be aware that the first run of the command will download several gigabytes of Docker images, as well as the Gradle dependencies required to build the JAR from source. All of them are stored within Docker volumes, which makes consecutive runs much faster.
Command details
It is important to run the command from the project root directory. This is the only way to let the created Docker containers mount the volumes containing the Spark integration code, the Java client code, and the SQL integration code. The command has an extra check to verify that the working directory is correct.
Try running:
./integration/spark/cli/configurable-test.sh --help
to see all the options available within your version. These should include:
- --spark - to define the Spark environment configuration file,
- --test - location of the directory containing tests,
- --clean - flag marking the Docker image to be rebuilt from scratch.
Spark configuration file
This is an example Spark environment configuration file:
appName: "CLI test application"
sparkVersion: 3.3.4
scalaBinaryVersion: 2.12
enableHiveSupport: true
packages:
- org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.5.2
sparkConf:
  spark.openlineage.debugFacet.disabled: false
- sparkVersion and scalaBinaryVersion are used to determine the Spark and Scala versions to be tested. Spark is run on Docker from the images available at https://quay.io/repository/openlineage/spark?tab=tags. The combination of Spark and Scala versions provided in the config has to match an available image.
- appName and enableHiveSupport parameters are used when starting the Spark session.
- sparkConf can be used to pass any Spark configuration entries. The OpenLineage transport is file-based with a specified file location and is set within the test being run. Those settings should not be overridden.
- packages lets you define custom jar packages to be installed with the spark-submit command.
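To make the role of packages and sparkConf more concrete, here is a small Python sketch of how such entries could be turned into spark-submit arguments. It is only an illustration: the helper spark_submit_args is hypothetical, and the mapping of sparkConf entries to --conf options is an assumption, not a description of what the test tooling does internally.

```python
def spark_submit_args(config):
    """Build spark-submit arguments from a parsed configuration dict.

    Hypothetical helper: the real tooling may pass these settings differently.
    """
    args = []
    packages = config.get("packages", [])
    if packages:
        # spark-submit expects a comma-separated list of Maven coordinates
        args += ["--packages", ",".join(packages)]
    for key, value in config.get("sparkConf", {}).items():
        # YAML booleans are parsed as Python bools; Spark expects lowercase true/false
        text = str(value).lower() if isinstance(value, bool) else str(value)
        args += ["--conf", f"{key}={text}"]
    return args


# The example configuration above, written as the dict a YAML parser would produce
config = {
    "appName": "CLI test application",
    "sparkVersion": "3.3.4",
    "scalaBinaryVersion": "2.12",
    "enableHiveSupport": True,
    "packages": ["org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.5.2"],
    "sparkConf": {"spark.openlineage.debugFacet.disabled": False},
}

print(spark_submit_args(config))
```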
As of version 1.18, the Spark configuration can accept, instead of sparkVersion, configuration
entries that determine the Docker image to run on:
appName: "CLI test application"
docker:
  image: "apache/spark:3.3.3-scala2.12-java11-python3-ubuntu"
  sparkSubmit: /opt/spark/bin/spark-submit
  waitForLogMessage: ".*ShutdownHookManager: Shutdown hook called.*"
scalaBinaryVersion: 2.12
where:
- image specifies the Docker image used to run the Spark job,
- sparkSubmit is the file location of the spark-submit command,
- waitForLogMessage is a regex for the log entry indicating that the Spark job has finished.
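When using a custom image, it can help to check the waitForLogMessage pattern against a log line from that image before running the tests. The sketch below uses Python's re module as a stand-in; the regex engine used by the test tooling may behave slightly differently, full-line matching is an assumption, and the sample log lines are illustrative only.

```python
import re

# Pattern taken from the example configuration above
WAIT_FOR_LOG_MESSAGE = r".*ShutdownHookManager: Shutdown hook called.*"


def job_finished(log_lines, pattern=WAIT_FOR_LOG_MESSAGE):
    """Return True if any log line fully matches the given pattern."""
    regex = re.compile(pattern)
    return any(regex.fullmatch(line.rstrip("\n")) is not None for line in log_lines)


# Illustrative log lines; the exact format depends on the image's logging setup
sample_log = [
    "24/06/01 10:00:00 INFO SparkContext: Successfully stopped SparkContext",
    "24/06/01 10:00:01 INFO ShutdownHookManager: Shutdown hook called",
]

print(job_finished(sample_log[:1]))
print(job_finished(sample_log))
```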
Tests definition directories
- The specified test directory should contain one or more subdirectories, each containing a separate test definition.
- Each test directory should contain a single .sql or .py pySpark code file containing a job definition. For a .sql file, each line of the file is decorated with spark.sql() and transformed into a pySpark script. For pySpark scripts, the user should instantiate a SparkSession with the OpenLineage parameters configured properly. Please refer to existing tests for usage examples.
- Each test directory should contain one or more event definition files with the .json extension, defining the expected content of any of the events emitted by the job run.
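The .sql handling described above can be pictured with a short Python sketch. It is a rough approximation, not the tool's actual code: the function sql_to_pyspark is hypothetical, skipping blank lines is an assumption, and the generated snippet leaves out the SparkSession setup that a complete script would need.

```python
def sql_to_pyspark(sql_text):
    """Wrap each non-empty line of a .sql file in a spark.sql(...) call.

    Hypothetical sketch; assumes one complete SQL statement per line.
    """
    statements = [line.strip() for line in sql_text.splitlines() if line.strip()]
    # repr() yields a valid Python string literal, including any needed escaping
    return "\n".join(f"spark.sql({statement!r})" for statement in statements)


# Illustrative contents of a test's .sql file
sql_file = """CREATE TABLE t1 (a INT, b STRING)
INSERT INTO t1 VALUES (1, 'x')
CREATE TABLE t2 AS SELECT * FROM t1
"""

print(sql_to_pyspark(sql_file))
```

Because each line becomes its own spark.sql() call, a statement split across several lines would not survive this transformation, so it is safest to keep every statement on a single line.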