Spark Config Parameters
The following parameters can be specified:
Parameter | Definition | Example |
---|---|---|
spark.openlineage.transport.type | The transport type used for emitting events; the default type is console | http
spark.openlineage.namespace | The default namespace to be applied to any submitted jobs | MyNamespace
spark.openlineage.parentJobNamespace | The job namespace to be used for the parent job facet | ParentJobNamespace |
spark.openlineage.parentJobName | The job name to be used for the parent job facet | ParentJobName |
spark.openlineage.parentRunId | The RunId of the parent job that initiated this Spark job | xxxx-xxxx-xxxx-xxxx |
spark.openlineage.rootParentJobNamespace | The namespace of the root parent job | ParentJobNamespace |
spark.openlineage.rootParentJobName | The name of the root parent job | ParentJobName |
spark.openlineage.rootParentRunId | The RunId of the root parent job | xxxx-xxxx-xxxx-xxxx |
spark.openlineage.appName | Custom value overwriting Spark app name in events | AppName |
spark.openlineage.facets.disabled | Deprecated: Use the property spark.openlineage.facets.<facet name>.disabled instead. List of facets to filter out of the events, enclosed in [] (required from 0.21.x) and separated by ; . The default is [] | [columnLineage;]
spark.openlineage.facets.<facet name>.disabled | If set to true, disables the given facet. The default value is false. The facet name can be hierarchical. The facets disabled by default are debug, spark.logicalPlan and spark_unknown; you have to switch the flag to false to enable them. | true
spark.openlineage.facets.variables | List of environment variables (read via System.getenv()) to be captured in the emitted events, enclosed in [] and separated by ; | [MY_ENV_VAR;]
spark.openlineage.capturedProperties | Comma-separated list of properties to be captured in the Spark properties facet (default: spark.master, spark.app.name) | "spark.example1,spark.example2"
spark.openlineage.dataset.removePath.pattern | Java regular expression that removes the ?<remove> named group from the dataset path. Can be used to remove the last path subdirectories from paths like s3://my-whatever-path/year=2023/month=04 | (.*)(?<remove>\/.*\/.*)
spark.openlineage.jobName.appendDatasetName | Decides whether the output dataset name should be appended to the job name. Defaults to true. | false
spark.openlineage.jobName.replaceDotWithUnderscore | Replaces dots in the job name with underscores. Can be used to mimic legacy behaviour on the Databricks platform. Defaults to false. | false
spark.openlineage.job.owners.<ownership-type> | Specifies ownership of the job. Multiple entries with different types are allowed. Config key name and value are used to create job ownership type and name (available since 1.13). | spark.openlineage.job.owners.team="Some Team" |
spark.openlineage.job.tags | List of job-level tags. Tags are passed as a single string, with key and value separated by a colon : and tags separated by a semicolon ; | "key:value;label;another:tag"
spark.openlineage.run.tags | List of run-level tags. Tags are passed as a single string, with key and value separated by a colon : and tags separated by a semicolon ; | "key:value;label;another:tag"
spark.openlineage.columnLineage.datasetLineageEnabled | Includes dataset dependencies in their own dataset property within the column lineage facet. If this flag is set to false, the dataset dependencies are merged into the fields property. The default value is false. Setting it to true is recommended. | true
spark.openlineage.vendors.iceberg.metricsReporterDisabled | Disables the Iceberg metrics reporter, which turns off the mechanism for collecting scan and commit reports. | false
spark.openlineage.filter.allowedSparkNodes | List of Spark plan node names, separated with ; and enclosed in []. Some Spark nodes are filtered out by default so that they do not trigger OpenLineage events. This setting overrides the default behaviour and removes the filtering for the specified nodes. Example usage: [org.apache.spark.sql.catalyst.plans.logical.Aggregate] enables events for Aggregate nodes | empty list
spark.openlineage.filter.deniedSparkNodes | List of Spark plan node names, separated with ; and enclosed in []. Some Spark nodes are filtered out by default so that they do not trigger OpenLineage events. This setting overrides the default behaviour and adds more nodes to filter. | empty list
spark.openlineage.timeout.buildDatasetsTimePercentage | If a timeout is set within a circuit breaker, this configures the percentage of that timeout that can be spent on building datasets. | 90
spark.openlineage.timeout.facetsBuildingTimePercentage | If a timeout is set within a circuit breaker, this configures the percentage of that timeout that can be spent on building facets, which includes job facets, run facets, and dataset facets. This limit effectively applies to everything besides event serialization and transport. | 90
spark.openlineage.disabled | Turns off the OpenLineage integration, similarly to the OPENLINEAGE_DISABLED environment variable. Can be used when setting the environment variable is not possible. This setting works only within the Spark Conf, so that OpenLineage can be disabled before its config parsing mechanism runs. | false
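
The snippet below is a minimal sketch of how several of the parameters above might be set programmatically when building a Spark session. It assumes the OpenLineage Spark listener jar is already on the classpath; the listener class and the spark.openlineage.transport.url key are standard OpenLineage settings not listed in this table, and all concrete values (namespace, URL, owner, tags) are placeholders.

```python
from pyspark.sql import SparkSession

# Sketch only: assumes the openlineage-spark jar is on the classpath;
# the namespace, URL, owner, and tag values below are placeholders.
spark = (
    SparkSession.builder
    .appName("openlineage_config_example")
    # Register the OpenLineage listener so events are emitted at all.
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    # Emit events over HTTP instead of the default console transport.
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://localhost:5000")
    # Default namespace applied to all jobs submitted by this application.
    .config("spark.openlineage.namespace", "MyNamespace")
    # Enable the logical plan facet, which is disabled by default.
    .config("spark.openlineage.facets.spark.logicalPlan.disabled", "false")
    # Ownership and tagging.
    .config("spark.openlineage.job.owners.team", "Some Team")
    .config("spark.openlineage.job.tags", "environment:prod;critical")
    .getOrCreate()
)
```

The same keys can equally be supplied on the command line via spark-submit --conf, for example --conf spark.openlineage.transport.type=http.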