Naming Conventions
Employing a unique naming strategy per resource ensures that the spec is followed uniformly, regardless of which tool produces the metadata.
Jobs and Datasets have separate namespaces: job namespaces are derived from schedulers, and dataset namespaces from data sources.
Dataset Naming
A dataset, or table, is organized according to a producer, namespace, database, and (optionally) schema.
| Data Store | Type | Namespace | Name |
|---|---|---|---|
| Athena | Warehouse | awsathena://athena.{region_name}.amazonaws.com | {catalog}.{database}.{table} |
| AWS Glue | Data catalog | arn:aws:glue:{region}:{account id} | table/{database name}/{table name} |
| Azure Cosmos DB | Warehouse | azurecosmos://{host}/dbs/{database} | colls/{table} |
| Azure Data Explorer | Warehouse | azurekusto://{host}.kusto.windows.net | {database}/{table} |
| Azure Synapse | Warehouse | sqlserver://{host}:{port} | {schema}.{table} |
| BigQuery | Warehouse | bigquery | {project id}.{dataset name}.{table name} |
| Cassandra | Warehouse | cassandra://{host}:{port} | {keyspace}.{table} |
| MySQL | Warehouse | mysql://{host}:{port} | {database}.{table} |
| CrateDB | Warehouse | crate://{host}:{port} | {database}.{schema}.{table} |
| DB2 | Warehouse | db2://{host}:{port} | {database}.{schema}.{table} |
| Hive | Warehouse | hive://{host}:{port} | {database}.{table} |
| MSSQL | Warehouse | mssql://{host}:{port} | {database}.{schema}.{table} |
| OceanBase | Warehouse | oceanbase://{host}:{port} | {database}.{table} |
| Oracle | Warehouse | oracle://{host}:{port} | {serviceName}.{schema}.{table} or {sid}.{schema}.{table} |
| Postgres | Warehouse | postgres://{host}:{port} | {database}.{schema}.{table} |
| Teradata | Warehouse | teradata://{host}:{port} | {database}.{table} |
| Redshift | Warehouse | redshift://{cluster_identifier}.{region_name}:{port} | {database}.{schema}.{table} |
| Snowflake | Warehouse | snowflake://{organization name}-{account name} or snowflake://{account-locator}(.{compliance})(.{cloud_region_id})(.{cloud}) | {database}.{schema}.{table} |
| Spanner | Warehouse | spanner://{projectId}:{instanceId} | {database}.{schema}.{table} |
| Trino | Warehouse | trino://{host}:{port} | {catalog}.{schema}.{table} |
| ABFSS (Azure Data Lake Gen2) | Data lake | abfss://{container name}@{service name}.dfs.core.windows.net | {path} |
| DBFS (Databricks File System) | Distributed file system | dbfs://{workspace name} | {path} |
| GCS | Blob storage | gs://{bucket name} | {object key} |
| HDFS | Distributed file system | hdfs://{namenode host}:{namenode port} | {path} |
| Kafka | Distributed event streaming platform | kafka://{bootstrap server host}:{port} | {topic} |
| Local file system | File system | file | {path} |
| Remote file system | File system | file://{host} | {path} |
| S3 | Blob Storage | s3://{bucket name} | {object key} |
| WASBS (Azure Blob Storage) | Blob Storage | wasbs://{container name}@{service name}.blob.core.windows.net | {object key} |
| PubSub | Distributed event streaming platform | pubsub | topic:{projectId}:{topicId} or subscription:{projectId}:{subscriptionId} |
| In memory | In-memory (temporary datasets with no persistence backend) | inmemory:// | {temporary dataset name or ID} |
Snowflake is a special case: there is no single standardized namespace, due to its account identifier model.
When possible, you should migrate from the legacy account locator format to the {organization name}-{account name} one.
The legacy Snowflake account locator format (which produces snowflake://{locator}.{region}.{cloud} dataset IDs) is still supported,
but it yields dataset IDs that won’t match IDs created with the {organization name}-{account name} format.
If you switch formats later, existing lineage nodes won’t connect to new ones.
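As a concrete illustration, the sketch below builds identifiers for a Postgres table and a Snowflake table following the conventions above, using the Python client's `Dataset` class; the hosts, ports, and table names are hypothetical.

```python
from openlineage.client.run import Dataset

# Postgres: namespace is postgres://{host}:{port},
# name is {database}.{schema}.{table} (values here are hypothetical)
orders = Dataset(
    namespace="postgres://db.example.com:5432",
    name="analytics.public.orders",
)

# Snowflake: prefer the {organization name}-{account name} namespace form
revenue = Dataset(
    namespace="snowflake://acme-analytics",
    name="warehouse.public.revenue",
)
```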
Job Naming
A Job is a recurring data transformation with inputs and outputs. Each execution is captured as a Run with
corresponding metadata.
A Run event identifies the Job it is an instance of by providing the job’s unique identifier.
The Job identifier is composed of a Namespace and Name. The job namespace is usually set in the OpenLineage client
config, and the job name is unique within its namespace (see the sketch after the table below).
| Job type | Name | Example |
|---|---|---|
| Airflow task | {dag_id}.{task_id} | orders_etl.count_orders |
| Spark job | {appName}.{command}.{table} | my_awesome_app.execute_insert_into_hive_table.mydb_mytable |
| SQL | {schema}.{table} | gx.validate_datasets |
| Debezium | {topic.prefix}.{taskId} | inventory.0 |
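For instance, here is a minimal sketch of declaring the Airflow task from the table above with the Python client's `Job` class; the namespace value is a hypothetical client-config setting.

```python
from openlineage.client.run import Job

# The job namespace usually comes from OpenLineage client config;
# "my_airflow_instance" is a hypothetical value.
job = Job(
    namespace="my_airflow_instance",
    name="orders_etl.count_orders",  # {dag_id}.{task_id}
)
```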
Run Naming
Runs are named using client-generated UUIDs. The OpenLineage client is responsible for generating them and maintaining them throughout the run’s lifecycle.
```python
from openlineage.client.run import Run
from openlineage.client.uuid import generate_new_uuid

# Generate a fresh UUID per run; reuse the same ID across all
# events (START, COMPLETE, FAIL, ...) emitted for that run.
run = Run(str(generate_new_uuid()))
```
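Putting the pieces together, here is a hedged sketch of emitting a START event for that run; the endpoint URL, producer URI, and job names are hypothetical.

```python
from datetime import datetime, timezone

from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState
from openlineage.client.uuid import generate_new_uuid

# Hypothetical HTTP endpoint; in practice this comes from client config.
client = OpenLineageClient(url="http://localhost:5000")

run = Run(str(generate_new_uuid()))
job = Job(namespace="my_airflow_instance", name="orders_etl.count_orders")

client.emit(
    RunEvent(
        eventType=RunState.START,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=run,
        job=job,
        producer="https://example.com/my-producer",  # hypothetical
    )
)
```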
Why Naming Matters
Naming enables focused insight into data flows, even when datasets and workflows are distributed across an organization. That focus, made possible by consistent naming, is key to producing useful lineage.