Naming Conventions

Employing a unique naming strategy for each resource ensures that the spec is followed uniformly regardless of the metadata producer.

Jobs and Datasets have separate namespaces: job namespaces are derived from schedulers, while dataset namespaces are derived from data sources.

Dataset Naming

A dataset, or table, is organized according to a producer, namespace, database, and (optionally) a schema.

| Data Store | Type | Namespace | Name |
| --- | --- | --- | --- |
| Athena | Warehouse | awsathena://athena.{region_name}.amazonaws.com | {catalog}.{database}.{table} |
| Azure Cosmos DB | Warehouse | azurecosmos://{host}/dbs/{database} | colls/{table} |
| Azure Data Explorer | Warehouse | azurekusto://{host}.kusto.windows.net | {database}/{table} |
| Azure Synapse | Warehouse | sqlserver://{host}:{port} | {schema}.{table} |
| BigQuery | Warehouse | bigquery:// | {project id}.{dataset name}.{table name} |
| Cassandra | Warehouse | cassandra://{host}:{port} | {keyspace}.{table} |
| MySQL | Warehouse | mysql://{host}:{port} | {database}.{table} |
| Oracle | Warehouse | oracle://{host}:{port} | {serviceName}.{schema}.{table} or {sid}.{schema}.{table} |
| Postgres | Warehouse | postgres://{host}:{port} | {database}.{schema}.{table} |
| Teradata | Warehouse | teradata://{host}:{port} | {database}.{table} |
| Redshift | Warehouse | redshift://{cluster_identifier}.{region_name}:{port} | {database}.{schema}.{table} |
| Snowflake | Warehouse | snowflake://{organization name}-{account name} | {database}.{schema}.{table} |
| Trino | Warehouse | trino://{host}:{port} | {catalog}.{schema}.{table} |
| ABFSS (Azure Data Lake Gen2) | Data lake | abfss://{container name}@{service name} | {path} |
| DBFS (Databricks File System) | Distributed file system | hdfs://{workspace name} | {path} |
| GCS | Blob storage | gs://{bucket name} | {object key} |
| HDFS | Distributed file system | hdfs://{namenode host}:{namenode port} | {path} |
| Kafka | Distributed event streaming platform | kafka://{bootstrap server host}:{port} | {topic} |
| Local file system | File system | file://{host} | {path} |
| S3 | Blob storage | s3://{bucket name} | {object key} |
| WASBS (Azure Blob Storage) | Blob storage | wasbs://{container name}@{service name} | {object key} |
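The conventions above can be applied programmatically. A minimal sketch of two of the rows; the helper functions are hypothetical illustrations, not part of the OpenLineage client:

```python
# Hypothetical helpers that compose (namespace, name) pairs
# following the dataset naming conventions in the table above.

def postgres_dataset(host: str, port: int, database: str,
                     schema: str, table: str) -> tuple[str, str]:
    """Postgres: namespace is postgres://{host}:{port},
    name is {database}.{schema}.{table}."""
    return (f"postgres://{host}:{port}", f"{database}.{schema}.{table}")

def s3_dataset(bucket: str, object_key: str) -> tuple[str, str]:
    """S3: namespace is s3://{bucket name}, name is the object key."""
    return (f"s3://{bucket}", object_key)

print(postgres_dataset("db.example.com", 5432, "analytics", "public", "orders"))
# → ('postgres://db.example.com:5432', 'analytics.public.orders')
print(s3_dataset("my-bucket", "data/orders.parquet"))
# → ('s3://my-bucket', 'data/orders.parquet')
```

Keeping namespace and name as separate strings matters: the namespace identifies the data source, and the name is only unique within it.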

Job Naming

A Job is a recurring data transformation with inputs and outputs. Each execution is captured as a Run with corresponding metadata. A Run event identifies the Job it is an instance of by providing the job's unique identifier. The Job identifier is composed of a Namespace and a Name. The job namespace is usually set in the OpenLineage client config. The job name is unique within its namespace.

| Job type | Name | Example |
| --- | --- | --- |
| Airflow task | {dag_id}.{task_id} | orders_etl.count_orders |
| Spark job | {appName}.{command}.{table} | my_awesome_app.execute_insert_into_hive_table.mydb_mytable |
| SQL | {schema}.{table} | gx.validate_datasets |
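For example, the Airflow convention above can be expressed as a one-line helper (a hypothetical illustration, not part of any integration's API):

```python
# Hypothetical helper: compose an Airflow task's OpenLineage job name
# from its DAG ID and task ID, per the {dag_id}.{task_id} convention.
def airflow_job_name(dag_id: str, task_id: str) -> str:
    return f"{dag_id}.{task_id}"

print(airflow_job_name("orders_etl", "count_orders"))  # → orders_etl.count_orders
```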

Run Naming

Runs are named using client-generated UUIDs. The OpenLineage client is responsible for generating them and maintaining them throughout the run's lifecycle, so that every event for a given run carries the same identifier.

```python
from openlineage.client.run import Run
from openlineage.client.uuid import generate_new_uuid

run = Run(str(generate_new_uuid()))
```
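Since a run ID is just a UUID string, the same idea can be sketched with only the standard library (assuming a random RFC 4122 UUID suffices where the client is not available):

```python
import uuid

# Any standard UUID string is a valid run ID; the client's
# generate_new_uuid is a convenience. Reuse the SAME ID for
# every event (START, COMPLETE, FAIL, ...) of a single run.
run_id = str(uuid.uuid4())
print(run_id)
```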

Why Naming Matters

Consistent naming enables focused insight into data flows, even when datasets and workflows are distributed across an organization. This focus is key to producing useful lineage.

Additional Resources