
Naming Conventions

Employing a unique naming strategy per resource ensures that the spec is followed uniformly regardless of metadata producer.

Jobs and Datasets each have their own namespaces: job namespaces are derived from schedulers, and dataset namespaces from data sources.

Dataset Naming

A dataset, or table, is organized according to a producer, namespace, database and (optionally) schema.

| Data Store | Type | Namespace | Name | Format |
|---|---|---|---|---|
| Athena | Warehouse | Host | Catalog, Database, Table | awsathena://athena.{region_name}.amazonaws.com/{catalog}.{database}.{table} |
| Azure Cosmos DB | Warehouse | Host, Database | Schema, Table | azurecosmos://{host}/dbs/{database}/colls/{table} |
| Azure Data Explorer | Warehouse | Host | Database, Table | azurekusto://{host}.kusto.windows.net/{database}/{table} |
| Azure Synapse | Warehouse | Host, Port, Database | Schema, Table | sqlserver://{host}:{port};database={database}/{schema}.{table} |
| BigQuery | Warehouse | bigquery | Project ID, dataset, table | bigquery://{project id}.{dataset name}.{table name} |
| Cassandra | Warehouse | Host, Port | Keyspace, Table | cassandra://{host}:{port}/{keyspace}.{table} |
| MySQL | Warehouse | Host, Port | Database, Table | mysql://{host}:{port}/{database}.{table} |
| Postgres | Warehouse | Host, Port | Database, Schema, Table | postgres://{host}:{port}/{database}.{schema}.{table} |
| Redshift | Warehouse | Host, Port | Database, Schema, Table | redshift://{cluster_identifier}.{region_name}:{port}/{database}.{schema}.{table} |
| Snowflake | Warehouse | Account identifier (composite of organization name and account name) | Database, Schema, Table | snowflake://{organization name}-{account name}/{database}.{schema}.{table} |
| Trino | Warehouse | Host, Port | Catalog, Schema, Table | trino://{host}:{port}/{catalog}.{schema}.{table} |
| ABFSS (Azure Data Lake Gen2) | Data lake | Container, service | Path | abfss://{container name}@{service name}/{path} |
| DBFS (Databricks File System) | Distributed file system | Workspace | Path | hdfs://{workspace name}/{path} |
| GCS | Blob storage | Bucket | Path | gs://{bucket name}/{path} |
| HDFS | Distributed file system | Namenode host and port | Path | hdfs://{namenode host}:{namenode port}/{path} |
| Kafka | Distributed event streaming platform | Bootstrap server host and port | Topic | kafka://{bootstrap server host}:{port}/{topic name} |
| Local file system | File system | IP, Port | Path | file://{IP}:{port}/{path} |
| S3 | Blob storage | Bucket name | Path | s3://{bucket name}/{path} |
| WASBS (Azure Blob Storage) | Blob storage | Container, service | Path | wasbs://{container name}@{service name}/{path} |
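
To make the namespace/name split concrete, here is a minimal sketch that builds dataset identifiers for the Postgres and S3 rows above. The helper names (`DatasetId`, `postgres_dataset`, `s3_dataset`, `full_uri`) and the sample host, bucket, and table values are hypothetical and not part of any OpenLineage library; the sketch only assumes that the namespace and name are joined with `/` to produce the Format column.

```python
from typing import NamedTuple


class DatasetId(NamedTuple):
    """Hypothetical container for a dataset's namespace and name."""
    namespace: str
    name: str


def postgres_dataset(host: str, port: int, database: str, schema: str, table: str) -> DatasetId:
    # Namespace comes from host and port; name from database, schema, and table.
    return DatasetId(f"postgres://{host}:{port}", f"{database}.{schema}.{table}")


def s3_dataset(bucket: str, path: str) -> DatasetId:
    # Namespace comes from the bucket; name is the object path without a leading slash.
    return DatasetId(f"s3://{bucket}", path.lstrip("/"))


def full_uri(dataset: DatasetId) -> str:
    # Assumption: the Format column is the namespace and name joined by "/".
    return f"{dataset.namespace}/{dataset.name}"


orders = postgres_dataset("db.example.com", 5432, "sales", "public", "orders")
print(full_uri(orders))  # postgres://db.example.com:5432/sales.public.orders
```

Keeping the namespace and name as separate fields, rather than a single string, mirrors the Namespace and Name columns of the table and avoids having to re-parse the URI later.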

Job Naming

A Job is a recurring data transformation with inputs and outputs. Each execution is captured as a Run with corresponding metadata. A Run event identifies the Job it is an instance of by providing the Job's unique identifier, which is composed of a Namespace and a Name. The job name is unique within its namespace.

| Producer | Formula | Example |
|---|---|---|
| Airflow | namespace + DAG + task | airflow-staging.orders_etl.count_orders |
| SQL | namespace + name | gx.validate_datasets |
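
The formulas above can be sketched as simple string composition. The function names below are hypothetical, and the `.` separator is an assumption inferred from the two examples rather than something the table states explicitly.

```python
def airflow_job_identifier(namespace: str, dag_id: str, task_id: str) -> str:
    # Airflow formula: namespace + DAG + task, joined with "." (assumed separator).
    return ".".join([namespace, dag_id, task_id])


def sql_job_identifier(namespace: str, name: str) -> str:
    # SQL formula: namespace + name, joined with "." (assumed separator).
    return ".".join([namespace, name])


print(airflow_job_identifier("airflow-staging", "orders_etl", "count_orders"))
# airflow-staging.orders_etl.count_orders
print(sql_job_identifier("gx", "validate_datasets"))
# gx.validate_datasets
```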

Run Naming

Runs are named using client-generated UUIDs. The OpenLineage client is responsible for generating them and maintaining them throughout the lifecycle of the run.

from uuid import uuid4
from openlineage.client.run import Run
# Generate the run ID once and keep it for the whole run.
run = Run(str(uuid4()))
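
As an illustration of maintaining one run ID across a run, the sketch below builds a START and a COMPLETE event as plain dictionaries that share the same UUID. It deliberately uses only the standard library instead of the OpenLineage client; `make_event` is a hypothetical helper, the producer URL is a placeholder, and splitting the Airflow example into namespace `airflow-staging` and name `orders_etl.count_orders` is an assumption.

```python
from datetime import datetime, timezone
from uuid import uuid4


def make_event(event_type: str, run_id: str, job_namespace: str, job_name: str) -> dict:
    # Hypothetical helper: a minimal event shape with the run and job identifiers.
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": job_namespace, "name": job_name},
        "producer": "https://example.com/hypothetical-producer",  # placeholder
    }


# Generate the UUID once, when the run starts...
run_id = str(uuid4())
start = make_event("START", run_id, "airflow-staging", "orders_etl.count_orders")

# ...and reuse it for every later event of the same run.
complete = make_event("COMPLETE", run_id, "airflow-staging", "orders_etl.count_orders")
```

Generating a fresh UUID for the COMPLETE event would make it look like a separate run, so the ID is created once and passed in explicitly.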

Why Naming Matters

Naming enables focused insight into data flows, even when datasets and workflows are distributed across an organization. That focus is key to producing useful lineage.


Additional Resources