This integration is considered experimental: only specific workflows and use cases are supported.
Apache Flink is one of the most popular stream processing frameworks. Apache Flink jobs run on clusters,
which are composed of two types of nodes: TaskManagers and JobManagers. While clusters typically consist of
multiple TaskManagers, the only reason to run multiple JobManagers is high availability. Jobs are submitted
to the JobManager by the JobClient, which compiles the user application into a dataflow graph that the
JobManager can understand. The JobManager then coordinates job execution: it distributes the parallel units of a job
to the TaskManagers, manages heartbeats, triggers checkpoints, reacts to failures and much more.
Apache Flink has multiple deployment modes - Session Mode, Application Mode and Per-Job Mode. The most popular
are Session Mode and Application Mode. In Session Mode, a single JobManager manages multiple jobs sharing one
Flink cluster; the JobClient is executed on the machine that submits the job to the cluster.
Application Mode is used when the cluster is utilized for a single job. In this mode, the JobClient,
where the main method runs, is executed on the cluster itself.
Flink jobs read data from Sources and write data to Sinks. In contrast to systems like Apache Spark, Flink jobs can write
data to multiple places - they can have multiple Sinks.
Getting lineage from Flink
OpenLineage utilizes Flink's JobListener interface. This interface is used by Flink to notify the user of job submission,
successful completion of a job, or job failure. Implementations of this interface are executed on the JobClient.
When the OpenLineage listener receives the information that a job was submitted, it extracts
Transformations from the job's execution environment. The Transformations represent logical operations in the dataflow graph; they are composed
of both Flink's built-in operators and user-provided Sources, Sinks and functions. To get the lineage, the
OpenLineage integration processes the dataflow graph. Currently, OpenLineage is interested only in the information contained in
Sources and Sinks, as these are the places where Flink interacts with external systems.
After job submission, the OpenLineage integration starts actively listening to checkpoints - this gives insight into whether the job is running properly.
Currently, OpenLineage's Flink integration is limited to getting information from jobs running in Application Mode.
The OpenLineage integration extracts lineage only from the following sources and sinks.
We expect this list to grow as we add support for more connectors.
(1) KafkaSink supports sinks that write to a single topic as well as multi-topic sinks. The
limitation for multi-topic sinks is that the topics need to have the same schema, and the implementation
of KafkaRecordSerializationSchema must extend KafkaTopicsDescriptor; KafkaTopicsDescriptor is used to extract the multiple topics
from a sink.
In your job, you need to set up the OpenLineageFlinkJobListener and register it on the execution environment:

```java
JobListener listener = OpenLineageFlinkJobListener.builder()
    .executionEnvironment(streamExecutionEnvironment)
    .build();
streamExecutionEnvironment.registerJobListener(listener);
```
Also, OpenLineage needs certain parameters to be set in flink-conf.yaml:

| Configuration Key  | Value | Description |
| ------------------ | ----- | ----------- |
| execution.attached | true  | This setting needs to be true if OpenLineage is to detect job start and failure |
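For example, in flink-conf.yaml:

```yaml
# Required so that the OpenLineage listener can detect job start and failure
execution.attached: true
```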
The OpenLineage jar needs to be present on the JobManager. Once the
JobListener is configured, you need to point the OpenLineage integration to where the events should end up.
If you're using Marquez, the simplest way to do that is to set the OPENLINEAGE_URL environment variable to
Marquez's URL. More advanced settings can be found in the client documentation.
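For instance, assuming a local Marquez instance listening on port 5000 (the host and port here are placeholders for illustration):

```shell
# Point the OpenLineage client at the Marquez API (host and port are placeholders)
export OPENLINEAGE_URL=http://localhost:5000
```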
Configuring OpenLineage connector
The Flink OpenLineage connector utilizes the standard Java client for OpenLineage and allows all the configuration features present there to be used. The configuration can be passed with:

- An openlineage.yml file, with the environment property OPENLINEAGE_CONFIG set and pointing to the configuration file. The file structure and allowed options are described here.
- Standard Flink configuration with the parameters defined below.
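To illustrate the first option, a minimal openlineage.yml could look like this (the URL is an assumed local Marquez address; see the client documentation for all available options):

```yaml
transport:
  type: http
  url: http://localhost:5000
```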
Flink Configuration parameters
The following parameters can be specified:
| Parameter                           | Definition |
| ----------------------------------- | ---------- |
| openlineage.transport.type          | The transport type used for event emit; the default type is console |
| openlineage.facets.disabled         | List of facets to disable, enclosed in [] (required from 0.21.x) and separated by ; |
| openlineage.transport.endpoint      | Path to the resource |
| openlineage.transport.auth.apiKey   | An API key to be used when sending events to the OpenLineage server |
| openlineage.transport.timeout       | Timeout for sending OpenLineage info in milliseconds |
| openlineage.transport.urlParams.xyz | A URL parameter (replace xyz) and value to be included in requests to the OpenLineage API server |
| openlineage.transport.url           | The hostname of the OpenLineage API server where events should be reported; it can have other properties embedded |
| openlineage.transport.headers.xyz   | A request header (replace xyz) and value to be included in requests to the OpenLineage API server |
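Put together, a minimal HTTP transport configuration in flink-conf.yaml might look like this (the host and API key are placeholder values, not defaults):

```yaml
openlineage.transport.type: http
openlineage.transport.url: http://localhost:5000
openlineage.transport.endpoint: /api/v1/lineage
openlineage.transport.auth.apiKey: abcdefgh
```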
You can also supply HTTP transport parameters by embedding them in the URL; the parsed
openlineage.* properties are located in the URL as follows:
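As a sketch (the host, API key and parameter values are made up, and exact parsing behavior depends on the client version), a URL with embedded properties and its parsed equivalent:

```yaml
# URL with embedded properties:
openlineage.transport.url: http://localhost:5000/api/v1/lineage?api_key=abc123&xyz=value

# is parsed into the equivalent of:
# openlineage.transport.url: http://localhost:5000
# openlineage.transport.endpoint: /api/v1/lineage
# openlineage.transport.auth.apiKey: abc123
# openlineage.transport.urlParams.xyz: value
```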
If openlineage.transport.type is set to
kinesis, then the below parameters are read and used when building the KinesisProducer.
Also, the KinesisTransport requires you to provide the artifact
com.amazonaws:amazon-kinesis-producer:0.14.0 or compatible on your classpath.

| Parameter                              | Definition |
| -------------------------------------- | ---------- |
| openlineage.transport.streamName       | Required; the streamName of the Kinesis Stream |
| openlineage.transport.region           | Required; the region of the stream |
| openlineage.transport.roleArn          | Optional; the roleArn which is allowed to read/write to the Kinesis stream |
| openlineage.transport.properties.[xxx] | Optional; [xxx] is any allowed Kinesis producer property |
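For example (the stream name and region are illustrative values, not defaults):

```yaml
openlineage.transport.type: kinesis
openlineage.transport.streamName: openlineage-events
openlineage.transport.region: us-east-1
```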
If openlineage.transport.type is set to
kafka, then the below parameters are read and used when building the KafkaProducer.

| Parameter                              | Definition |
| -------------------------------------- | ---------- |
| openlineage.transport.topicName        | Required; name of the topic |
| openlineage.transport.localServerId    | Required; id of the local server |
| openlineage.transport.properties.[xxx] | Optional; [xxx] is any Kafka client property |
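For example (the topic name, server id and broker address are illustrative values; bootstrap.servers is one example of a Kafka client property passed through the properties prefix):

```yaml
openlineage.transport.type: kafka
openlineage.transport.topicName: openlineage-events
openlineage.transport.localServerId: flink-cluster-1
openlineage.transport.properties.bootstrap.servers: localhost:9092
```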
Please note that configuration parameters provided via the standard Flink configuration are translated into OpenLineage Java client config entries; whenever a new configuration feature is added to the Java client, it becomes available to Flink users out of the box with no changes required.