OpenLineage
OpenLineage is an open standard for tracking data lineage across processing systems. It standardizes the collection of metadata about data pipelines, enabling better visibility, debugging, and governance of data workflows.
Dataedo provides a public API with a dedicated lineage endpoint. When tools such as Apache Airflow and Apache Spark are configured to emit OpenLineage events, those events are captured by Dataedo and stored in the open_lineage_events table.
The collected events can then be imported and analyzed using Dataedo's OpenLineage connector, offering powerful lineage visualization and insights into your data pipelines.
Catalog and Documentation
Dataedo imports jobs, input datasets, and output datasets extracted from OpenLineage events that have a status of COMPLETE and were successfully sent to the Dataedo Public API. These events are saved in the open_lineage_events table in the Dataedo repository.
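For orientation, the sketch below shows roughly what a COMPLETE RunEvent stored in that table looks like. The namespaces, names, and runId are made up for illustration; real events emitted by Airflow or Spark carry additional facets (schema, column lineage, and so on):

{
  "eventType": "COMPLETE",
  "eventTime": "2024-01-15T10:30:00.000Z",
  "run": { "runId": "d46e465b-d358-4d32-83d4-df660ff614dd" },
  "job": { "namespace": "my_airflow_instance", "name": "daily_sales_etl" },
  "inputs": [ { "namespace": "postgres://warehouse", "name": "public.orders" } ],
  "outputs": [ { "namespace": "postgres://warehouse", "name": "public.daily_sales" } ],
  "producer": "https://github.com/OpenLineage/OpenLineage/tree/1.28.0/integration/airflow",
  "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/definitions/RunEvent"
}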
Jobs
Overview
On the overview page, you will see basic information about the job, such as Job Namespace (in the Schema field) and Job Name.

Script
If the job has a script, it will be visible in the Script tab.

Data Lineage
If the Run Event contains information about lineage, it will be visible in the Data Lineage tab.

Input Datasets
Overview
On the overview page, you will see basic information about the dataset, such as Dataset Namespace (in the Schema field) and Dataset Name.

Fields
If the dataset has fields, they will be visible in the Columns tab.

Data Lineage
If the Run Event contains information about lineage, it will be visible in the Data Lineage tab.

Output Datasets
Overview
On the overview page, you will see basic information about the dataset, such as Dataset Namespace (in the Schema field) and Dataset Name.

Fields
If the dataset has fields, they will be visible in the Columns tab.

Data Lineage
If the Run Event contains information about lineage, it will be visible in the Data Lineage tab.

Specification
Imported Metadata
Dataedo reads the following metadata from OpenLineage events:
| | Imported | Editable |
|---|---|---|
| RunEvent | ✅ | |
| Inputs | ✅ | |
| Fields | ✅ | |
| Outputs | ✅ | |
| Fields | ✅ | |
| Input Fields | ✅ | |
Configuration and Import
To gather OpenLineage events, you must first enable the Dataedo Public API and obtain an API token. Next, configure the OpenLineage event emitter in your tool. Emitted events are stored in the open_lineage_events table in the Dataedo repository. To process them, run the OpenLineage connector import.
Configuration of Dataedo Public API
To enable the Dataedo Public API, follow the steps from the article: Dataedo Public API Authorization
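Once the API is enabled, you can smoke-test the lineage endpoint before configuring any producer. The Python sketch below is illustrative only: it posts a minimal RunEvent with the requests library and assumes the Bearer scheme used by OpenLineage's api_key transport; replace the placeholders with your portal URL and generated token.

import requests

# Placeholders -- substitute your real portal URL and API key
BASE_URL = "{YOUR_DATAEDO_PORTAL_PUBLIC_API_URL}"
API_KEY = "{API_KEY_GENERATED_IN_DATAEDO_PORTAL}"

# A minimal COMPLETE event with no inputs or outputs, just to test connectivity
event = {
    "eventType": "COMPLETE",
    "eventTime": "2024-01-15T10:30:00.000Z",
    "run": {"runId": "d46e465b-d358-4d32-83d4-df660ff614dd"},
    "job": {"namespace": "smoke_test", "name": "hello_lineage"},
    "inputs": [],
    "outputs": [],
    "producer": "manual-test",  # illustrative value, not a real producer URI
    "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/definitions/RunEvent",
}

# OpenLineage's api_key transport sends the key as a Bearer token (assumed here)
resp = requests.post(
    f"{BASE_URL}/public/v1/lineage",
    json=event,
    headers={"Authorization": f"Bearer {API_KEY}"},
)
print(resp.status_code)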
Apache Airflow Configuration
To enable emitting OpenLineage events, follow the official documentation: Apache Airflow OpenLineage provider configuration
Example Airflow OpenLineage configuration file:
transport:
  type: http
  url: {YOUR_DATAEDO_PORTAL_PUBLIC_API_URL}
  endpoint: public/v1/lineage
  auth:
    type: api_key
    apiKey: {API_KEY_GENERATED_IN_DATAEDO_PORTAL}
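If you would rather not ship a config file, the Airflow OpenLineage provider can also read the transport definition as JSON from an environment variable (or from the [openlineage] section of airflow.cfg). A sketch using the same placeholders:

AIRFLOW__OPENLINEAGE__TRANSPORT='{"type": "http", "url": "{YOUR_DATAEDO_PORTAL_PUBLIC_API_URL}", "endpoint": "public/v1/lineage", "auth": {"type": "api_key", "apiKey": "{API_KEY_GENERATED_IN_DATAEDO_PORTAL}"}}'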
Apache Spark Configuration
To enable emitting OpenLineage events, follow the official documentation: Quickstart with Jupyter
Example Spark session configuration:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.master("local")
    .appName("sample_spark")
    # Register the OpenLineage listener and fetch its jar from Maven
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:1.28.0")
    # Send events over HTTP to the Dataedo Public API lineage endpoint
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "{YOUR_DATAEDO_PORTAL_PUBLIC_API_URL}")
    .config("spark.openlineage.transport.endpoint", "/public/v1/lineage")
    .config("spark.openlineage.transport.auth.type", "api_key")
    .config("spark.openlineage.transport.auth.apiKey", "{API_KEY_GENERATED_IN_DATAEDO_PORTAL}")
    # Emit dataset-level lineage derived from column-level lineage
    .config("spark.openlineage.columnLineage.datasetLineageEnabled", "true")
    .getOrCreate()
)
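To verify the setup, run a job that actually writes data, since the Spark integration emits events when actions execute. A minimal sketch (the paths are made up for illustration):

# Write an input dataset, then derive an output dataset from it
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.write.mode("overwrite").parquet("/tmp/openlineage_demo/users")

counts = spark.read.parquet("/tmp/openlineage_demo/users").groupBy("name").count()
counts.write.mode("overwrite").parquet("/tmp/openlineage_demo/user_counts")

Each write triggers START and COMPLETE events; only the COMPLETE events are imported by Dataedo.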
Apache Spark on Databricks
To enable emitting OpenLineage events, follow the official documentation: Quickstart with Databricks
Example Spark session configuration:
# Transport settings for the Dataedo Public API lineage endpoint
spark.conf.set("spark.openlineage.transport.type", "http")
spark.conf.set("spark.openlineage.transport.url", "{YOUR_DATAEDO_PORTAL_PUBLIC_API_URL}")
spark.conf.set("spark.openlineage.transport.endpoint", "/public/v1/lineage")
spark.conf.set("spark.openlineage.transport.auth.type", "api_key")
spark.conf.set("spark.openlineage.transport.auth.apiKey", "{API_KEY_GENERATED_IN_DATAEDO_PORTAL}")
spark.conf.set("spark.openlineage.columnLineage.datasetLineageEnabled", "true")
Apache Spark on AWS Glue
To enable emitting OpenLineage events, follow the official documentation: Quickstart with AWS Glue
Example job configuration:
--conf spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener
--conf spark.openlineage.transport.type=http
--conf spark.openlineage.transport.url={YOUR_DATAEDO_PORTAL_PUBLIC_API_URL}
--conf spark.openlineage.transport.endpoint=/public/v1/lineage
--conf spark.openlineage.transport.auth.type=api_key
--conf spark.openlineage.transport.auth.apiKey={API_KEY_GENERATED_IN_DATAEDO_PORTAL}
--conf spark.openlineage.columnLineage.datasetLineageEnabled=true
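Glue job parameters accept each key only once, so a common workaround (a general Glue pattern, not Dataedo-specific) is to chain all settings into the value of a single --conf parameter:

Key:   --conf
Value: spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener --conf spark.openlineage.transport.type=http --conf spark.openlineage.transport.url={YOUR_DATAEDO_PORTAL_PUBLIC_API_URL} --conf spark.openlineage.transport.endpoint=/public/v1/lineage --conf spark.openlineage.transport.auth.type=api_key --conf spark.openlineage.transport.auth.apiKey={API_KEY_GENERATED_IN_DATAEDO_PORTAL} --conf spark.openlineage.columnLineage.datasetLineageEnabled=true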
Processing OpenLineage Events with Dataedo OpenLineage Connector
To process OpenLineage events stored in the Dataedo repository, select Add source -> New connection and choose OpenLineage from the connector list.

Select how many past days of events to analyze, click Connect, and go through the import process.

If you have several OpenLineage producers with different namespaces, you can import each one into a separate data source by filtering on the namespace in the Dataedo Schema field.
