Lineage

What is Lineage?

Lineage is a set of technologies designed to track the relationships between Jobs and Datasets. Your services submit metadata about jobs and datasets through an API such as OpenLineage, and a backend such as Marquez stores that metadata in a database.

With Lineage, you can observe key metadata, such as:

  • Who uses a dataset as input
  • Who modifies it
  • Which job produced a specific version of a dataset

Note that Lineage tracks metadata only; it does not provide access to the actual data.

In the context of Apache Spark, you can integrate Lineage using the External Spark Listener Class. This listener acts as an observer for key events such as job creation, execution, and dataset updates. When an event occurs, the listener sends the corresponding metadata to the Marquez server for tracking.
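To make the idea of "metadata about a job run" concrete, here is a minimal sketch of the kind of event a listener might send. The field names loosely follow the OpenLineage run event (eventType, eventTime, run, job, inputs, outputs, producer). The job and dataset names and the producer URL are hypothetical, and real events carry additional fields (such as a schema URL and facets) that are omitted here.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(event_type, job_name, inputs, outputs, namespace="ilum", run_id=None):
    """Build a simplified OpenLineage-style run event as a plain dict."""
    return {
        "eventType": event_type,  # e.g. "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        # Datasets are identified by namespace and name only: metadata, not data.
        "inputs": [{"namespace": namespace, "name": name} for name in inputs],
        "outputs": [{"namespace": namespace, "name": name} for name in outputs],
        # Placeholder value; a real producer URL identifies the emitting integration.
        "producer": "https://example.com/hypothetical-producer",
    }


# Hypothetical job that reads one dataset and writes another.
event = build_run_event("COMPLETE", "daily_orders_etl", ["raw.orders"], ["curated.orders"])
print(json.dumps(event, indent=2))
```

The event names the job, the run, and the datasets it touched, which is enough for the backend to link a dataset version to the run that produced it.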

Ilum Lineage

Ilum integrates Marquez into your architecture automatically, configures jobs to use the external listener class, and provides its own UI for observing data flow in your applications.

Ilum jobs with Marquez

Lineage is not enabled in Ilum by default, so you must enable it explicitly:

helm upgrade \
--set global.lineage.enabled=true \
--reuse-values ilum ilum/ilum

Normally you would need to create a database manually and configure Marquez to use it, but Ilum does this automatically.

Note: if you use custom credentials for your PostgreSQL database, like this:

helm upgrade \
--set postgresql.auth.username=customuser \
--set postgresql.auth.password="CHOOSE PASSWORD" \
--reuse-values ilum ilum/ilum

You must also specify the same credentials in the Marquez configuration:

helm upgrade \
--set ilum-marquez.marquez.db.password="CHOOSE PASSWORD" \
--set ilum-marquez.marquez.db.user=customuser \
--reuse-values ilum ilum/ilum

Spark configuration

Without Ilum, you would need to add these settings to each of your Spark sessions:

spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener
spark.openlineage.transport.type=http
spark.openlineage.transport.url=http://ilum-marquez:9555/api/v1/namespaces/ilum

Ilum applies these settings automatically to all Ilum jobs and to SQL Viewer Spark sessions.

Ilum also provides its own UI, which is connected to the Marquez backend.
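For sessions started outside Ilum, the three settings listed above could be assembled programmatically. The following is a minimal sketch: the helper names `openlineage_spark_conf` and `to_submit_args` are hypothetical, the default URL and namespace are taken from the settings above, and it assumes the OpenLineage Spark integration is already available on the Spark classpath.

```python
def openlineage_spark_conf(marquez_url="http://ilum-marquez:9555", namespace="ilum"):
    """Return the three OpenLineage settings as a dict of Spark properties."""
    base = marquez_url.rstrip("/")
    return {
        "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
        "spark.openlineage.transport.type": "http",
        "spark.openlineage.transport.url": f"{base}/api/v1/namespaces/{namespace}",
    }


def to_submit_args(conf):
    """Turn a dict of Spark properties into `--conf key=value` arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


conf = openlineage_spark_conf()
# Prints the arguments in a form that can be appended to a spark-submit command.
print(" ".join(to_submit_args(conf)))
```

Keeping the settings in one function makes it harder for individual sessions to drift apart, which is the same problem Ilum solves by injecting them for you.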

Usage

Lineage is a valuable tool for monitoring data flow within your application.

  • It enables team members to easily identify which jobs modify a dataset and which jobs utilize it.
  • Lineage provides insight into metadata associated with datasets.
  • It allows you to review dataset versions and trace back to the specific job run that caused any changes. This functionality enhances your ability to troubleshoot and manage your systems effectively.
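The same questions can also be asked of the Marquez backend directly. The sketch below builds request URLs for listing jobs, listing datasets, reviewing dataset versions, and fetching a dataset's lineage graph. The paths follow the Marquez REST API as commonly documented and should be checked against your deployed version; the host and port are assumed to be the ones used in the Spark settings above, and the dataset name is hypothetical.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Assumption: the Marquez API is reachable at the same host and port
# that the Spark listener reports to.
MARQUEZ_API = "http://ilum-marquez:9555/api/v1"


def jobs_url(namespace):
    """URL listing the jobs recorded in a namespace."""
    return f"{MARQUEZ_API}/namespaces/{quote(namespace, safe='')}/jobs"


def datasets_url(namespace):
    """URL listing the datasets recorded in a namespace."""
    return f"{MARQUEZ_API}/namespaces/{quote(namespace, safe='')}/datasets"


def dataset_versions_url(namespace, dataset):
    """URL listing the recorded versions of one dataset."""
    return f"{datasets_url(namespace)}/{quote(dataset, safe='')}/versions"


def lineage_url(namespace, dataset):
    """URL for the lineage graph centered on one dataset."""
    return f"{MARQUEZ_API}/lineage?" + urlencode({"nodeId": f"dataset:{namespace}:{dataset}"})


def fetch_json(url, timeout=10):
    """GET a URL and decode the JSON body (requires network access to Marquez)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


print(dataset_versions_url("ilum", "curated.orders"))
print(lineage_url("ilum", "curated.orders"))
```

Calling `fetch_json(dataset_versions_url("ilum", "curated.orders"))` from inside the cluster would return the version history, which is the information the Ilum UI presents graphically.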

Tips

Use namespaces

The default namespace for Ilum jobs is ilum, but you can change it by setting this Spark parameter in the job configuration:

spark.openlineage.transport.url=http://ilum-marquez:9555/api/v1/namespaces/yournamespacename
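If several teams or environments share one Marquez instance, the namespace part of the URL could be derived from an existing setting instead of being typed by hand. This is a small sketch using only the standard library; `with_namespace` is a hypothetical helper, and it assumes the URL ends with a `/namespaces/<name>` segment as in the parameter above.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def with_namespace(transport_url, namespace):
    """Return transport_url with its trailing /namespaces/<name> segment replaced."""
    parts = urlsplit(transport_url)
    prefix, marker, _old_namespace = parts.path.rpartition("/namespaces/")
    if not marker:
        raise ValueError("URL does not contain a /namespaces/ segment")
    # Percent-encode the name so characters such as spaces or slashes stay inside one segment.
    new_path = f"{prefix}/namespaces/{quote(namespace, safe='')}"
    return urlunsplit(parts._replace(path=new_path))


print(with_namespace("http://ilum-marquez:9555/api/v1/namespaces/ilum", "team_a"))
# -> http://ilum-marquez:9555/api/v1/namespaces/team_a
```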

Implement your custom Listener to control metadata

Spark sends metadata about jobs and datasets to the Marquez server through the OpenLineageSparkListener class, which is a Spark listener.

If you need more metadata than this listener provides, you can implement your own Spark listener and attach it to Spark sessions using:

spark.extraListeners=com.example.CustomListener

Pseudo Code for such class could look like this:

Open Lineage

Marquez