Lineage
What is Lineage?
Lineage is a set of technologies designed to track the relationships between Jobs and Datasets. OpenLineage defines an API through which your services submit metadata about these jobs, and Marquez implements a server that stores that metadata in a database.
With Lineage, you can observe key metadata, such as:
- Who uses a dataset as input
- Who modifies it
- Which job produced a specific version of a dataset
It's important to note that Lineage tracks metadata only; it does not provide access to the actual data itself.
In the context of Apache Spark, you can integrate Lineage using the External Spark Listener Class. This listener acts as an observer for key events such as job creation, execution, and dataset updates. When an event occurs, the listener sends the corresponding metadata to the Marquez server for tracking.
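To make the listener's role more concrete, the sketch below builds the kind of JSON event an OpenLineage client emits when a job run finishes. The top-level field names follow the OpenLineage run-event structure (eventType, eventTime, run, job, inputs, outputs, producer). The job name, dataset names, and producer URI are hypothetical placeholders, and the sketch leaves out fields such as schemaURL and facets that the real Spark listener also sends.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(event_type, job_namespace, job_name, inputs, outputs, run_id=None):
    """Build a trimmed-down OpenLineage-style run event as a plain dict.

    inputs and outputs are lists of (namespace, name) tuples describing datasets.
    """
    return {
        "eventType": event_type,  # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
        # Hypothetical URI identifying the service that emitted the event.
        "producer": "https://example.com/my-lineage-producer",
    }


event = build_run_event(
    "COMPLETE",
    job_namespace="ilum",
    job_name="daily_orders_aggregation",  # hypothetical job name
    inputs=[("s3://example-bucket", "raw/orders")],  # hypothetical datasets
    outputs=[("s3://example-bucket", "curated/orders_daily")],
)
print(json.dumps(event, indent=2))
```

Note that the event only names the datasets; it carries no rows or file contents, which is why Lineage exposes metadata rather than the data itself.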
Ilum Lineage
Ilum integrates Marquez into your architecture automatically, configures jobs to use the External Listener Class, and provides its own UI for observing data flow in your applications.
Ilum jobs with Marquez
Lineage is not enabled in Ilum by default, so you must enable it:
helm upgrade \
--set global.lineage.enabled=true \
--reuse-values ilum ilum/ilum
Usually you would need to manually create a database and configure Marquez to use it, but with Ilum this is done automatically.
Note: if you are using custom credentials for your PostgreSQL database, like this:
helm upgrade \
--set postgresql.auth.username=customuser \
--set postgresql.auth.password="CHOOSE PASSWORD" \
--reuse-values ilum ilum/ilum
You must also specify these credentials in the Marquez configuration, like this:
helm upgrade \
--set ilum-marquez.marquez.db.password="CHOOSE PASSWORD" \
--set ilum-marquez.marquez.db.user=customuser \
--reuse-values ilum ilum/ilum
Spark configuration
Without Ilum, you would need to add these settings to each of your Spark sessions:
spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener
spark.openlineage.transport.type=http
spark.openlineage.transport.url=http://ilum-marquez:9555/api/v1/namespaces/ilum
Ilum does this automatically for all your Ilum Jobs and also for SQL Viewer Spark sessions.
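For a Spark job launched outside Ilum, the three settings above could be passed on the command line. The following is a minimal sketch, assuming a standard spark-submit invocation with repeated `--conf key=value` arguments; the helper name `to_spark_submit_args` is hypothetical and not part of Ilum or OpenLineage.

```python
import shlex

# The three properties listed above, copied verbatim.
OPENLINEAGE_CONF = {
    "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
    "spark.openlineage.transport.type": "http",
    "spark.openlineage.transport.url": "http://ilum-marquez:9555/api/v1/namespaces/ilum",
}


def to_spark_submit_args(conf):
    """Render a dict of Spark properties as repeated `--conf key=value` arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


args = to_spark_submit_args(OPENLINEAGE_CONF)
# Quote each argument so the result can be pasted into a shell command.
print("spark-submit " + " ".join(shlex.quote(a) for a in args) + " your_job.py")
```

Here `your_job.py` is only a placeholder for the application being submitted.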
Ilum also provides its own UI, which is connected to the Marquez backend.
Usage
Lineage is a valuable tool for monitoring data flow within your application.
- It enables team members to easily identify which jobs modify a dataset and which jobs utilize it.
- Lineage provides insight into metadata associated with datasets.
- It allows you to review dataset versions and trace back to the specific job run that caused any changes. This functionality enhances your ability to troubleshoot and manage your systems effectively.
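The questions in the list above (which jobs write a dataset, which jobs read it, and which runs lie upstream of a change) come down to walking a graph of jobs and datasets. The sketch below illustrates that idea on a small in-memory example; the job and dataset names are hypothetical, and it does not reflect how Marquez stores or queries lineage internally.

```python
from collections import defaultdict

# Hypothetical job runs: (job name, input datasets, output datasets).
RUNS = [
    ("ingest_orders", [], ["raw.orders"]),
    ("clean_orders", ["raw.orders"], ["curated.orders"]),
    ("daily_report", ["curated.orders"], ["reports.daily_sales"]),
    ("fraud_scoring", ["curated.orders"], ["ml.fraud_scores"]),
]


def index_lineage(runs):
    """Map each dataset to the jobs that write it and the jobs that read it."""
    producers, consumers = defaultdict(set), defaultdict(set)
    for job, inputs, outputs in runs:
        for dataset in inputs:
            consumers[dataset].add(job)
        for dataset in outputs:
            producers[dataset].add(job)
    return producers, consumers


def upstream_jobs(dataset, runs):
    """Return every job that contributes, directly or indirectly, to a dataset."""
    inputs_of = {job: inputs for job, inputs, _ in runs}
    producers, _ = index_lineage(runs)
    seen, stack = set(), [dataset]
    while stack:
        for job in producers[stack.pop()]:
            if job not in seen:
                seen.add(job)
                stack.extend(inputs_of[job])  # keep walking towards the sources
    return seen


producers, consumers = index_lineage(RUNS)
print("readers of curated.orders:", sorted(consumers["curated.orders"]))
print("writers of curated.orders:", sorted(producers["curated.orders"]))
print("upstream of reports.daily_sales:", sorted(upstream_jobs("reports.daily_sales", RUNS)))
```

When a report looks wrong, the upstream walk gives the list of jobs worth inspecting first.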
Tips
Use namespaces
The default namespace for Ilum Jobs is ilum, but you can change it by setting this Spark parameter in the job configuration:
spark.openlineage.transport.url=http://ilum-marquez:9555/api/v1/namespaces/yournamespacename
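If several teams each use their own namespace, the URL can be generated rather than typed by hand. This is a small sketch under the assumption that the namespace is simply the last path segment of the URL shown above; the function name and the example namespaces are hypothetical, and percent-encoding is applied only as a precaution for unusual characters.

```python
from urllib.parse import quote

MARQUEZ_BASE_URL = "http://ilum-marquez:9555"  # host and port from the default URL


def namespace_transport_url(namespace, base_url=MARQUEZ_BASE_URL):
    """Build the transport URL for a namespace, percent-encoding unsafe characters."""
    if not namespace:
        raise ValueError("namespace must not be empty")
    return f"{base_url}/api/v1/namespaces/{quote(namespace, safe='')}"


# Hypothetical namespaces, one per group of jobs.
for ns in ["ilum", "analytics", "ml_training"]:
    print(f"spark.openlineage.transport.url={namespace_transport_url(ns)}")
```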
Implement your custom Listener to control metadata
Spark sends metadata about jobs and datasets to the Marquez server through the OpenLineageSparkListener class, which implements a Spark listener.
If you need to capture more metadata than this listener provides, you can implement your own Spark listener and attach it to Spark sessions using:
spark.extraListeners=com.example.CustomListener
Pseudocode for such a class could look like this: