Monitoring Apache Spark on Kubernetes
Effective observability is critical for maintaining the reliability and performance of distributed data processing systems. Ilum provides a multi-layered monitoring architecture for Apache Spark on Kubernetes, integrating native real-time insights with industry-standard open-source tools.
This guide covers the three pillars of Ilum's observability stack:
- Application Metrics: Real-time job statistics, JVM performance, and executor health via Ilum UI and Prometheus.
- Infrastructure Monitoring: Cluster resource utilization (CPU/Memory) via Kubernetes metrics.
- Distributed Logging: Centralized log aggregation and query capabilities using Loki and Promtail.
Ilum Native Monitoring
Ilum's built-in monitoring interface provides immediate, high-level visibility into job execution without requiring external tool configuration. It acts as the first line of defense for detecting job failures and performance regressions.
Real-time Job Metrics
For every Spark application managed by Ilum, the UI exposes granular telemetry covering resource allocation, task throughput, and memory consumption.
Key Monitoring Dimensions:
- Resource Allocation: Tracks the count of active executors, total cores assigned, and aggregated memory usage across the cluster. This helps verify if a job is receiving the requested Kubernetes resources.
- Task Throughput: Displays the lifecycle of Spark tasks—Running, Completed, Failed, and Interrupted. A high failure rate often indicates data quality issues or transient network faults.
- Memory & Data I/O: Monitors peak Heap/Off-Heap memory usage and Shuffle I/O (input/output bytes). Sudden spikes in Shuffle Write can indicate expensive `groupBy` or `join` operations.
- Executor Health: Detailed breakdown of memory pools (Execution vs. Storage) per executor, essential for tuning `spark.memory.fraction`.
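As a rough guide to how `spark.memory.fraction` shapes these pools, the unified memory model can be sketched in Python. The 300 MB reserved constant and the default fractions follow Spark's documented unified memory manager; treat the exact numbers as an approximation, not Ilum-specific behavior:

```python
# Sketch of Spark's unified memory model (approximate, based on the
# documented behavior of the unified memory manager).
RESERVED_MB = 300  # memory Spark reserves for internal objects

def memory_pools(heap_mb: float,
                 memory_fraction: float = 0.6,    # spark.memory.fraction
                 storage_fraction: float = 0.5):  # spark.memory.storageFraction
    """Return (unified, storage, execution) pool sizes in MB."""
    unified = (heap_mb - RESERVED_MB) * memory_fraction
    storage = unified * storage_fraction  # evictable cache region
    execution = unified - storage         # shuffles, joins, sorts
    return unified, storage, execution

unified, storage, execution = memory_pools(4096)  # 4 GiB executor heap
print(round(unified), round(storage), round(execution))  # → 2278 1139 1139
```

Raising `spark.memory.fraction` grows both pools at the expense of user-data headroom on the heap, which is why the per-executor Execution vs. Storage breakdown is worth watching before tuning it.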


Execution Timeline Analysis
Understanding the temporal structure of a Spark application is vital for performance optimization. Ilum's Timeline module visualizes the execution flow, distinguishing between high-level Spark Jobs and their constituent Stages.
- Spark Jobs: Triggered by actions (e.g., `count()`, `collect()`, `save()`).
- Stages: Boundaries created by Shuffle operations (data redistribution).
The Timeline view allows engineers to identify:
- Long-tail Stages: Stages that take disproportionately long to complete.
- Straggler Tasks: Individual tasks that delay the completion of a stage, often caused by data skew.
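The straggler heuristic is simple enough to sketch: compare the slowest task in a stage against the median. The 10x threshold mirrors the rule of thumb used in the troubleshooting section; the function name and data are illustrative, not part of the Ilum API:

```python
from statistics import median

def find_straggler_stages(stage_tasks: dict[str, list[float]],
                          skew_ratio: float = 10.0) -> list[str]:
    """Flag stages whose slowest task runs `skew_ratio`x longer than the median.

    stage_tasks maps a stage name to its task durations in seconds.
    """
    flagged = []
    for stage, durations in stage_tasks.items():
        if durations and max(durations) >= skew_ratio * median(durations):
            flagged.append(stage)
    return flagged

# One partition holding far more data than the rest shows up immediately:
tasks = {
    "stage-1 (scan)":    [2.0, 2.1, 1.9, 2.2],
    "stage-2 (shuffle)": [3.0, 2.8, 3.1, 45.0],  # straggler task
}
print(find_straggler_stages(tasks))  # → ['stage-2 (shuffle)']
```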

Cluster Resource Utilization
Ilum aggregates metrics from the underlying Kubernetes cluster to show total resource consumption. This view is crucial for capacity planning and ensuring that the cluster has sufficient headroom for pending workloads.

Apache Spark UI & History Server
While Ilum provides a consolidated view, the native Spark UI remains the standard for deep-dive DAG (Directed Acyclic Graph) analysis and SQL query planning.
Accessing the Spark UI
For running jobs, the Spark UI is proxied directly through the Ilum interface. For completed applications, Ilum bundles a Spark History Server.
Persistent Event Logging
The History Server relies on Spark Event Logs stored on a persistent volume. This integration is configured via Helm parameters:
```yaml
ilum-core:
  historyServer:
    enabled: true
    # Ensures event logs persist across restarts
    volume:
      enabled: true
      size: 10Gi
```
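The History Server can only replay applications whose drivers wrote event logs. Ilum wires this up for managed jobs; for reference, the standard Spark properties involved look roughly like this (the paths are illustrative):

```properties
# Driver side: write event logs (illustrative path)
spark.eventLog.enabled=true
spark.eventLog.dir=file:///data/spark-events

# History Server side: read from the same location
spark.history.fs.logDirectory=file:///data/spark-events
```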
Note: To disable the History Server (e.g., in resource-constrained environments), run:

```shell
helm upgrade --set ilum-core.historyServer.enabled=false --reuse-values ilum ilum/ilum
```
You can access the History Server directly via port-forwarding:
```shell
kubectl port-forward svc/ilum-history-server 9666:18080
```
Advanced Monitoring with Prometheus
For production environments, Ilum integrates with the Prometheus ecosystem to scrape, store, and alert on time-series metrics. This integration uses the PrometheusServlet sink in Spark to expose metrics in a format compatible with Prometheus scraping.
Architecture & Configuration
Ilum automatically configures the PodMonitor custom resource (part of the Prometheus Operator) to discover Spark driver and executor pods.
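A PodMonitor generated for Spark pods is roughly equivalent to the following sketch. The names and label selectors are illustrative; Ilum manages the actual resource:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: spark-jobs            # illustrative name
spec:
  selector:
    matchLabels:
      spark-role: driver      # illustrative label; Ilum sets its own
  podMetricsEndpoints:
    - port: ui                # Spark UI port exposing the servlet sink
      path: /metrics/prometheus/
```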
Enabling Prometheus Integration:
The Prometheus stack is optional. Enable it via Helm:
```shell
helm upgrade \
  --set kube-prometheus-stack.enabled=true \
  --set ilum-core.job.prometheus.enabled=true \
  --reuse-values ilum ilum/ilum
```
This configuration injects the following Spark properties into every job:
```properties
spark.ui.prometheus.enabled=true
spark.metrics.conf.*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
spark.metrics.conf.*.sink.prometheusServlet.path=/metrics/prometheus/
spark.executor.processTreeMetrics.enabled=true
```
Critical Spark Metrics to Monitor
When monitoring Spark on Kubernetes, focus on these high-signal metrics:
| Metric Category | Metric Name (PromQL Pattern) | Why it matters |
|---|---|---|
| Memory | metrics_executor_heapMemoryUsed_bytes | High heap usage (>90%) correlates with frequent Full GC and OOM risks. |
| Garbage Collection | metrics_executor_jvm_G1_Young_Generation_count | Frequent GC pauses freeze execution threads, reducing throughput. |
| Shuffle I/O | metrics_shuffle_read_bytes_total | Massive shuffle reads indicate network-heavy operations (joins/grouping). |
| CPU/Tasks | metrics_executor_threadpool_activeTasks | Should match the number of allocated cores. Low utilization implies resource waste. |
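With these metrics in Prometheus, alerting is a small step. A hedged sketch of a PrometheusRule for the heap-usage threshold from the table; metric and label names must match what your Spark version actually exports, and the resource name is illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: spark-executor-memory   # illustrative name
spec:
  groups:
    - name: spark-memory
      rules:
        - alert: ExecutorHeapNearLimit
          # Fires when heap usage stays above 90% of max for 5 minutes
          expr: |
            metrics_executor_heapMemoryUsed_bytes
              / metrics_executor_maxMemory_bytes > 0.9
          for: 5m
          labels:
            severity: warning
```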
Querying with PromQL
Access the Prometheus UI via port-forward:
```shell
kubectl port-forward svc/prometheus-operated 9090:9090
```
Example PromQL Queries:
- Max Memory Usage per Job:
  `max(metrics_executor_maxMemory_bytes{namespace="ilum"}) by (pod)`
- Total Active Tasks in Cluster:
  `sum(metrics_executor_threadpool_activeTasks)`
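Counters such as the GC count from the table above are most useful as rates rather than raw values. A sketch (the window size is a matter of taste):

```promql
# Young-generation GC events per second over the last 5 minutes, per pod
sum(rate(metrics_executor_jvm_G1_Young_Generation_count{namespace="ilum"}[5m])) by (pod)
```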


Visualization with Grafana
Ilum leverages Grafana for dashboarding, providing pre-built visualizations that correlate Spark metrics with Kubernetes infrastructure metrics.
Default Dashboards
Ilum ships with a suite of dashboards located in the Ilum folder:
- Ilum Spark Job Overview: High-level health check of all running jobs (Success/Failure rates, Total Cluster Memory).
- Ilum Spark Job Detail: Deep dive into a single application id. Correlates Driver vs. Executor memory usage.
- Ilum Pod Monitor: Infrastructure metrics for the Ilum control plane (CPU throttling, Memory limits).


Accessing Grafana
```shell
kubectl port-forward svc/ilum-grafana 8080:80
# Default credentials: admin / prom-operator
```
Distributed Logging with Loki
Logs are the source of truth for debugging failures. Ilum integrates Loki (log aggregation) and Promtail (log shipping) to centralize logs from ephemeral Spark executors.
Log Architecture
- Promtail: Runs as a DaemonSet on every Kubernetes node, tailing stdout/stderr from containers and shipping them to Loki.
- Loki: Stores compressed logs in object storage (MinIO) and indexes metadata (labels).
- LogQL: A query language inspired by PromQL for filtering logs.
Enable Logging Stack:
```shell
helm upgrade ilum ilum/ilum \
  --set global.logAggregation.enabled=true \
  --set global.logAggregation.loki.enabled=true \
  --set global.logAggregation.promtail.enabled=true \
  --reuse-values
```
Effective LogQL for Spark
Effective debugging requires filtering noise. Use these LogQL patterns to find root causes:
- Find specific Exception traces (excluding Spark-internal stack frames):
  `{namespace="ilum"} |= "Exception" != "at org.apache.spark"`
- Isolate logs for a specific Spark application:
  `{app="job-20241107-1313-driver"}`
- Detect Container OOM Kills (System Logs):
  `{namespace="ilum"} |= "OOMKilled"`

Troubleshooting Guide
1. Detecting OutOfMemory (OOM) Errors
OOM errors in Spark on Kubernetes usually manifest in two ways:
- Java OOM (`java.lang.OutOfMemoryError`): The JVM runs out of Heap space.
  - Diagnosis: Search logs for `java.lang.OutOfMemoryError: Java heap space`.
  - Fix: Increase `spark.executor.memory`.
- Container OOM (Exit Code 137): The OS kills the container because it exceeded the Kubernetes memory limit (Off-heap usage + Overhead).
  - Diagnosis: Sudden disappearance of metrics in Grafana coupled with `Reason: OOMKilled` in Kubernetes events.
  - Fix: Increase `spark.kubernetes.memoryOverheadFactor` or `spark.executor.memoryOverhead`.
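The container limit that triggers Exit Code 137 is the executor heap plus the overhead, which Spark derives as the larger of an absolute floor (384 MiB) and a fraction of the heap (0.10 by default for JVM jobs). A sketch of that arithmetic, useful for sizing limits before Kubernetes does it for you; treat it as an approximation of Spark's sizing, not an exact reimplementation:

```python
def container_memory_mb(executor_memory_mb: int,
                        overhead_factor: float = 0.10,  # spark.kubernetes.memoryOverheadFactor
                        min_overhead_mb: int = 384) -> int:
    """Approximate the pod memory a Spark executor requests on Kubernetes.

    Overhead covers off-heap usage: JVM internals, thread stacks, buffers.
    """
    overhead = max(min_overhead_mb, int(executor_memory_mb * overhead_factor))
    return executor_memory_mb + overhead

print(container_memory_mb(4096))  # 4 GiB heap + ~410 MiB overhead → 4505
print(container_memory_mb(2048))  # small heaps hit the 384 MiB floor → 2432
```

This is why a job can pass with `spark.executor.memory=4g` on bare metal yet get OOMKilled on Kubernetes: the kernel enforces the heap-plus-overhead total, not the heap alone.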
2. Identifying Data Skew
Data skew occurs when one partition contains significantly more data than others, causing a single task to run much longer than the rest ("Straggler").
- Diagnosis: In the Ilum Timeline, look for a stage where the Max Task Duration is 10x+ higher than the Median Task Duration.
- Fix: Use `repartition()` to redistribute data or enable Adaptive Query Execution (`spark.sql.adaptive.enabled=true`).
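Adaptive Query Execution can handle the skewed-join case automatically by splitting oversized partitions at runtime. The relevant Spark 3.x properties look like this; defaults vary by version, so treat the thresholds as illustrative:

```properties
spark.sql.adaptive.enabled=true
spark.sql.adaptive.skewJoin.enabled=true
# A partition counts as "skewed" if it exceeds both this factor x the
# median partition size...
spark.sql.adaptive.skewJoin.skewedPartitionFactor=5
# ...and this absolute size threshold
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=256MB
```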
Legacy Support: Graphite
For environments already standardized on Graphite, Ilum supports pushing metrics via the GraphiteSink.
Enable Graphite Exporter:
```shell
helm upgrade ilum ilum/ilum \
  --set ilum-core.job.graphite.enabled=true \
  --set graphite-exporter.graphite.enabled=true \
  --reuse-values
```
