Monitoring Apache Spark on Kubernetes
Effective observability is critical for maintaining the reliability and performance of distributed data processing systems. Ilum provides a multi-layered monitoring architecture for Apache Spark on Kubernetes, integrating native real-time insights with industry-standard open-source tools.
This guide covers the three pillars of Ilum's observability stack:
- Application Metrics: Real-time job statistics, JVM performance, and executor health via Ilum UI and Prometheus.
- Infrastructure Monitoring: Cluster resource utilization (CPU/Memory) via Kubernetes metrics.
- Distributed Logging: Centralized log aggregation and query capabilities using Loki and Promtail.
Ilum Native Monitoring
Ilum's built-in monitoring interface provides immediate, high-level visibility into job execution without requiring external tool configuration. It acts as the first line of defense for detecting job failures and performance regressions.
Real-time Job Metrics
For every Spark application managed by Ilum, the UI exposes granular telemetry covering resource allocation, task throughput, and memory consumption.
Key Monitoring Dimensions:
- Resource Allocation: Tracks the count of active executors, total cores assigned, and aggregated memory usage across the cluster. This helps verify if a job is receiving the requested Kubernetes resources.
- Task Throughput: Displays the lifecycle of Spark tasks—Running, Completed, Failed, and Interrupted. A high failure rate often indicates data quality issues or transient network faults.
- Memory & Data I/O: Monitors peak Heap/Off-Heap memory usage and Shuffle I/O (input/output bytes). Sudden spikes in Shuffle Write can indicate expensive `groupBy` or `join` operations.
- Executor Health: Detailed breakdown of memory pools (Execution vs. Storage) per executor, essential for tuning `spark.memory.fraction`.
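As a rough guide to how `spark.memory.fraction` shapes these pools, the unified memory model can be sketched in Python. The 300 MB reserved constant and the default fractions follow Spark's documented unified memory manager; treat the exact numbers as an approximation, not Ilum-specific behavior:

```python
# Sketch of Spark's unified memory model (approximate, based on the
# documented behavior of the unified memory manager).
RESERVED_MB = 300  # memory Spark reserves for internal objects

def memory_pools(heap_mb: float,
                 memory_fraction: float = 0.6,    # spark.memory.fraction
                 storage_fraction: float = 0.5):  # spark.memory.storageFraction
    """Return (unified, storage, execution) pool sizes in MB."""
    unified = (heap_mb - RESERVED_MB) * memory_fraction
    storage = unified * storage_fraction  # evictable cache region
    execution = unified - storage         # shuffles, joins, sorts
    return unified, storage, execution

unified, storage, execution = memory_pools(4096)  # 4 GiB executor heap
print(round(unified), round(storage), round(execution))  # → 2278 1139 1139
```

Raising `spark.memory.fraction` grows both pools at the expense of user-data headroom on the heap, which is why the per-executor Execution vs. Storage breakdown is worth watching before tuning it.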


Execution Timeline Analysis
Understanding the temporal structure of a Spark application is vital for performance optimization. Ilum's Timeline module visualizes the execution flow, distinguishing between high-level Spark Jobs and their constituent Stages.
- Spark Jobs: Triggered by actions (e.g., `count()`, `collect()`, `save()`).
- Stages: Boundaries created by Shuffle operations (data redistribution).
The Timeline view allows engineers to identify:
- Long-tail Stages: Stages that take disproportionately long to complete.
- Straggler Tasks: Individual tasks that delay the completion of a stage, often caused by data skew.
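The straggler heuristic is simple enough to sketch: compare the slowest task in a stage against the median. The 10x threshold mirrors the rule of thumb used in the troubleshooting section; the function name and data are illustrative, not part of the Ilum API:

```python
from statistics import median

def find_straggler_stages(stage_tasks: dict[str, list[float]],
                          skew_ratio: float = 10.0) -> list[str]:
    """Flag stages whose slowest task runs `skew_ratio`x longer than the median.

    stage_tasks maps a stage name to its task durations in seconds.
    """
    flagged = []
    for stage, durations in stage_tasks.items():
        if durations and max(durations) >= skew_ratio * median(durations):
            flagged.append(stage)
    return flagged

# One partition holding far more data than the rest shows up immediately:
tasks = {
    "stage-1 (scan)":    [2.0, 2.1, 1.9, 2.2],
    "stage-2 (shuffle)": [3.0, 2.8, 3.1, 45.0],  # straggler task
}
print(find_straggler_stages(tasks))  # → ['stage-2 (shuffle)']
```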

Cluster Resource Utilization
Ilum aggregates metrics from the underlying Kubernetes cluster to show total resource consumption. This view is crucial for capacity planning and ensuring that the cluster has sufficient headroom for pending workloads.

Apache Spark UI & History Server
While Ilum provides a consolidated view, the native Spark UI remains the standard for deep-dive DAG (Directed Acyclic Graph) analysis and SQL query planning.
Accessing the Spark UI
For running jobs, the Spark UI is proxied directly through the Ilum interface. For completed applications, Ilum bundles a Spark History Server.
Persistent Event Logging
The History Server relies on Spark Event Logs stored on a persistent volume. This integration is configured via Helm parameters:
```yaml
ilum-core:
  historyServer:
    enabled: true
    # Ensures event logs persist across restarts
    volume:
      enabled: true
      size: 10Gi
```
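The History Server can only replay applications whose drivers wrote event logs. Ilum wires this up for managed jobs; for reference, the standard Spark properties involved look roughly like this (the paths are illustrative):

```properties
# Driver side: write event logs (illustrative path)
spark.eventLog.enabled=true
spark.eventLog.dir=file:///data/spark-events

# History Server side: read from the same location
spark.history.fs.logDirectory=file:///data/spark-events
```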
Note: To disable the History Server (e.g., in resource-constrained environments), run:

```shell
helm upgrade --set ilum-core.historyServer.enabled=false --reuse-values ilum ilum/ilum
```
You can access the History Server directly via port-forwarding:
```shell
kubectl port-forward svc/ilum-history-server 9666:18080
```
Advanced Monitoring with Prometheus
For production environments, Ilum integrates with the Prometheus ecosystem to scrape, store, and alert on time-series metrics. This integration uses the PrometheusServlet sink in Spark to expose metrics in a format compatible with Prometheus scraping.
Architecture & Configuration
Ilum automatically configures the PodMonitor custom resource (part of the Prometheus Operator) to discover Spark driver and executor pods.
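A PodMonitor generated for Spark pods is roughly equivalent to the following sketch. The names and label selectors are illustrative; Ilum manages the actual resource:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: spark-jobs            # illustrative name
spec:
  selector:
    matchLabels:
      spark-role: driver      # illustrative label; Ilum sets its own
  podMetricsEndpoints:
    - port: ui                # Spark UI port exposing the servlet sink
      path: /metrics/prometheus/
```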
Enabling Prometheus Integration:
The Prometheus stack is optional. Enable it via Helm:
```shell
helm upgrade \
  --set kube-prometheus-stack.enabled=true \
  --set ilum-core.job.prometheus.enabled=true \
  --reuse-values ilum ilum/ilum
```
This configuration injects the following Spark properties into every job:
```properties
spark.ui.prometheus.enabled=true
spark.metrics.conf.*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
spark.metrics.conf.*.sink.prometheusServlet.path=/metrics/prometheus/
spark.executor.processTreeMetrics.enabled=true
```
Critical Spark Metrics to Monitor
When monitoring Spark on Kubernetes, focus on these high-signal metrics:
| Metric Category | Metric Name (PromQL Pattern) | Why it matters |
|---|---|---|
| Memory | metrics_executor_heapMemoryUsed_bytes | High heap usage (>90%) correlates with frequent Full GC and OOM risks. |
| Garbage Collection | metrics_executor_jvm_G1_Young_Generation_count | Frequent GC pauses freeze execution threads, reducing throughput. |
| Shuffle I/O | metrics_shuffle_read_bytes_total | Massive shuffle reads indicate network-heavy operations (joins/grouping). |
| CPU/Tasks | metrics_executor_threadpool_activeTasks | Should match the number of allocated cores. Low utilization implies resource waste. |
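With these metrics in Prometheus, alerting is a small step. A hedged sketch of a PrometheusRule for the heap-usage threshold from the table; metric and label names must match what your Spark version actually exports, and the resource name is illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: spark-executor-memory   # illustrative name
spec:
  groups:
    - name: spark-memory
      rules:
        - alert: ExecutorHeapNearLimit
          # Fires when heap usage stays above 90% of max for 5 minutes
          expr: |
            metrics_executor_heapMemoryUsed_bytes
              / metrics_executor_maxMemory_bytes > 0.9
          for: 5m
          labels:
            severity: warning
```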
Querying with PromQL
Access the Prometheus UI via port-forward:
```shell
kubectl port-forward svc/prometheus-operated 9090:9090
```
Example PromQL Queries:
- Max Memory Usage per Job:
  `max(metrics_executor_maxMemory_bytes{namespace="ilum"}) by (pod)`
- Total Active Tasks in Cluster:
  `sum(metrics_executor_threadpool_activeTasks)`
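Counters such as the GC count from the table above are most useful as rates rather than raw values. A sketch (the window size is a matter of taste):

```promql
# Young-generation GC events per second over the last 5 minutes, per pod
sum(rate(metrics_executor_jvm_G1_Young_Generation_count{namespace="ilum"}[5m])) by (pod)
```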


Visualization with Grafana
Ilum leverages Grafana for dashboarding, providing pre-built visualizations that correlate Spark metrics with Kubernetes infrastructure metrics.
Default Dashboards
Ilum ships with a suite of dashboards located in the Ilum folder:
- Ilum Spark Job Overview: High-level health check of all running jobs (Success/Failure rates, Total Cluster Memory).
- Ilum Spark Job Detail: Deep dive into a single application id. Correlates Driver vs. Executor memory usage.
- Ilum Pod Monitor: Infrastructure metrics for the Ilum control plane (CPU throttling, Memory limits).


Accessing Grafana
```shell
kubectl port-forward svc/ilum-grafana 8080:80
# Default credentials: admin / prom-operator
```
Distributed Logging with Loki
Logs are the source of truth for debugging failures. Ilum integrates Loki (log aggregation) and Promtail (log shipping) to centralize logs from ephemeral Spark executors.
Log Architecture
- Promtail: Runs as a DaemonSet on every Kubernetes node, tailing stdout/stderr from containers and shipping them to Loki.
- Loki: Stores compressed logs in object storage (MinIO) and indexes metadata (labels).
- LogQL: A query language inspired by PromQL for filtering logs.
Enable Logging Stack:
```shell
helm upgrade ilum ilum/ilum \
  --set global.logAggregation.enabled=true \
  --set global.logAggregation.loki.enabled=true \
  --set global.logAggregation.promtail.enabled=true \
  --reuse-values
```
Effective LogQL for Spark
Effective debugging requires filtering noise. Use these LogQL patterns to find root causes:
- Find specific Exception traces (excluding Spark-internal stack frames):
  `{namespace="ilum"} |= "Exception" != "at org.apache.spark"`
- Isolate logs for a specific Spark application:
  `{app="job-20241107-1313-driver"}`
- Detect Container OOM Kills (System Logs):
  `{namespace="ilum"} |= "OOMKilled"`

Troubleshooting Guide
1. Detecting OutOfMemory (OOM) Errors
OOM errors in Spark on Kubernetes usually manifest in two ways:
- Java OOM (`java.lang.OutOfMemoryError`): The JVM runs out of Heap space.
  - Diagnosis: Search logs for `java.lang.OutOfMemoryError: Java heap space`.
  - Fix: Increase `spark.executor.memory`.
- Container OOM (Exit Code 137): The OS kills the container because it exceeded the Kubernetes memory limit (Off-heap usage + Overhead).
  - Diagnosis: Sudden disappearance of metrics in Grafana coupled with `Reason: OOMKilled` in Kubernetes events.
  - Fix: Increase `spark.kubernetes.memoryOverheadFactor` or `spark.executor.memoryOverhead`.
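The container limit that triggers Exit Code 137 is the executor heap plus the overhead, which Spark derives as the larger of an absolute floor (384 MiB) and a fraction of the heap (0.10 by default for JVM jobs). A sketch of that arithmetic, useful for sizing limits before Kubernetes does it for you; treat it as an approximation of Spark's sizing, not an exact reimplementation:

```python
def container_memory_mb(executor_memory_mb: int,
                        overhead_factor: float = 0.10,  # spark.kubernetes.memoryOverheadFactor
                        min_overhead_mb: int = 384) -> int:
    """Approximate the pod memory a Spark executor requests on Kubernetes.

    Overhead covers off-heap usage: JVM internals, thread stacks, buffers.
    """
    overhead = max(min_overhead_mb, int(executor_memory_mb * overhead_factor))
    return executor_memory_mb + overhead

print(container_memory_mb(4096))  # 4 GiB heap + ~410 MiB overhead → 4505
print(container_memory_mb(2048))  # small heaps hit the 384 MiB floor → 2432
```

This is why a job can pass with `spark.executor.memory=4g` on bare metal yet get OOMKilled on Kubernetes: the kernel enforces the heap-plus-overhead total, not the heap alone.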
2. Identifying Data Skew
Data skew occurs when one partition contains significantly more data than others, causing a single task to run much longer than the rest ("Straggler").
- Diagnosis: In the Ilum Timeline, look for a stage where the Max Task Duration is 10x+ higher than the Median Task Duration.
- Fix: Use `repartition()` to redistribute data or enable Adaptive Query Execution (`spark.sql.adaptive.enabled=true`).
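Adaptive Query Execution can handle the skewed-join case automatically by splitting oversized partitions at runtime. The relevant Spark 3.x properties look like this; defaults vary by version, so treat the thresholds as illustrative:

```properties
spark.sql.adaptive.enabled=true
spark.sql.adaptive.skewJoin.enabled=true
# A partition counts as "skewed" if it exceeds both this factor x the
# median partition size...
spark.sql.adaptive.skewJoin.skewedPartitionFactor=5
# ...and this absolute size threshold
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=256MB
```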
Legacy Support: Graphite
For environments already standardized on Graphite, Ilum supports pushing metrics via the GraphiteSink.
Enable Graphite Exporter:
```shell
helm upgrade ilum ilum/ilum \
  --set ilum-core.job.graphite.enabled=true \
  --set graphite-exporter.graphite.enabled=true \
  --reuse-values
```
