Monitoring Apache Spark on Kubernetes

Effective observability is critical for maintaining the reliability and performance of distributed data processing systems. Ilum provides a multi-layered monitoring architecture for Apache Spark on Kubernetes, integrating native real-time insights with industry-standard open-source tools.

This guide covers the three pillars of Ilum's observability stack:

  1. Application Metrics: Real-time job statistics, JVM performance, and executor health via Ilum UI and Prometheus.
  2. Infrastructure Monitoring: Cluster resource utilization (CPU/Memory) via Kubernetes metrics.
  3. Distributed Logging: Centralized log aggregation and query capabilities using Loki and Promtail.

Ilum Native Monitoring

Ilum's built-in monitoring interface provides immediate, high-level visibility into job execution without requiring external tool configuration. It acts as the first line of defense for detecting job failures and performance regressions.

Real-time Job Metrics

For every Spark application managed by Ilum, the UI exposes granular telemetry covering resource allocation, task throughput, and memory consumption.

Key Monitoring Dimensions:

  • Resource Allocation: Tracks the count of active executors, total cores assigned, and aggregated memory usage across the cluster. This helps verify if a job is receiving the requested Kubernetes resources.
  • Task Throughput: Displays the lifecycle of Spark tasks—Running, Completed, Failed, and Interrupted. A high failure rate often indicates data quality issues or transient network faults.
  • Memory & Data I/O: Monitors peak Heap/Off-Heap memory usage and Shuffle I/O (input/output bytes). Sudden spikes in Shuffle Write can indicate expensive groupBy or join operations.
  • Executor Health: Detailed breakdown of memory pools (Execution vs. Storage) per executor, essential for tuning spark.memory.fraction.
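The Execution vs. Storage split shown per executor follows Spark's unified memory model. A minimal sketch of how those pool sizes are derived (plain Python, not a Spark API; the ~300 MiB reserved region and the 0.6 / 0.5 defaults are Spark's documented values):

```python
# Sketch of Spark's unified memory model (illustrative, not an official API).
# Defaults: spark.memory.fraction=0.6, spark.memory.storageFraction=0.5,
# plus a fixed ~300 MiB reserved region, per the Spark tuning docs.

RESERVED_MIB = 300

def unified_memory_pools(executor_memory_mib: int,
                         memory_fraction: float = 0.6,
                         storage_fraction: float = 0.5) -> dict:
    """Return the approximate sizes (MiB) of Spark's memory regions."""
    usable = executor_memory_mib - RESERVED_MIB
    unified = usable * memory_fraction      # shared execution + storage pool
    storage = unified * storage_fraction    # evictable cache region
    execution = unified - storage           # shuffle/join/sort buffers
    user = usable - unified                 # user data structures, UDFs
    return {"unified": unified, "storage": storage,
            "execution": execution, "user": user}

pools = unified_memory_pools(4096)  # a 4 GiB executor heap
print({k: round(v) for k, v in pools.items()})
```

Raising `spark.memory.fraction` grows the unified pool at the expense of user memory; the storage half is evictable, so execution can borrow from it under pressure.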

Ilum Job Overview

Ilum Executor Statistics

Execution Timeline Analysis

Understanding the temporal structure of a Spark application is vital for performance optimization. Ilum's Timeline module visualizes the execution flow, distinguishing between high-level Spark Jobs and their constituent Stages.

  • Spark Jobs: Triggered by actions (e.g., count(), collect(), save()).
  • Stages: Boundaries created by Shuffle operations (data redistribution).

The Timeline view allows engineers to identify:

  • Long-tail Stages: Stages that take disproportionately long to complete.
  • Straggler Tasks: Individual tasks that delay the completion of a stage, often caused by data skew.
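The straggler check can be automated over task durations exported from the timeline. A hypothetical sketch of the heuristic (flag tasks far above the stage median):

```python
from statistics import median

def find_stragglers(task_durations_s, threshold=10.0):
    """Flag tasks whose duration exceeds `threshold` times the stage median."""
    med = median(task_durations_s)
    return [i for i, d in enumerate(task_durations_s) if d > threshold * med]

# Typical skewed stage: most tasks finish in ~5 s, one drags on for 90 s.
durations = [5.1, 4.8, 5.3, 5.0, 90.0, 4.9]
print(find_stragglers(durations))  # → [4]
```

The threshold is a judgment call; a median-based cutoff is robust to a few outliers, whereas a mean-based one would be dragged up by the stragglers themselves.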

Ilum Job Timeline

Cluster Resource Utilization

Ilum aggregates metrics from the underlying Kubernetes cluster to show total resource consumption. This view is crucial for capacity planning and ensuring that the cluster has sufficient headroom for pending workloads.

Cluster Resource Overview


Apache Spark UI & History Server

While Ilum provides a consolidated view, the native Spark UI remains the standard for deep-dive DAG (Directed Acyclic Graph) analysis and SQL query planning.

Accessing the Spark UI

For running jobs, the Spark UI is proxied directly through the Ilum interface. For completed applications, Ilum bundles a Spark History Server.

Persistent Event Logging

The History Server relies on Spark Event Logs stored on a persistent volume. This integration is configured via Helm parameters:

ilum-core:
  historyServer:
    enabled: true
    # Ensures logs persist across restarts
    volume:
      enabled: true
      size: 10Gi

Note: To disable the History Server (e.g., in resource-constrained environments), run:

helm upgrade --set ilum-core.historyServer.enabled=false --reuse-values ilum ilum/ilum

You can access the History Server directly via port-forwarding:

kubectl port-forward svc/ilum-history-server 9666:18080

Advanced Monitoring with Prometheus

For production environments, Ilum integrates with the Prometheus ecosystem to scrape, store, and alert on time-series metrics. This integration uses the PrometheusServlet sink in Spark to expose metrics in a format compatible with Prometheus scraping.

Architecture & Configuration

Ilum automatically configures the PodMonitor custom resource (part of the Prometheus Operator) to discover Spark driver and executor pods.

Enabling Prometheus Integration:

The Prometheus stack is optional. Enable it via Helm:

helm upgrade \
--set kube-prometheus-stack.enabled=true \
--set ilum-core.job.prometheus.enabled=true \
--reuse-values ilum ilum/ilum

This configuration injects the following Spark properties into every job:

spark.ui.prometheus.enabled=true
spark.metrics.conf.*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
spark.metrics.conf.*.sink.prometheusServlet.path=/metrics/prometheus/
spark.executor.processTreeMetrics.enabled=true

Critical Spark Metrics to Monitor

When monitoring Spark on Kubernetes, focus on these high-signal metrics:

| Metric Category | Metric Name (PromQL Pattern) | Why it matters |
| --- | --- | --- |
| Memory | metrics_executor_heapMemoryUsed_bytes | High heap usage (>90%) correlates with frequent Full GC and OOM risks. |
| Garbage Collection | metrics_executor_jvm_G1_Young_Generation_count | Frequent GC pauses freeze execution threads, reducing throughput. |
| Shuffle I/O | metrics_shuffle_read_bytes_total | Massive shuffle reads indicate network-heavy operations (joins/grouping). |
| CPU/Tasks | metrics_executor_threadpool_activeTasks | Should roughly match the number of allocated cores; sustained low values imply wasted resources. |
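Metrics like these are most useful when wired into alerts. A hedged example of a Prometheus Operator PrometheusRule firing on sustained high heap usage; the metric names follow the table above, but the 90% threshold and the group/alert/resource names are illustrative choices, not Ilum defaults:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: spark-executor-alerts   # illustrative name
spec:
  groups:
    - name: spark-memory
      rules:
        - alert: SparkExecutorHeapHigh
          # Fires when an executor heap stays above 90% of its max for 5m.
          expr: |
            metrics_executor_heapMemoryUsed_bytes
              / metrics_executor_maxMemory_bytes > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Executor {{ $labels.pod }} heap above 90%"
```

The `for: 5m` clause suppresses alerts on short-lived spikes, which are routine during shuffles.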

Querying with PromQL

Access the Prometheus UI via port-forward:

kubectl port-forward svc/prometheus-operated 9090:9090

Example PromQL Queries:

  • Max Memory Usage per Job:

    max(metrics_executor_maxMemory_bytes{namespace="ilum"}) by (pod)
  • Total Active Tasks in Cluster:

    sum(metrics_executor_threadpool_activeTasks)

Prometheus Metrics

Prometheus Chart


Visualization with Grafana

Ilum leverages Grafana for dashboarding, providing pre-built visualizations that correlate Spark metrics with Kubernetes infrastructure metrics.

Default Dashboards

Ilum ships with a suite of dashboards located in the Ilum folder:

  1. Ilum Spark Job Overview: High-level health check of all running jobs (Success/Failure rates, Total Cluster Memory).
  2. Ilum Spark Job Detail: Deep dive into a single application (by ID), correlating Driver vs. Executor memory usage.
  3. Ilum Pod Monitor: Infrastructure metrics for the Ilum control plane (CPU throttling, Memory limits).

Grafana Job Overview

Grafana Charts

Accessing Grafana

kubectl port-forward svc/ilum-grafana 8080:80
# Default Credentials: admin / prom-operator

Distributed Logging with Loki

Logs are the source of truth for debugging failures. Ilum integrates Loki (log aggregation) and Promtail (log shipping) to centralize logs from ephemeral Spark executors.

Log Architecture

  • Promtail: Runs as a DaemonSet on every Kubernetes node, tailing stdout/stderr from containers and shipping them to Loki.
  • Loki: Stores compressed logs in object storage (MinIO) and indexes metadata (labels).
  • LogQL: A query language inspired by PromQL for filtering logs.

Enable Logging Stack:

helm upgrade ilum ilum/ilum \
--set global.logAggregation.enabled=true \
--set global.logAggregation.loki.enabled=true \
--set global.logAggregation.promtail.enabled=true \
--reuse-values

Effective LogQL for Spark

Effective debugging requires filtering noise. Use these LogQL patterns to find root causes:

  • Find specific Exception traces:

    {namespace="ilum"} |= "Exception" != "at org.apache.spark"
  • Isolate logs for a specific Spark Application:

    {app="job-20241107-1313-driver"} 
  • Detect Container OOM Kills (System Logs):

    {namespace="ilum"} |= "OOMKilled"

Loki Log Query


Troubleshooting Guide

1. Detecting OutOfMemory (OOM) Errors

OOM errors in Spark on Kubernetes usually manifest in two ways:

  1. Java OOM (java.lang.OutOfMemoryError): The JVM runs out of Heap space.
    • Diagnosis: Search logs for java.lang.OutOfMemoryError: Java heap space.
    • Fix: Increase spark.executor.memory.
  2. Container OOM (Exit Code 137): The OS kills the container because it exceeded the Kubernetes memory limit (Off-heap usage + Overhead).
    • Diagnosis: Sudden disappearance of metrics in Grafana coupled with Reason: OOMKilled in Kubernetes events.
    • Fix: Increase spark.kubernetes.memoryOverheadFactor or spark.executor.memoryOverhead.
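The container limit Kubernetes enforces is the executor heap plus this overhead, which is why a job can be OOMKilled even though the JVM heap looks healthy. A sketch of Spark's documented sizing rule (overhead = max(factor × executor memory, 384 MiB), with the factor defaulting to 0.1 for JVM jobs):

```python
# Sketch of Spark's executor pod memory sizing on Kubernetes:
# pod limit = spark.executor.memory + memory overhead, where the overhead
# defaults to max(overheadFactor * executor memory, 384 MiB) unless
# spark.executor.memoryOverhead is set explicitly.

MIN_OVERHEAD_MIB = 384

def pod_memory_limit_mib(executor_memory_mib: int,
                         overhead_factor: float = 0.1) -> int:
    overhead = max(int(executor_memory_mib * overhead_factor), MIN_OVERHEAD_MIB)
    return executor_memory_mib + overhead

print(pod_memory_limit_mib(4096))   # 4 GiB heap -> 4505 MiB pod limit
```

Note the 384 MiB floor: for small executors (≤ ~3.8 GiB heap) the factor is irrelevant, so raising it does nothing until the heap is large enough.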

2. Identifying Data Skew

Data skew occurs when one partition contains significantly more data than others, causing a single task to run much longer than the rest ("Straggler").

  • Diagnosis: In the Ilum Timeline, look for a stage where the Max Task Duration is 10x+ higher than the Median Task Duration.
  • Fix: Use repartition() to redistribute data or enable Adaptive Query Execution (spark.sql.adaptive.enabled=true).
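When AQE alone does not resolve the skew, a common manual fix is "salting" the hot key: appending a random suffix so its rows hash across many partitions instead of one. A plain-Python illustration of the idea (no Spark required; the join side must be expanded with every suffix to still match):

```python
import random

NUM_PARTITIONS = 8

def partition_of(key, num_partitions=NUM_PARTITIONS):
    # Stand-in for Spark's hash partitioner.
    return hash(key) % num_partitions

# Without salting: every row of the hot key lands in a single partition.
hot_rows = ["user_42"] * 1000
plain = {partition_of(k) for k in hot_rows}

# With salting: append a random suffix (0..7) before partitioning,
# spreading the hot key's rows across partitions.
rng = random.Random(0)
salted = {partition_of(f"{k}_{rng.randrange(NUM_PARTITIONS)}") for k in hot_rows}

print(len(plain), len(salted))  # 1 partition vs. several
```

The trade-off is a fan-out on the other join side (each key duplicated once per suffix), so salting pays off only when the skew is severe enough to dominate the stage.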

Legacy Support: Graphite

For environments already standardized on Graphite, Ilum supports pushing metrics via the GraphiteSink.

Enable Graphite Exporter:

helm upgrade ilum ilum/ilum \
--set ilum-core.job.graphite.enabled=true \
--set graphite-exporter.graphite.enabled=true \
--reuse-values

Graphite Configuration