Get Started

This guide covers deploying Ilum on Kubernetes and submitting your first Spark job.

Installation Architecture

Ilum follows a modular architecture where core Spark execution capabilities are separated from optional data platform features. The base installation provides:

  • Spark 3.x/4.x job orchestration on Kubernetes
  • Jupyter notebook integration
  • REST API for job submission

Additional modules enable enterprise data platform capabilities:

  • Hive Metastore: Centralized metadata management for tables and schemas
  • SQL Viewer: Interactive SQL query interface with result caching
  • Data Lineage: OpenLineage-based tracking of data transformations across jobs
  • Monitoring: Grafana dashboards with Spark metrics and resource utilization

Resource Planning:

  • Base deployment (Spark + Jupyter): 8-12GB RAM, 6 CPU cores
  • With metadata + lineage modules: 18GB RAM, 12 CPU cores
  • Production workloads: Size based on concurrent Spark executor requirements

Module selection impacts pod count, storage IOPS, and network traffic. Each module runs in dedicated pods with configurable resource limits.

Prerequisites

To run Ilum on your machine, you need the following:

Kubernetes Cluster

Ilum deploys exclusively on Kubernetes using Helm charts. Any CNCF-compliant Kubernetes distribution works:

Supported Platforms:

  • Local development: Minikube, Microk8s, K3s, Docker Desktop
  • Cloud-managed: GKE, EKS, AKS, DigitalOcean Kubernetes
  • Self-hosted: K8s on bare metal, OpenShift, Rancher

Architecture Support:

  • Multi-arch container images (amd64, arm64)
  • Tested on Linux, macOS (M1/M2), Windows WSL2

Quick Local Setup: For development/testing without an existing cluster, use Minikube (installation guide) or Microk8s (installation guide).

This guide uses Minikube for examples. Verify installation with:

minikube version

Issues with Minikube on Windows OS

If you are using Windows, you may encounter issues with Minikube related to the driver.

On Windows, Minikube can choose from a variety of drivers (hosts for the Kubernetes cluster), but you generally want either Hyper-V or Docker. If you have Docker installed, either run Minikube with the Docker driver or enable the built-in Kubernetes support in Docker Desktop.

If Docker is not available, use the Hyper-V driver; consult this guide for setup. Keep in mind that Minikube needs administrator privileges to interface with Hyper-V.

kubectl (Logs & Troubleshooting)

Install kubectl to inspect Ilum resources and stream logs.

Install

Guide

  • macOS: brew install kubectl

  • Windows: winget install -e --id Kubernetes.kubectl, or choco install kubernetes-cli

  • Linux:

    curl -LO "https://dl.k8s.io/release/$(curl -Ls https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
    sudo install -m 0755 kubectl /usr/local/bin/kubectl

Quick use

kubectl get pods -n <ns>
kubectl logs -n <ns> <pod> --all-containers -f
kubectl describe pod -n <ns> <pod>
kubectl get events -n <ns> --sort-by=.lastTimestamp

Helm

Helm is a package manager for Kubernetes that allows you to define, install, and upgrade Kubernetes applications. If you haven't installed Helm yet, you can find instructions here.

Cluster Resource Allocation

Minikube resource allocation determines available capacity for Spark executors and Ilum services.

Configuration Options:

For full module testing (metadata, lineage, SQL):

minikube start --cpus 12 --memory 18192 --addons metrics-server

For minimal Spark workloads:

minikube start --cpus 6 --memory 12288 --addons metrics-server

The metrics-server addon exposes pod-level CPU/memory metrics to the Ilum UI dashboard.
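The two profiles above differ only in CPU and memory. As a minimal sketch, the following Python helper assembles the matching `minikube start` command. The helper name `minikube_start_command` and the profile names `full` and `minimal` are illustrative only and are not part of Minikube or Ilum; the CPU and memory figures are the ones quoted in this guide.

```python
import shlex
from typing import Dict, List

# CPU count and memory (in MB) copied from the two commands in this guide.
# The profile names are hypothetical labels chosen for this sketch.
PROFILES: Dict[str, Dict[str, int]] = {
    "full": {"cpus": 12, "memory": 18192},
    "minimal": {"cpus": 6, "memory": 12288},
}


def minikube_start_command(profile: str) -> List[str]:
    """Return the argv list for `minikube start` for a named profile."""
    if profile not in PROFILES:
        raise ValueError(f"unknown profile: {profile!r}")
    size = PROFILES[profile]
    return [
        "minikube", "start",
        "--cpus", str(size["cpus"]),
        "--memory", str(size["memory"]),
        "--addons", "metrics-server",
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(minikube_start_command("minimal")))
```

Returning an argv list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting concerns.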

Minikube Limitations:

  • Single-node cluster (no distributed executor scheduling)
  • Suitable for functional testing, not performance benchmarks

For production deployments, see Production Setup.

Helm Deployment

Add the Ilum chart repository:

helm repo add ilum https://charts.ilum.cloud
helm repo update

Option 1: Spark with Metadata, SQL, and Lineage Modules

Includes Hive Metastore for table metadata, a SQL query interface, and OpenLineage data tracking.

helm install ilum ilum/ilum \
--set ilum-hive-metastore.enabled=true \
--set ilum-core.metastore.enabled=true \
--set ilum-core.metastore.type=hive \
--set ilum-sql.enabled=true \
--set ilum-core.sql.enabled=true \
--set global.lineage.enabled=true
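Each `--set` key in the command above is a dotted path into the chart's values tree. As a rough illustration, the Python sketch below expands those flat keys into the nested structure an equivalent values file would carry. The function name `nest_values` is hypothetical, and the sketch ignores Helm's escaping rules for keys that contain literal dots or commas.

```python
import json
from typing import Any, Dict

# Keys and values copied from the helm install command in this guide.
SET_FLAGS: Dict[str, Any] = {
    "ilum-hive-metastore.enabled": True,
    "ilum-core.metastore.enabled": True,
    "ilum-core.metastore.type": "hive",
    "ilum-sql.enabled": True,
    "ilum-core.sql.enabled": True,
    "global.lineage.enabled": True,
}


def nest_values(flat: Dict[str, Any]) -> Dict[str, Any]:
    """Expand dotted keys ("a.b.c") into nested dictionaries."""
    nested: Dict[str, Any] = {}
    for dotted_key, value in flat.items():
        *parents, leaf = dotted_key.split(".")
        node = nested
        for part in parents:
            # Create intermediate levels on demand.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


if __name__ == "__main__":
    print(json.dumps(nest_values(SET_FLAGS), indent=2))
```

Seeing the nested form makes it easier to spot that `ilum-core.metastore.*` and `ilum-core.sql.*` configure the core service, while `ilum-hive-metastore.enabled` and `ilum-sql.enabled` switch on separate modules.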

Capabilities enabled:

  • Centralized Hive Metastore (compatible with Spark, Presto, Trino)
  • SQL query execution with result pagination
  • Automatic lineage capture via OpenLineage hooks
  • Table/column-level lineage visualization

Resource overhead: ~8GB RAM, 6 CPU cores for metadata services.

Option 2: Basic Spark Execution Platform

Minimal deployment for Spark job submission and notebook execution. No persistent metadata or lineage tracking.

helm install ilum ilum/ilum

Use case: Development, testing, ephemeral workloads where table schemas are managed externally.

Option 3: Custom Module Selection

Use the module selector to generate Helm commands with specific integrations (Airflow, Superset, MLflow, etc.).

Deployment time: Services typically reach ready state within 2-6 minutes. Monitor with:

kubectl get pods -w
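If you prefer a scripted check over watching the output, the sketch below reads `kubectl get pods -o json` and lists pods that do not yet report a `Ready` condition. It assumes Ilum was installed into the `default` namespace, as in the commands above; the function names are illustrative and not part of any Ilum tooling.

```python
import json
import subprocess
from typing import Any, Dict, List


def not_ready_pods(pod_list: Dict[str, Any]) -> List[str]:
    """Return names of pods whose Ready condition is not "True"."""
    pending: List[str] = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") == "Succeeded":
            # Completed one-off pods never become Ready; skip them.
            continue
        ready = any(
            cond.get("type") == "Ready" and cond.get("status") == "True"
            for cond in status.get("conditions", [])
        )
        if not ready:
            pending.append(pod["metadata"]["name"])
    return pending


def fetch_pods(namespace: str = "default") -> Dict[str, Any]:
    """Run kubectl and parse its JSON output (requires kubectl on PATH)."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    waiting = not_ready_pods(fetch_pods())
    print("All pods ready" if not waiting else f"Still waiting on: {waiting}")
```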

For advanced configuration options, see Helm chart documentation.

Installation Problems

If you run into problems installing Ilum, visit the troubleshooting section here or send us an email ([email protected]).

UI Access

The Ilum web interface provides job management, resource monitoring, and SQL query capabilities. Default credentials: admin / admin

Minikube Service Exposure

minikube service ilum-ui

Returns a cluster-accessible URL (e.g., http://192.168.49.2:31777).

NodePort (Default)

The UI service is exposed via NodePort on 31777 by default. Find your node IP:

kubectl get nodes -o wide

Access at http://<NODE_IP>:31777.

Port Forwarding (Development)

kubectl port-forward svc/ilum-ui 9777:9777

Access at http://localhost:9777.
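To confirm that the UI answers before opening a browser, a small standard-library probe is enough. The sketch below builds the URL from a host and port and reports whether anything responds; the helper names are hypothetical, and the defaults simply mirror the port-forward example above (use port 31777 with a node IP for the NodePort route).

```python
import urllib.error
import urllib.request


def ui_url(host: str = "localhost", port: int = 9777) -> str:
    """Build the UI address; defaults match the port-forward example."""
    return f"http://{host}:{port}"


def is_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if an HTTP server answers at the URL, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a success status.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return False


if __name__ == "__main__":
    url = ui_url()
    print(f"{url} reachable: {is_reachable(url)}")
```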

Ingress Controller (Production)

For production deployments, configure an Ingress resource with TLS termination. See Ingress configuration guide for details.

Authentication:

  • Default admin account: admin / admin
  • Change credentials via Helm values or UI user management
  • LDAP/OAuth2 integration available (see Security docs)

Submitting a Spark Application on UI

Tip: New to Ilum? Learn the fastest path from install → first job. Take the official Ilum Course.

Now that your Kubernetes cluster is configured to handle Spark jobs via Ilum, let's submit a Spark application. For this example, we'll use the "SparkPi" example from the Spark documentation. You can download the required jar file from this link.

Ilum will create a Spark driver pod using the Spark 3.x Docker image. The number of Spark executor pods can be scaled across multiple nodes as your requirements dictate.

And that's it! You've successfully set up Ilum and run your first Spark job. Feel free to explore the Ilum UI and API for submitting and managing Spark applications. For traditional approaches, you can also use the familiar spark-submit command.

Interactive Spark Job with Scala/Java

Interactive jobs in Ilum are long-running sessions that can execute job instance requests immediately. Because the Spark context stays alive between requests, there is no need to wait for it to initialize every time. If multiple users point to the same job ID, they interact with the same Spark context.

To make an existing Spark job interactive, implement a simple interface in the part of your code that needs to be interactive. Here's how you can do it:

First, add the Ilum job API dependency to your project:

Gradle

implementation 'cloud.ilum:ilum-job-api:6.3.0'

Maven

<dependency>
<groupId>cloud.ilum</groupId>
<artifactId>ilum-job-api</artifactId>
<version>6.3.0</version>
</dependency>

sbt

libraryDependencies += "cloud.ilum" % "ilum-job-api" % "6.3.0"

Then, implement the Job trait/interface in your Spark job. Here's an example:

Scala

package interactive.job.example

import cloud.ilum.job.Job
import org.apache.spark.sql.SparkSession

class InteractiveJobExample extends Job {

  override def run(sparkSession: SparkSession, config: Map[String, Any]): Option[String] = {
    val userParam = config.getOrElse("userParam", "None").toString
    Some(s"Hello ${userParam}")
  }
}

Java

package interactive.job.example;

import cloud.ilum.job.Job;
import org.apache.spark.sql.SparkSession;
import scala.Option;
import scala.Some;
import scala.collection.immutable.Map;

public class InteractiveJobExample implements Job {

    @Override
    public Option<String> run(SparkSession sparkSession, Map<String, Object> config) {
        // toString() mirrors the Scala version and tolerates non-String values.
        String userParam = config.getOrElse("userParam", () -> "None").toString();
        return Some.apply("Hello " + userParam);
    }
}

In this example, the run method is overridden to accept a SparkSession and a configuration map. It retrieves a user parameter from the configuration map and returns a greeting message.

You can find a similar example on GitHub.

By following this pattern, you can transform your Spark jobs into interactive jobs that can execute calculations immediately, improving user interactivity and reducing waiting times.

Interactive Spark Job with Python

Below is an example of how to configure an interactive Spark job in Python using the ilum library:

  1. Spark Image Setup

    a) Use a Docker image from DockerHub
    Each Spark image we provide on DockerHub already has the necessary components built in.

    b) Install the ilum package
    If, for any reason, your Docker image does not include the ilum package or if you build your own custom image, you can install it (either within the container or locally) by running:

    pip install ilum
  2. Job Structure in ilum

    The Spark job logic is encapsulated in a class that extends IlumJob, specifically within its run method:

from ilum.api import IlumJob

class PythonSparkExample(IlumJob):
    def run(self, spark, config):
        # Job logic here
        pass

A simple interactive Spark Pi example:

from random import random
from operator import add

from ilum.api import IlumJob


class SparkPiInteractiveExample(IlumJob):

    def run(self, spark, config):
        partitions = int(config.get('partitions', '5'))
        n = 100000 * partitions

        def is_inside_unit_circle(_: int) -> float:
            x = random() * 2 - 1
            y = random() * 2 - 1
            return 1.0 if x ** 2 + y ** 2 <= 1 else 0.0

        count = (
            spark.sparkContext.parallelize(range(1, n + 1), partitions)
            .map(is_inside_unit_circle)
            .reduce(add)
        )

        pi_approx = 4.0 * count / n
        return f"Pi is roughly {pi_approx}"

You can find a similar example on GitHub.
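The arithmetic in the Spark Pi job can be checked without a cluster. The sketch below runs the same Monte Carlo estimate in plain Python: it samples points in the square [-1, 1] x [-1, 1], counts those inside the unit circle, and multiplies the hit ratio by 4. The function name `estimate_pi` and the fixed seed are choices made for this illustration and are not part of the ilum package.

```python
import random


def estimate_pi(n: int, seed: int = 0) -> float:
    """Estimate pi from n random points, mirroring the job's formula."""
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    count = 0.0
    for _ in range(n):
        x = rng.random() * 2 - 1
        y = rng.random() * 2 - 1
        # Same test as is_inside_unit_circle in the Spark job.
        count += 1.0 if x ** 2 + y ** 2 <= 1 else 0.0
    # Circle area / square area = pi / 4, so scale the hit ratio by 4.
    return 4.0 * count / n


if __name__ == "__main__":
    print(f"Pi is roughly {estimate_pi(500_000)}")
```

In the Spark version, the loop body becomes the `map` step and the running sum becomes the `reduce(add)` step, with the work split across the requested number of partitions.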

Submitting an Interactive Spark Job on UI

After creating a file that contains your Spark code, you will need to submit it to Ilum. Here's how you can do it:

Open Ilum UI in your browser and create a new service:

In the General tab, enter a name for the service.

In the Memory tab, choose a cluster and configure your memory settings.

In the Resource tab, upload your Spark file.

Press Submit to apply your changes, and Ilum will automatically create a Spark driver pod. You can adjust the number of Spark executor pods by scaling them as needed.

Next, go to the Workloads section to locate your job. By clicking on its name, you can access its detailed view. Once the Spark container is ready, you can run the job by specifying the filename.classname and defining any optional parameters in JSON format.

Now enter filename.classname in the Class field:

Ilum_interactive_spark_pi.SparkPiInteractiveExample

and define the partitions parameter in JSON format:

{
"partitions": 5
}
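The sketch below shows how such a JSON payload lines up with the `config.get('partitions', '5')` call in the job. It assumes that Ilum hands the parsed JSON object to `run` as the `config` dictionary, which the example job implies but this guide does not state outright; `read_partitions` is a hypothetical helper used only for illustration.

```python
import json
from typing import Any, Dict


def read_partitions(config: Dict[str, Any]) -> int:
    """Read 'partitions' the way the example job does, defaulting to 5."""
    # int() accepts both the JSON number 5 and the string "5".
    return int(config.get("partitions", "5"))


if __name__ == "__main__":
    params = json.loads('{"partitions": 5}')
    print(read_partitions(params))  # value taken from the JSON payload
    print(read_partitions({}))      # falls back to the default
```

Wrapping the lookup in `int(...)` is what lets the job accept the parameter whether it arrives as a number or as a string.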

The first request might take a few seconds because of the initialization phase; each subsequent request will be immediate.

By following these steps, you can submit and run interactive Spark jobs using Ilum. This functionality provides real-time data processing, enhances user interactivity, and reduces the time spent waiting for results.