<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Ilum Blog - Free Data Lakehouse]]></title><description><![CDATA[Ilum's blog includes technical tutorials about Spark on K8s. With Ilum, you can take advantage of the scalability and flexibility of K8s to easily manage and monitor your data lakehouse.. ]]></description><link>https://blog.ilum.cloud/</link><image><url>https://blog.ilum.cloud/favicon.png</url><title>Ilum Blog - Free Data Lakehouse</title><link>https://blog.ilum.cloud/</link></image><generator>Ghost 5.85</generator><lastBuildDate>Mon, 08 Dec 2025 06:03:12 GMT</lastBuildDate><atom:link href="https://blog.ilum.cloud/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[How to run Apache Spark on Kubernetes in less than 5min]]></title><description><![CDATA[Installing Apache Spark on Kubernetes can be streamlined using tools like Ilum. This guide provides step-by-step instructions to efficiently run Spark on your Kubernetes cluster.]]></description><link>https://blog.ilum.cloud/spark-on-kubernetes/</link><guid isPermaLink="false">62ddb59d0799a50001dec6e4</guid><category><![CDATA[News]]></category><category><![CDATA[Updated]]></category><dc:creator><![CDATA[Ilum]]></dc:creator><pubDate>Thu, 20 Nov 2025 15:11:00 GMT</pubDate><media:content url="https://blog.ilum.cloud/content/images/2022/07/Spark-and-Kubernetes-Ilum.png" medium="image"/><content:encoded><![CDATA[<img src="https://blog.ilum.cloud/content/images/2022/07/Spark-and-Kubernetes-Ilum.png" alt="How to run Apache Spark on Kubernetes in less than 5min"><p><br>Tools like Ilum will go a long way in simplifying the process of installing Apache Spark on Kubernetes. This guide will take you through, step by step, how to run Spark well on your Kubernetes cluster. 
With Ilum, you can deploy, manage, and scale Apache Spark clusters with minimal effort.</p>
<!--kg-card-begin: html-->
<div id="post-table-of-contents" max-depth="h2"></div>
<!--kg-card-end: html-->
<h2 id="introduction">Introduction</h2><p>Today, we will showcase how to get up and running with Apache Spark on K8s. There are many ways to do that, but most are complex and require several configurations. We will use <a href="https://ilum.cloud/?ref=blog.ilum.cloud"><strong>Ilum</strong></a> since it will do all the cluster setup for us. In the next blog post, we will compare this approach with the Spark Operator.</p><figure class="kg-card kg-image-card"><a href="https://ilum.cloud/?ref=blog.ilum.cloud"><img src="/blog/content/images/2022/11/Groups-dark-mode-1.png" class="kg-image" alt="How to run Apache Spark on Kubernetes in less than 5min" loading="lazy" width="2000" height="523" srcset="/blog/content/images/size/w600/2022/11/Groups-dark-mode-1.png 600w,/blog/content/images/size/w1000/2022/11/Groups-dark-mode-1.png 1000w,/blog/content/images/size/w1600/2022/11/Groups-dark-mode-1.png 1600w,/blog/content/images/size/w2400/2022/11/Groups-dark-mode-1.png 2400w" sizes="(min-width: 720px) 720px"></a></figure><p>Ilum is a free, modular data lakehouse that makes it easy to deploy and manage Apache Spark clusters. It offers a simple API for defining and managing Spark, handles all dependencies for you, and helps you build your own managed Spark service.</p><p>With Ilum, you can deploy Spark clusters in minutes and start running Spark applications immediately. Ilum lets you easily scale your Spark clusters out and in, and manage multiple clusters from a single UI.</p><p>Even if you are relatively new to Apache Spark on Kubernetes, Ilum makes getting started easy.</p><h2 id="step-by-step-guide-to-install-apache-spark-on-kubernetes">Step-by-Step Guide to Install Apache Spark on Kubernetes</h2><h3 id="quick-start">Quick start</h3><p>We assume that you have a Kubernetes cluster up and running; if you don&apos;t, check out these instructions to set up a Kubernetes cluster with minikube. 
<a href="https://minikube.sigs.k8s.io/docs/start/?ref=blog.ilum.cloud">Check how to install minikube</a>.</p><h3 id="setup-a-local-kubernetes-cluster">Set up a local Kubernetes cluster</h3><ul><li><strong>Install Minikube:</strong> Execute the following command to start Minikube with the recommended resources: 6 vCPUs, 12288 MB of memory, and the metrics-server add-on, which is necessary for monitoring.</li></ul><pre><code class="language-bash">minikube start --cpus 6 --memory 12288 --addons metrics-server</code></pre><p>Once you have a running Kubernetes cluster, installing Ilum is just a few commands away:</p><h2 id="install-spark-on-kubernetes-with-ilum">Install Spark on Kubernetes with Ilum</h2><div class="kg-card kg-callout-card kg-callout-card-green"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">You can also use the module selection tool from <a href="https://ilum.cloud/resources/getting-started?ref=blog.ilum.cloud" rel="noreferrer">here</a> to include features like SQL or n8n.</div></div><ol><li><strong>Add </strong><a href="https://artifacthub.io/packages/helm/ilum/ilum?ref=blog.ilum.cloud" rel="noreferrer"><strong>Ilum Helm Repository</strong></a></li></ol><pre><code class="language-bash">helm repo add ilum https://charts.ilum.cloud</code></pre><ol start="2"><li><strong>Install Ilum in Your Cluster</strong></li></ol><p>Here we have a few options. </p><p>a) The recommended option is to start with a few additional modules turned on (Data Lineage, SQL, Data Catalog).</p><pre><code class="language-bash">helm install ilum ilum/ilum \
--set ilum-hive-metastore.enabled=true \
--set ilum-core.metastore.enabled=true \
--set ilum-sql.enabled=true \
--set ilum-core.sql.enabled=true \
--set global.lineage.enabled=true</code></pre><p>b) You can also start with the most basic option, which includes only Spark and Jupyter notebooks.</p><pre><code class="language-bash">helm install ilum ilum/ilum</code></pre><p>c) You can also use Ilum&apos;s module selection tool <a href="https://ilum.cloud/resources/getting-started?ref=blog.ilum.cloud" rel="noreferrer">here</a>.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">A slow internet connection combined with a large Docker image can cause the Kubernetes pod to fail due to the 2-minute image download timeout. That&apos;s why we recommend pulling the image manually beforehand:<br><br><b><strong style="white-space: pre-wrap;">minikube ssh docker pull ilum/core:6.6.0</strong></b></div></div><p>This setup should take around two minutes. Ilum will deploy into your Kubernetes cluster, preparing it to handle Spark jobs.</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/11/ilum_spark_pods.png" class="kg-image" alt="How to run Apache Spark on Kubernetes in less than 5min" loading="lazy" width="564" height="143"></figure><p>Once Ilum is installed, you can access the UI via port-forwarding at localhost:9777.</p><ol start="3"><li><strong>Port Forward to Access UI:</strong>&#xA0;Use Kubernetes port-forwarding to access the Ilum UI.</li></ol><pre><code class="language-bash">kubectl port-forward svc/ilum-ui 9777:9777
</code></pre><p>Use <strong>admin/admin</strong> as the default credentials. You can change them during the <a href="https://ilum.cloud/docs/security/authentication/?ref=blog.ilum.cloud#internal-authentication" rel="noreferrer">deployment process</a>.</p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="/blog/content/images/2025/11/spark_ui_6_6_0-1.png" class="kg-image" alt="How to run Apache Spark on Kubernetes in less than 5min" loading="lazy" width="1907" height="581" srcset="/blog/content/images/size/w600/2025/11/spark_ui_6_6_0-1.png 600w,/blog/content/images/size/w1000/2025/11/spark_ui_6_6_0-1.png 1000w,/blog/content/images/size/w1600/2025/11/spark_ui_6_6_0-1.png 1600w,/blog/content/images/2025/11/spark_ui_6_6_0-1.png 1907w" sizes="(min-width: 1200px) 1200px"><figcaption><span style="white-space: pre-wrap;">spark ui</span></figcaption></figure><p>That&#x2019;s it: your Kubernetes cluster is now configured to handle Spark jobs. Ilum provides a simple API and UI that make it easy to submit Spark applications. You can also use the good old <a href="https://spark.apache.org/docs/latest/submitting-applications.html?ref=blog.ilum.cloud">spark-submit</a>.</p><h3 id="deploy-spark-application-on-kubernetes">Deploy a Spark application on Kubernetes</h3><p>Let&#x2019;s now start a simple Spark job. We&apos;ll use the &quot;SparkPi&quot; example from the Spark <a href="https://spark.apache.org/docs/latest/submitting-applications.html?ref=blog.ilum.cloud#launching-applications-with-spark-submit">documentation</a>. 
You can use the jar file from this <a href="https://ilum.cloud/release/latest/spark-examples_2.12-3.1.2.jar?ref=blog.ilum.cloud">link</a>.</p><figure class="kg-card kg-embed-card kg-card-hascaption"><iframe width="200" height="113" src="https://www.youtube.com/embed/e5-KQgE7Yhc?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen title="Ilum - How to create simple spark job on kubernetes (New)"></iframe><figcaption><p><span style="white-space: pre-wrap;">ilum add spark job</span></p></figcaption></figure><p>Ilum will create a Spark driver pod on Kubernetes using a Spark 3.x Docker image. You can control the number of Spark executor pods and scale them across multiple nodes. This is the simplest way to submit Spark applications to K8s.</p><figure class="kg-card kg-image-card kg-width-wide"><img src="/blog/content/images/2022/07/spark_pod.png" class="kg-image" alt="How to run Apache Spark on Kubernetes in less than 5min" loading="lazy" width="848" height="75" srcset="/blog/content/images/size/w600/2022/07/spark_pod.png 600w,/blog/content/images/2022/07/spark_pod.png 848w"></figure><p>Running Spark on Kubernetes is easy and frictionless with Ilum. It configures your whole cluster and presents you with an interface where you can manage and monitor your Spark cluster. We believe Spark applications on Kubernetes are the future of big data. With Kubernetes, Spark applications can handle huge volumes of data much more reliably, yielding accurate insights and enabling data-driven decisions.</p><h3 id="submitting-a-spark-application-to-kubernetes-old-style">Submitting a Spark Application to Kubernetes (old style)</h3>
<p>Submitting a Spark job to a Kubernetes cluster involves using the <code>spark-submit</code> script with configurations specific to Kubernetes. Here&apos;s a step-by-step guide:</p>
<p><strong>Steps</strong>:</p>
<ol>
<li>
<p><strong>Prepare the Spark Application</strong>: Package your Spark application into a JAR file (for Scala/Java) or a Python script.</p>
</li>
<li>
<p><strong>Use <code>spark-submit</code> to Deploy</strong>: Execute the <code>spark-submit</code> command with Kubernetes-specific options:</p>
<pre><code class="language-bash">./bin/spark-submit \
  --master k8s://https://&lt;k8s-apiserver-host&gt;:&lt;k8s-apiserver-port&gt; \
  --deploy-mode cluster \
  --name spark-app \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.container.image=&lt;your-spark-image&gt; \
  local:///path/to/your-app.jar
</code></pre>
<p>Replace:</p>
<ul>
<li><code>&lt;k8s-apiserver-host&gt;</code>: Your Kubernetes API server host.</li>
<li><code>&lt;k8s-apiserver-port&gt;</code>: Your Kubernetes API server port.</li>
<li><code>&lt;your-spark-image&gt;</code>: The Docker image containing Spark.</li>
<li><code>local:///path/to/your-app.jar</code>: Path to your application JAR within the Docker image.</li>
</ul>
</li>
</ol>
<p><strong>Key Configurations</strong>:</p>
<ul>
<li><code>--master</code>: Specifies the Kubernetes API URL.</li>
<li><code>--deploy-mode</code>: Set to <code>cluster</code> to run the driver inside the Kubernetes cluster.</li>
<li><code>--name</code>: Names your Spark application.</li>
<li><code>--class</code>: Main class of your application.</li>
<li><code>--conf spark.executor.instances</code>: Number of executor pods.</li>
<li><code>--conf spark.kubernetes.container.image</code>: Docker image for Spark pods.</li>
</ul>
<p>For more details, refer to the <a href="https://spark.apache.org/docs/latest/running-on-kubernetes.html?ref=blog.ilum.cloud">Apache Spark Documentation on Running on Kubernetes</a>.</p>
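In practice, only a handful of these values change between environments, so the command is often wrapped in a small script for repeatable use from a laptop or a CI pipeline. A minimal sketch, assuming illustrative defaults (the master host, image name, and jar path are placeholders you would override):

```shell
#!/usr/bin/env bash
# Sketch of a reusable spark-submit wrapper for Kubernetes.
# All defaults below are illustrative; override them via environment variables.
set -euo pipefail

K8S_MASTER="${K8S_MASTER:-https://127.0.0.1:6443}"
SPARK_IMAGE="${SPARK_IMAGE:-spark:3.5.3}"
APP_JAR="${APP_JAR:-local:///opt/spark/examples/jars/spark-examples.jar}"
MAIN_CLASS="${MAIN_CLASS:-org.apache.spark.examples.SparkPi}"
EXECUTORS="${EXECUTORS:-3}"

# Assemble the same command shown above from the overridable values.
SUBMIT_CMD="./bin/spark-submit \
  --master k8s://${K8S_MASTER} \
  --deploy-mode cluster \
  --name spark-app \
  --class ${MAIN_CLASS} \
  --conf spark.executor.instances=${EXECUTORS} \
  --conf spark.kubernetes.container.image=${SPARK_IMAGE} \
  ${APP_JAR}"

echo "$SUBMIT_CMD"   # print the command; run it with: eval "$SUBMIT_CMD"
```

Keeping the cluster-specific values in environment variables means the same script works unchanged across development and CI environments.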
<h3 id="2-creating-a-custom-docker-image-for-spark">Creating a Custom Docker Image for Spark</h3>
<p>Building a custom Docker image allows you to package your Spark application and its dependencies, ensuring consistency across environments.</p>
<p><strong>Steps</strong>:</p>
<ol>
<li>
<p><strong>Create a Dockerfile</strong>: Define the environment and dependencies.</p>
<pre><code class="language-dockerfile"># Use the official Spark base image
FROM spark:3.5.3

# Set environment variables
ENV SPARK_HOME=/opt/spark
ENV PATH=$PATH:$SPARK_HOME/bin

# Copy your application JAR into the image
COPY your-app.jar $SPARK_HOME/examples/jars/

# Set the entry point to run your application
ENTRYPOINT [&quot;spark-submit&quot;, &quot;--class&quot;, &quot;org.apache.spark.examples.SparkPi&quot;, &quot;--master&quot;, &quot;local[4]&quot;, &quot;/opt/spark/examples/jars/your-app.jar&quot;]
</code></pre>
<p>In this Dockerfile:</p>
<ul>
<li><code>FROM spark:3.5.3</code>: Uses the official Spark image as the base.</li>
<li><code>ENV</code>: Sets environment variables for Spark.</li>
<li><code>COPY</code>: Adds your application JAR to the image.</li>
<li><code>ENTRYPOINT</code>: Defines the default command to run your Spark application.</li>
</ul>
</li>
<li>
<p><strong>Build the Docker Image</strong>: Use Docker to build your image.</p>
<pre><code class="language-bash">docker build -t your-repo/your-spark-app:latest .
</code></pre>
<p>Replace <code>your-repo/your-spark-app</code> with your Docker repository and image name.</p>
</li>
<li>
<p><strong>Push the Image to a Registry</strong>: Upload your image to a Docker registry accessible by your Kubernetes cluster.</p>
<pre><code class="language-bash">docker push your-repo/your-spark-app:latest
</code></pre>
</li>
</ol>
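If you are experimenting on the local minikube cluster from the quick start, you can skip the registry push and load the freshly built image directly into the cluster's container runtime (the image name matches the build step above):

```shell
# Make the locally built image visible to minikube without a remote registry,
# avoiding a push/pull round-trip while iterating.
minikube image load your-repo/your-spark-app:latest
```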
<p>While using <code>spark-submit</code> is a common method for deploying Spark applications, it may not be the most efficient approach for production environments. Manual submissions can lead to inconsistencies and are challenging to integrate into automated workflows. To enhance efficiency and maintainability, leveraging Ilum&apos;s REST API is recommended.</p>
<p><strong>Automating Spark Deployments with Ilum&apos;s REST API</strong></p>
<p>Ilum offers a robust RESTful API that enables seamless interaction with Spark clusters. This API facilitates the automation of job submissions, monitoring, and management, making it an ideal choice for Continuous Integration/Continuous Deployment (CI/CD) pipelines.</p>
<p><strong>Benefits of Using Ilum&apos;s REST API:</strong></p>
<ul>
<li><strong>Automation</strong>: Integrate Spark job submissions into CI/CD pipelines, reducing manual intervention and potential errors.</li>
<li><strong>Consistency</strong>: Ensure uniform deployment processes across different environments.</li>
<li><strong>Scalability</strong>: Easily manage multiple Spark clusters and jobs programmatically.</li>
</ul>
<p><strong>Example: Submitting a Spark Job via Ilum&apos;s REST API</strong></p>
<p>To submit a Spark job using Ilum&apos;s REST API, you can make an HTTP POST request with the necessary parameters. Here&apos;s a simplified example using <code>curl</code>:</p>
<pre><code class="language-bash">curl -X POST https://&lt;ilum-server&gt;/api/v1/job/submit \
  -H &quot;Content-Type: multipart/form-data&quot; \
  -F &quot;name=example-job&quot; \
  -F &quot;clusterName=default&quot; \
  -F &quot;jobClass=org.apache.spark.examples.SparkPi&quot; \
  -F &quot;jars=@/path/to/your-app.jar&quot; \
  -F &quot;jobConfig=spark.executor.instances=3;spark.executor.memory=4g&quot;
</code></pre>
<p>In this command:</p>
<ul>
<li><code>name</code>: Specifies the job name.</li>
<li><code>clusterName</code>: Indicates the target cluster.</li>
<li><code>jobClass</code>: Defines the main class of your Spark application.</li>
<li><code>jars</code>: Uploads your application JAR file.</li>
<li><code>jobConfig</code>: Sets Spark configurations, such as the number of executors and memory allocation.</li>
</ul>
<p>For detailed information on the API endpoints and parameters, refer to the <a href="https://ilum.cloud/docs/api/?ref=blog.ilum.cloud">Ilum API Documentation</a>.</p>
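When scripting this call in a pipeline, you typically capture the HTTP response and extract an identifier for follow-up status checks. A sketch using a hard-coded sample response; the "jobId" field name here is an assumption for illustration only, so consult the API documentation for the actual response shape:

```shell
# Extract a job id from a JSON response with POSIX tools (no jq required).
# The sample payload below is a stand-in for: response=$(curl -s -X POST ... )
# and its "jobId" field is a hypothetical example, not a documented schema.
response='{"jobId":"example-123","state":"SUBMITTED"}'
job_id=$(printf '%s' "$response" | sed -n 's/.*"jobId":"\([^"]*\)".*/\1/p')
echo "Submitted job: $job_id"
```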
<p><strong>Enhancing Efficiency with Interactive Spark Jobs</strong></p>
<p>Beyond automating job submissions, transforming Spark jobs into interactive microservices can significantly optimize resource utilization and response times. Ilum supports the creation of long-running interactive Spark sessions that can process real-time data without the overhead of initializing a new Spark context for each request.</p>
<p><strong>Advantages of Interactive Spark Jobs:</strong></p>
<ul>
<li><strong>Reduced Latency</strong>: Eliminates the need to start a new Spark context for every job, leading to faster execution.</li>
<li><strong>Resource Optimization</strong>: Maintains a persistent Spark context, allowing for efficient resource management.</li>
<li><strong>Scalability</strong>: Handles multiple requests concurrently within the same Spark session.</li>
</ul>
<p>To implement an interactive Spark job with Ilum, you can define a Spark application that listens for incoming data and processes it in real-time. This approach is particularly beneficial for applications requiring immediate data processing and response.</p>
<p>For a comprehensive guide on setting up interactive Spark jobs and optimizing your Spark cluster, refer to Ilum&apos;s blog post: <a href="https://ilum.cloud/blog/how-to-optimize-your-spark-cluster-with-interactive-spark-jobs/?ref=blog.ilum.cloud">How to Optimize Your Spark Cluster with Interactive Spark Jobs</a>.</p>
<p>By integrating Ilum&apos;s REST API and adopting interactive Spark jobs, you can streamline your Spark workflows, enhance automation, and achieve a more efficient and scalable data processing environment.</p>
<h2 id="advantages-of-using-ilum-to-run-spark-on-kubernetes">Advantages of Using Ilum to Run Spark on Kubernetes</h2><p>Ilum is equipped with an intuitive UI and a resilient API for scaling and handling Spark clusters, letting you configure multiple Spark applications from a single interface. Here are a few key features:</p><ol><li><strong>Ease of Use</strong>: Ilum simplifies Spark configuration and management on Kubernetes with an intuitive Spark UI, eliminating complex setup processes.</li><li><strong>Quick Deployment:</strong>&#xA0;Set up, deploy, and scale Spark clusters in minutes, so you can start executing and testing applications right away.</li><li><strong>Scalability:</strong> Using the Kubernetes API, easily scale Spark clusters up or down to meet your data processing needs, ensuring optimal resource utilization.</li><li><strong>Modularity</strong>: Ilum comes with a modular framework that allows users to choose and combine different components such as Spark History Server, Apache Jupyter, Minio, and much more.</li></ol><h2 id="migrating-from-apache-hadoop-yarn">Migrating from Apache Hadoop YARN</h2><p>With Apache Hadoop YARN stagnating, more and more organizations are looking to migrate from YARN to Kubernetes. There are several reasons for this, but the most common is that Kubernetes provides a more resilient and flexible platform for managing big data workloads.<br><br>Migrating a data processing platform away from Apache Hadoop YARN is generally difficult. There are many factors to consider when making such a switch: data compatibility, processing speed, and cost. 
However, with careful planning and execution, the migration can go smoothly and successfully.</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/08/Selection_353.png" class="kg-image" alt="How to run Apache Spark on Kubernetes in less than 5min" loading="lazy" width="1913" height="468" srcset="/blog/content/images/size/w600/2022/08/Selection_353.png 600w,/blog/content/images/size/w1000/2022/08/Selection_353.png 1000w,/blog/content/images/size/w1600/2022/08/Selection_353.png 1600w,/blog/content/images/2022/08/Selection_353.png 1913w" sizes="(min-width: 720px) 720px"></figure><p>Kubernetes is a natural fit for big data workloads because of its inherent ability to scale horizontally. With Hadoop YARN, you are limited to the number of nodes in your cluster, whereas you can increase or reduce the number of nodes in a Kubernetes cluster on demand. <br><br>It also offers features that are not available in YARN, such as self-healing and horizontal autoscaling.</p><h2 id="time-to-make-the-switch-to-kubernetes">Time to make the Switch to Kubernetes?</h2><p>As the world of big data continues to evolve, so do the tools and technologies used to manage it. For years, Apache Hadoop YARN has been the de facto standard for resource management in big data environments. But with the rise of containerization and orchestration technologies like Kubernetes, is it time to make the switch?</p><p>Kubernetes has been gaining popularity as a container orchestration platform, and for good reason. It&apos;s flexible, scalable, and relatively easy to use. If you&apos;re still using traditional VM-based infrastructure, now might be the time to make the switch to Kubernetes.</p><p>If you&apos;re working with containers, then you should definitely care about Kubernetes. 
It can help you manage and deploy your containers more effectively, and it&apos;s especially useful if you&apos;re working with a lot of containers or if you&apos;re deploying your containers to a cloud platform.</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/08/ui-dashboard.png" class="kg-image" alt="How to run Apache Spark on Kubernetes in less than 5min" loading="lazy" width="2000" height="1196" srcset="/blog/content/images/size/w600/2022/08/ui-dashboard.png 600w,/blog/content/images/size/w1000/2022/08/ui-dashboard.png 1000w,/blog/content/images/size/w1600/2022/08/ui-dashboard.png 1600w,/blog/content/images/size/w2400/2022/08/ui-dashboard.png 2400w" sizes="(min-width: 720px) 720px"></figure><p>Kubernetes is also a great choice if you&apos;re looking for an orchestration tool that&apos;s backed by a major tech company. Google has been using Kubernetes for years to manage its own containerized applications, and they&apos;ve invested a lot of time and resources into making it a great tool.</p><p>There is no clear winner in the YARN vs. Kubernetes debate. The best solution for your organization will depend on your specific needs and use cases. If you are looking for a more flexible and scalable resource management solution, Kubernetes is worth considering. If you need better support for legacy applications, YARN may be a better option.<br><br>Whichever platform you choose, Ilum can help you get the most out of it. Our platform is designed to work with both YARN and Kubernetes, and our team of experts can help you choose and implement the right solution for your organization.</p><h2 id="managed-spark-cluster">Managed Spark cluster</h2><p>A managed Spark cluster is a cloud-based solution that makes it easy to provision and manage Spark clusters. It provides a web-based interface for creating and managing Spark clusters, as well as a set of APIs for automating cluster management tasks. 
Managed Spark clusters are often used by data scientists and developers who want to quickly provision and manage Spark clusters without having to worry about the underlying infrastructure.</p><p>Ilum provides the ability to create and manage your own spark cluster, which can be run in any environment, including cloud, on-premises, or a mixture of both.</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/09/ilum-ferret-0.png" class="kg-image" alt="How to run Apache Spark on Kubernetes in less than 5min" loading="lazy" width="1024" height="1002" srcset="/blog/content/images/size/w600/2022/09/ilum-ferret-0.png 600w,/blog/content/images/size/w1000/2022/09/ilum-ferret-0.png 1000w,/blog/content/images/2022/09/ilum-ferret-0.png 1024w" sizes="(min-width: 720px) 720px"></figure><h2 id="the-pros-of-apache-spark-on-kubernetes">The Pros of Apache Spark on Kubernetes</h2><p>There has been some debate about whether Apache Spark should run on Kubernetes. </p><p>Some people argue that Kubernetes is too complex and that Spark should continue to run on its own dedicated cluster manager or stay in the cloud. Others argue that Kubernetes is the future of big data processing and that Spark should embrace it.</p><p>We are in the latter camp. We believe that Kubernetes is the future of big data processing and that Apache Spark should run on Kubernetes.</p><p>The biggest benefit of using Spark on Kubernetes is that it allows for much easier scaling of Spark applications. This is because Kubernetes is designed to handle deployments of large numbers of concurrent containers. So, if you have a Spark application that needs to process a lot of data, you can simply deploy more containers to the Kubernetes cluster to process the data in parallel. This is much easier than setting up a new Spark cluster on EMR each time you need to scale up your processing. You can run it on any cloud platform (AWS, Google Cloud, Azure, etc.) or on-premises. 
This means that you can easily move your Spark applications from one environment to another without having to worry about changing your cluster manager.</p><p>Another enormous benefit is that it allows for more flexible workflows. For example, if you need to process data from multiple sources, you can easily deploy different containers for each source and have them all processed in parallel. This is much easier than trying to manage a complex workflow on a single Spark cluster.</p><p>Kubernetes has several security features that make it a more attractive option for running Spark applications. For example, Kubernetes supports role-based access control, which allows you to fine-tune who has access to your Spark cluster.</p><p>So there you have it. These are just some of the reasons why we believe that Apache Spark should run on Kubernetes. If you&apos;re not convinced, we encourage you to try it out for yourself. We think you&apos;ll be surprised at how well it works. </p><h2 id="additional-resources">Additional Resources</h2><ul><li><a href="https://minikube.sigs.k8s.io/docs/start/?ref=blog.ilum.cloud" rel="noreferrer">Check how to install Minikube</a></li><li><a href="https://kubernetes.io/docs/home/?ref=blog.ilum.cloud" rel="noreferrer">Kubernetes Documentation</a></li><li><a href="https://ilum.cloud/?ref=blog.ilum.cloud" rel="noreferrer">Ilum Official Website</a></li><li><a href="https://ilum.cloud/docs/?ref=blog.ilum.cloud" rel="noreferrer">Ilum Official Documentation</a></li><li><a href="https://artifacthub.io/packages/helm/ilum/ilum?ref=blog.ilum.cloud" rel="noreferrer">Ilum Helm Chart</a></li></ul><h2 id="conclusion">Conclusion</h2><p>Ilum simplifies the process of installing and managing Apache Spark on Kubernetes, making it an ideal choice for both beginners and experienced users. 
By following this guide, you&#x2019;ll have a functional Spark cluster running on Kubernetes in no time.</p><div class="kg-card kg-button-card kg-align-center"><a href="https://ilum.cloud/?ref=blog.ilum.cloud" class="kg-btn kg-btn-accent">Try it, it&apos;s free</a></div>]]></content:encoded></item><item><title><![CDATA[How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes]]></title><description><![CDATA[Step-by-step guide to deploy the Ilum data platform on Google Cloud in under 30 minutes. Build a modern data lakehouse with Apache Spark, Kubernetes, and SQL.]]></description><link>https://blog.ilum.cloud/install-ilum-data-platform-google-cloud/</link><guid isPermaLink="false">68e7e868ebd34a00016649f6</guid><dc:creator><![CDATA[Florian Roscheck]]></dc:creator><pubDate>Fri, 10 Oct 2025 14:58:11 GMT</pubDate><media:content url="https://blog.ilum.cloud/content/images/2025/10/thumb_final_edited2-10.png" medium="image"/><content:encoded><![CDATA[<img src="https://blog.ilum.cloud/content/images/2025/10/thumb_final_edited2-10.png" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes"><p>Learn how to deploy the <strong>Ilum Data Platform</strong> on Google Cloud for development and experimentation and see how to go from a fast <strong>data lakehouse</strong> setup to production-ready data pipelines. Prefer a guided path? Level up with the Ilum Course at <a href="https://ilumcourse.com/?utm_source=ilum_blog&amp;utm_medium=article&amp;utm_campaign=gcp_setup" rel="noreferrer"><u>IlumCourse.com</u></a>.</p><h2 id="why-ilum-as-your-data-lakehouse-platform">Why Ilum as Your Data Lakehouse Platform</h2>
<p><strong>Ilum</strong> is a Kubernetes-native <strong>data platform</strong> that lets teams stand up a modern <strong>data lakehouse</strong> in minutes, not months. You get:</p>
<ul>
<li><strong>Apache Spark</strong> as primary compute (with optional <strong>Trino</strong> for SQL),</li>
<li>Built-in SQL editor &amp; notebooks, full <strong>Jupyter</strong> integration,</li>
<li>Orchestration &amp; operations for Spark jobs and &#x201C;virtual clusters&#x201D;,</li>
<li>Lineage &amp; versioning (table diffs, ERD &amp; column-level lineage),</li>
<li>Integrated OSS: <strong>Airflow</strong>, <strong>Superset</strong> (BI), <strong>MLflow</strong>, <strong>Gitea</strong>, and more.</li>
</ul>
<p>This guide shows a lean, single-VM setup on Google Cloud, perfect for learning, <strong>POCs</strong>, and <strong>sandboxing</strong> your lakehouse workflows.</p>
<h2 id="who-this-guide-is-for">Who This Guide Is For</h2>
<ul>
<li><strong>Developers &amp; Data Engineers</strong> evaluating a modern <strong>data platform</strong> on Google Cloud</li>
<li><strong>Analysts</strong> wanting quick SQL + dashboarding on a <strong>data lakehouse</strong></li>
<li>Teams needing a repeatable dev environment before hardening to production</li>
</ul>
<h2 id="what-you%E2%80%99ll-build">What You&#x2019;ll Build</h2>
<p>A single-VM <strong>Ilum data platform</strong> on Google Cloud that runs a development-grade <strong>data lakehouse</strong>:</p>
<ul>
<li>Launch <strong>Kubernetes</strong>, <strong>Helm</strong>, and Ilum via a startup script</li>
<li>Run Spark jobs, SQL queries, and create <strong>Superset</strong> dashboards</li>
<li>Explore built-in <strong>data lineage</strong> and table <strong>versioning</strong></li>
</ul>
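The console walkthrough below can also be scripted with the gcloud CLI once you know which settings work for you. A hedged sketch; the project, zone, machine type, and disk size are illustrative choices for a single-node sandbox, not official sizing guidance:

```shell
# Illustrative only: provision a VM roughly sized for a single-node Ilum sandbox.
# Adjust the project, zone, machine type, and disk size to your own setup.
gcloud compute instances create ilum-vm \
  --project=ilum-vm \
  --zone=us-central1-a \
  --machine-type=e2-standard-8 \
  --boot-disk-size=100GB \
  --image-family=ubuntu-2204-lts \
  --image-project=ubuntu-os-cloud
```

The step-by-step console flow in this guide remains the reference; the CLI form is simply a repeatable alternative.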
<h2 id="your-fast-track-to-a-google-cloud-data-lakehouse-with-ilum">Your Fast Track to a Google Cloud Data Lakehouse with Ilum</h2><p>Setting up a new data platform can feel overwhelming: Too many tools, too many options. Ilum changes that. In this quick-start guide, you&#x2019;ll get Ilum running on Google Cloud in under 30 minutes &#x2013; a perfect setup for starting to explore Ilum.</p><p>This guide is perfect for developers, data engineers, and analysts who want to understand Ilum&#x2019;s workflow in a hands-on way, but might not want or be able to install Ilum on their local machine. By the end, you will be ready to experiment with real data microservices, Spark jobs, and dashboards &#x2013; just like in the official Ilum Course.</p><h2 id="step-1-log-into-google-cloud">Step 1: Log into Google Cloud</h2><p>First, sign in to the Google Cloud Console. If you don&#x2019;t yet have an account, create one. You&#x2019;ll get a free trial and can start experimenting right away.</p><p>Here is what you should see once you have logged in:</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-ae6a8942-99ee-47eb-9d17-f25ba92f73bb.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="764" height="452" srcset="/blog/content/images/size/w600/2025/10/data-src-image-ae6a8942-99ee-47eb-9d17-f25ba92f73bb.png 600w,/blog/content/images/2025/10/data-src-image-ae6a8942-99ee-47eb-9d17-f25ba92f73bb.png 764w" sizes="(min-width: 720px) 720px"></figure><h2 id="step-2-create-a-new-project">Step 2: Create a new Project</h2><p>For billing and project structuring purposes, we will create a new project to run the VM (Virtual Machine) in. Here is how to create a project:</p><p>a. 
Click on &quot;Select a project&quot;:&#xA0;</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-fe6abbf8-68ca-4509-ae50-11586c76b956.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="286" height="138"></figure><p>&#xA0;<br>b. Then, click on &quot;<strong>New project</strong>&quot; on the top right: </p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-fa67e4b6-5580-4e36-a561-7a66de3e1e88.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="298" height="147"></figure><p>&#xA0;</p><p>c. Enter &quot;<strong>ilum-vm</strong>&quot; as a project name. You can leave the organization unassigned, then click &quot;<strong>Create</strong>&quot;:&#xA0;</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-852232bd-7792-462a-865a-d31f57ab9663.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="478" height="207"></figure><h2 id="step-3-set-up-billing-don%E2%80%99t-worry-it%E2%80%99s-minimal">Step 3: Set up Billing (Don&#x2019;t Worry, it&#x2019;s minimal!)</h2><p>For experimenting with Ilum, we&#x2019;ll run its setup on a single virtual machine (VM). Expect costs around $0.10-$0.40 per hour, depending on your configuration. You will have to pay for the virtual machine and the associated disk space you use for your Ilum installation &#x2013;&#xA0;unless your free trial credits cover it. For this, we need to set up billing.</p><ol><li>Click on the search bar at the top and search for &quot;<strong>billing</strong>&quot;. 
Once the&#xA0;<strong>&quot;Billing&quot;&#xA0;</strong>product appears, click it.</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-755eceac-2381-4a50-9520-adcf7990a082.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="723" height="301" srcset="/blog/content/images/size/w600/2025/10/data-src-image-755eceac-2381-4a50-9520-adcf7990a082.png 600w,/blog/content/images/2025/10/data-src-image-755eceac-2381-4a50-9520-adcf7990a082.png 723w" sizes="(min-width: 720px) 720px"></figure><ol start="2"><li>Click on &quot;Create account&quot;:</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-ed14cb5c-8099-4049-9f12-25834d00e167.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="356" height="166"></figure><ol start="3"><li>Now, follow all instructions and add a payment method until you see the newly created billing account Overview:</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-9a8ef28b-ab5d-497f-ab02-2dfa783026fe.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="570" height="284"></figure><ol start="4"><li>Next, we need to assign our &quot;<strong>ilum-vm&quot;</strong>&#xA0;project to this billing account. 
Search for the project in the search bar and click it:</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-159dfe7c-1333-4059-a56c-90e2acb1ea91.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="613" height="164" srcset="/blog/content/images/size/w600/2025/10/data-src-image-159dfe7c-1333-4059-a56c-90e2acb1ea91.png 600w,/blog/content/images/2025/10/data-src-image-159dfe7c-1333-4059-a56c-90e2acb1ea91.png 613w"></figure><ol start="5"><li>Click on &quot;<strong>Billing&quot;</strong>&#xA0;in the &quot;Quick access&quot; area:</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-a7e6e826-67e1-46cd-9696-c5526096e24a.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="781" height="464" srcset="/blog/content/images/size/w600/2025/10/data-src-image-a7e6e826-67e1-46cd-9696-c5526096e24a.png 600w,/blog/content/images/2025/10/data-src-image-a7e6e826-67e1-46cd-9696-c5526096e24a.png 781w" sizes="(min-width: 720px) 720px"></figure><ol start="6"><li>Click on &quot;Link a billing account&quot;:</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-b2591d6d-5bb3-4e67-9aae-f19c5ee67784.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="699" height="282" srcset="/blog/content/images/size/w600/2025/10/data-src-image-b2591d6d-5bb3-4e67-9aae-f19c5ee67784.png 600w,/blog/content/images/2025/10/data-src-image-b2591d6d-5bb3-4e67-9aae-f19c5ee67784.png 699w"></figure><ol start="7"><li>Select the billing account you created in step 3 above and then click on &quot;<strong>Set account</strong>&quot;:</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-56cae8b2-25d7-498c-813b-8cf430c663e0.png" class="kg-image" 
alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="622" height="296" srcset="/blog/content/images/size/w600/2025/10/data-src-image-56cae8b2-25d7-498c-813b-8cf430c663e0.png 600w,/blog/content/images/2025/10/data-src-image-56cae8b2-25d7-498c-813b-8cf430c663e0.png 622w"></figure><p>Perfect, we are done setting up the billing for our new project!</p><h2 id="step-4-enable-the-compute-api">Step 4: Enable the Compute API</h2><p>To create a virtual machine (a &quot;Compute Engine&quot; instance in Google Cloud), we first need to enable the Compute Engine API. Here is how that works:</p><ol><li>Search for&#xA0;<strong>&quot;Compute Engine API&quot;</strong>&#xA0;in the search bar and click on it:</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-83c74758-6553-4fd6-942e-a99e5feb619a.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="754" height="154" srcset="/blog/content/images/size/w600/2025/10/data-src-image-83c74758-6553-4fd6-942e-a99e5feb619a.png 600w,/blog/content/images/2025/10/data-src-image-83c74758-6553-4fd6-942e-a99e5feb619a.png 754w" sizes="(min-width: 720px) 720px"></figure><ol start="2"><li>Click on&#xA0;<strong>&quot;Enable&quot;</strong>:</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-c6323c57-97d7-40f7-8ddc-2c2f476fc82c.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="428" height="308"></figure><p>Enabling may take a minute or two. Here is how things should look after the API has been enabled. 
Note the &quot;Status: Enabled&quot; statement:</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-7b63c6bc-d0a0-4d0f-b548-9fcb00863944.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="726" height="366" srcset="/blog/content/images/size/w600/2025/10/data-src-image-7b63c6bc-d0a0-4d0f-b548-9fcb00863944.png 600w,/blog/content/images/2025/10/data-src-image-7b63c6bc-d0a0-4d0f-b548-9fcb00863944.png 726w" sizes="(min-width: 720px) 720px"></figure><h2 id="step-5-install-the-google-cloud-cli">Step 5: Install the Google Cloud CLI</h2><p>Setting up the virtual machine to run Ilum is much easier using the Google Cloud command line interface (CLI). Install it as described in the&#xA0;<a href="https://cloud.google.com/sdk/docs/install?ref=blog.ilum.cloud"><u>Google documentation</u></a>. Make sure to run&#xA0;<strong>gcloud init</strong>&#xA0;to authenticate with your Google account.</p><div class="kg-card kg-callout-card kg-callout-card-accent"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">Can&apos;t Install the Google Cloud CLI?</strong></b><br><br>On work computers, a lack of admin rights might mean you are unable to install the Google Cloud CLI. You can still install Ilum on Google Cloud and use it in the course. 
Scroll further down in this article to find instructions for an alternative installation route that is slightly more cumbersome than using the CLI, but works without local installation.</div></div><h2 id="step-6-launch-the-virtual-machine">Step 6: Launch the Virtual Machine</h2><p>With all preparations in place, let&apos;s start the virtual machine and install Ilum!</p><h3 id="download-the-startup-script">Download the startup script:</h3><p><a href="https://www.headindata.com/download-gcp-vm-startup-script?ref=blog.ilum.cloud"><u>Click here to download the script</u></a></p><p>The startup script is what we will use to automatically initialize the virtual machine. It will install Kubernetes, Helm, and Ilum &#x2013; steps you could also take manually (but that would take more time).</p><h3 id="instruct-the-google-cloud-cli-to-create-the-machine">Instruct the Google Cloud CLI to create the machine</h3><p>Paste the VM creation command below into a terminal/command line on your computer.</p><p>Make sure to run the terminal from the same directory where you have downloaded the startup script to, or point the command below to the absolute path of the&#xA0;<strong>start_vm.sh</strong>&#xA0;script (the&#xA0;<strong>--metadata-from-file</strong>&#xA0;line).</p><p>You can also adjust the&#xA0;<strong>ZONE</strong>&#xA0;to fit a zone close to you. Here,&#xA0;<strong>europe-north2-a</strong>&#xA0;was selected for its low price and sustainable footprint.</p><pre><code class="language-bash">PROJECT_ID=ilum-vm
ZONE=europe-north2-a 

gcloud config set project $PROJECT_ID 
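# Note: the three Spot-related flags below (--provisioning-model,
# --instance-termination-action, --maintenance-policy) request a cheaper
# preemptible VM; remove them if you prefer an on-demand machine.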
gcloud compute instances create ilum-dev-node \
  --zone $ZONE \
  --machine-type e2-custom-12-18432 \
  --provisioning-model=SPOT \
  --instance-termination-action=STOP \
  --maintenance-policy=TERMINATE \
  --network-interface=subnet=default,network-tier=STANDARD \
  --image-family ubuntu-2204-lts --image-project ubuntu-os-cloud \
  --boot-disk-type=pd-balanced --boot-disk-size=100GB \
  --metadata-from-file startup-script=start_vm.sh</code></pre><p>Once you have made all necessary adjustments, execute the code. Now, the machine should start up. This step installs Kubernetes, Helm, and Ilum so you can explore a development-grade data lakehouse on a single VM.</p><p><strong>Note:</strong>&#xA0;Above, we have chosen a&#xA0;<em>preemptible machine</em>. This keeps cost at roughly a quarter of that of a non-preemptible machine, but has an important disadvantage: when Google Cloud needs the compute for other clients, your virtual machine will be shut down. If you are willing to spend more on your virtual machine to avoid random shutdowns, remove the&#xA0;<strong>--provisioning-model</strong>,&#xA0;<strong>--instance-termination-action</strong>, and&#xA0;<strong>--maintenance-policy</strong>&#xA0;lines from the command above before creating the machine.</p><h3 id="monitor-the-startup-process">Monitor the startup process</h3><p>After 5-10 seconds, execute the following code to watch what is happening inside the virtual machine:</p><pre><code class="language-bash">gcloud compute ssh ilum-dev-node --zone $ZONE -- 'sudo tail -f /var/log/startup-script.log'</code></pre><p>Wait until &quot;Startup script completed successfully.&quot; appears; this may take a couple of minutes:</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-6d6890f8-b846-4f7e-afc4-2cd471ec9dde.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="716" height="322" srcset="/blog/content/images/size/w600/2025/10/data-src-image-6d6890f8-b846-4f7e-afc4-2cd471ec9dde.png 600w,/blog/content/images/2025/10/data-src-image-6d6890f8-b846-4f7e-afc4-2cd471ec9dde.png 716w"></figure><h2 id="step-7-access-the-ilum-ui">Step 7: Access the Ilum UI</h2><p>We&apos;re ready to explore Ilum! 
Run the following command to forward the port of the Ilum UI on the remote virtual machine to your own computer:</p><pre><code class="language-bash">gcloud compute ssh ilum-dev-node --zone=$ZONE -- -L 31777:localhost:31777</code></pre><p>Now, open&#xA0;<a href="http://localhost:31777/?ref=blog.ilum.cloud"><u>http://localhost:31777</u></a>&#xA0;in your browser to access Ilum. Access works as long as the above command is running.</p><p>To log in to your new Ilum installation, use the&#xA0;<strong>username &quot;admin&quot;&#xA0;</strong>and the&#xA0;<strong>password &quot;admin&quot;</strong>. Welcome to your new Ilum installation:</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-049389ef-3f2a-4fa2-9538-39dfef6ccfa5.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="771" height="481" srcset="/blog/content/images/size/w600/2025/10/data-src-image-049389ef-3f2a-4fa2-9538-39dfef6ccfa5.png 600w,/blog/content/images/2025/10/data-src-image-049389ef-3f2a-4fa2-9538-39dfef6ccfa5.png 771w" sizes="(min-width: 720px) 720px"></figure><p>Think of the Ilum UI as the control plane of your data platform: run Spark jobs, write SQL, check data lineage, and build dashboards.</p><p><strong>Important:</strong>&#xA0;This is strictly an installation for development (and course taking!) purposes &#x2013; it lacks the load balancing, security, and resilience needed for a production environment.</p><h2 id="step-8-manage-the-virtual-machine">Step 8: Manage the Virtual Machine</h2><p>As long as the virtual machine is running and disk space is reserved for it, it will incur cost. 
You can learn more about this cost&#xA0;<a href="https://cloud.google.com/products/calculator?hl=en&amp;dl=CjhDaVF5WVRBNU9URmtZeTAyWVRsaExUUXpZakl0T0dabE5DMWpPVEZpTm1VMU1HTXpOVE1RQVE9PRAIGiQxNDQ4RUJENy05REE2LTQyQUMtOTFBMC1DRjMxMTI1NjUwRTc&amp;ref=blog.ilum.cloud"><u>here</u></a>.</p><h3 id="shutting-down">Shutting down</h3><p>When you are not using the machine, e.g. when you have finished your learning session, shut down your machine like this:</p><pre><code class="language-bash">gcloud compute instances stop ilum-dev-node --zone=$ZONE</code></pre><h3 id="restarting">Restarting</h3><p>To restart the machine, use this command:</p><pre><code class="language-bash">gcloud compute instances start ilum-dev-node --zone=$ZONE</code></pre><p>When you have restarted the machine, it will take some minutes until Ilum is up and running again. You can check the status of Ilum&apos;s Kubernetes pods with the following command once you have connected to the virtual machine as shown in the last step:</p><pre><code class="language-bash">kubectl get pods -n ilum</code></pre><p>Ilum is ready to connect once all of its core containers are running again:</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-47173976-a23b-4898-92e6-49c33a67d7e7.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="690" height="382" srcset="/blog/content/images/size/w600/2025/10/data-src-image-47173976-a23b-4898-92e6-49c33a67d7e7.png 600w,/blog/content/images/2025/10/data-src-image-47173976-a23b-4898-92e6-49c33a67d7e7.png 690w"></figure><h3 id="removing">Removing</h3><p>To completely remove the machine, incl. 
the attached storage, and stop it from incurring any cost, use this command:</p><pre><code class="language-bash">gcloud compute instances delete ilum-dev-node --zone &quot;$ZONE&quot; --delete-disks=all</code></pre><h1 id="alternative-route-without-google-cloud-cli">Alternative Route Without Google Cloud CLI</h1><p>If you cannot install the Google Cloud CLI on your machine as described above, then setting up a virtual machine to run Ilum is slightly more involved. Follow the instructions below. This assumes that you have followed the tutorial above up to and including &quot;Step 4: Enable the Compute API&quot;.</p><h2 id="set-up-a-firewall-rule">Set up a Firewall Rule</h2><p>Assuming you intend to use the virtual machine for development purposes only, will not upload sensitive data, and rely on Ilum&apos;s own authentication system, the easiest and most practical way to expose Ilum is to open a port in the virtual machine&apos;s firewall.</p><p>We will only open this port to your own IP &#x2013; decreasing the risk of unintended access by a third party. 
However, if you are on a company network, many devices might share the same IP, which means your colleagues might be able to access the Ilum instance and log in if they know the IP of the instance and have Ilum&apos;s access credentials (the default is &quot;admin&quot; as username and &quot;admin&quot; as password).</p><ol><li>Search for&#xA0;<strong>&quot;firewall&quot;</strong>&#xA0;in the search bar and click on it:</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-c064d05d-5513-42ee-b2cf-df0a1f9617e5.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="615" height="195" srcset="/blog/content/images/size/w600/2025/10/data-src-image-c064d05d-5513-42ee-b2cf-df0a1f9617e5.png 600w,/blog/content/images/2025/10/data-src-image-c064d05d-5513-42ee-b2cf-df0a1f9617e5.png 615w"></figure><ol start="2"><li>Click on&#xA0;&quot;Create firewall rule&quot;:</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-4f983e52-f9b9-44dd-a0ae-aa5841ca9a52.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="570" height="94"></figure><ol start="3"><li>Configure the following settings. Make sure to enter your IP, followed by &quot;/32&quot;. 
You can find out your IP via websites like&#xA0;<a href="https://ipinfo.io/what-is-my-ip?ref=blog.ilum.cloud"><u>ipinfo.io</u></a>.</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-02d5d458-6a07-4ad2-9ba7-37ea43ea481a.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="562" height="1182"></figure><h2 id="configure-and-launch-the-virtual-machine">Configure and Launch the Virtual Machine</h2><ol><li>Search for&#xA0;<strong>&quot;compute engine&quot;&#xA0;</strong>(not &quot;compute engine api&quot; as before)&#xA0;in the search bar and click on it:</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-809735c7-05aa-4a1f-bbb7-238034ff0dfd.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="567" height="154"></figure><ol start="2"><li>Click on&#xA0;&quot;Create instance&quot;:</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-3520e546-6cfb-4fba-bc0e-4585122e9bc7.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="534" height="205"></figure><ol start="3"><li>In the&#xA0;<strong>Machine Configuration</strong>&#xA0;tab, make the following settings:</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-df43e0f4-d4dd-4406-bf40-6e92fb3ef3e2.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="697" height="826" srcset="/blog/content/images/size/w600/2025/10/data-src-image-df43e0f4-d4dd-4406-bf40-6e92fb3ef3e2.png 600w,/blog/content/images/2025/10/data-src-image-df43e0f4-d4dd-4406-bf40-6e92fb3ef3e2.png 697w"></figure><ol start="4"><li>Under&#xA0;<strong>OS and Storage</strong>, set up the virtual machine like 
this:</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-adb03b04-872b-41fa-a5cc-e1a3607baed0.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="615" height="433" srcset="/blog/content/images/size/w600/2025/10/data-src-image-adb03b04-872b-41fa-a5cc-e1a3607baed0.png 600w,/blog/content/images/2025/10/data-src-image-adb03b04-872b-41fa-a5cc-e1a3607baed0.png 615w"></figure><ol start="5"><li>In the&#xA0;<strong>&quot;Data Protection&quot;&#xA0;</strong>tab, select&#xA0;<strong>&quot;No backups&quot;&#xA0;</strong>(this will save cost).</li><li>In the&#xA0;<strong>&quot;Networking&quot;&#xA0;</strong>tab, add the&#xA0;<strong>&quot;ilum-ui&quot;</strong>&#xA0;tag:</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-df1141a0-26a9-441a-b451-d220a12bc5a5.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="747" height="344" srcset="/blog/content/images/size/w600/2025/10/data-src-image-df1141a0-26a9-441a-b451-d220a12bc5a5.png 600w,/blog/content/images/2025/10/data-src-image-df1141a0-26a9-441a-b451-d220a12bc5a5.png 747w" sizes="(min-width: 720px) 720px"></figure><ol start="7"><li>Under&#xA0;&quot;Observability&quot;, disable&#xA0;&quot;Install Ops agent for Monitoring and Logging&quot;&#xA0;(this will save cost).</li><li>In the&#xA0;<strong>&quot;Advanced&quot;&#xA0;</strong>tab: Download the startup script below, open it in a text editor, and copy and paste its contents into the &quot;<strong>Startup script</strong>&quot; field. 
Then, select&#xA0;<strong>&quot;Spot&quot;</strong>&#xA0;as VM provisioning model (optional; read more about this in the note under &quot;Step 6: Launch the Virtual Machine&quot; above):</li></ol><p><a href="https://www.headindata.com/download-gcp-vm-startup-script?ref=blog.ilum.cloud"><u>Click here to download the script</u></a></p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-9aaec3b3-c0df-4b1a-afb1-243d4ba3f3cf.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="750" height="550" srcset="/blog/content/images/size/w600/2025/10/data-src-image-9aaec3b3-c0df-4b1a-afb1-243d4ba3f3cf.png 600w,/blog/content/images/2025/10/data-src-image-9aaec3b3-c0df-4b1a-afb1-243d4ba3f3cf.png 750w" sizes="(min-width: 720px) 720px"></figure><ol start="9"><li>Finally, click on&#xA0;<strong>&quot;Create&quot;</strong>&#xA0;to create the virtual machine:</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-73ada6f0-f37e-42fe-ad02-5d0bbcc23811.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="469" height="140"></figure><ol start="10"><li>Startup, incl. starting Ilum, will take a couple of minutes. You can monitor the status of the Ilum initialization via the logs:</li></ol><figure class="kg-card kg-image-card kg-width-wide"><img src="/blog/content/images/2025/10/data-src-image-7774f8cc-f1f5-48c9-a7fb-d5f604863ea6.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="920" height="423" srcset="/blog/content/images/size/w600/2025/10/data-src-image-7774f8cc-f1f5-48c9-a7fb-d5f604863ea6.png 600w,/blog/content/images/2025/10/data-src-image-7774f8cc-f1f5-48c9-a7fb-d5f604863ea6.png 920w"></figure><p>Once you see &quot;Startup script completed successfully&quot;, Ilum is ready to use. 
(To see live logs, you might have to click on the &quot;stream logs&quot; button on the top right.)</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-614f621d-ed31-4505-a083-5d8532206247.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="548" height="303"></figure><h2 id="connect-to-ilum">Connect to Ilum</h2><p>To connect to Ilum, copy the External IP of the virtual machine into your browser bar (you can find it in Google Cloud in the list of VM instances, see below) and append &quot;:31777&quot; to it.&#xA0;</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-37b0f8a4-d776-47f7-932a-af6eafc612cb.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="737" height="161" srcset="/blog/content/images/size/w600/2025/10/data-src-image-37b0f8a4-d776-47f7-932a-af6eafc612cb.png 600w,/blog/content/images/2025/10/data-src-image-37b0f8a4-d776-47f7-932a-af6eafc612cb.png 737w" sizes="(min-width: 720px) 720px"></figure><p>When you connect to this URL, you should see the Ilum login screen:</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-1f2eee4d-5820-48f6-8e94-e36887d0ecc4.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="550" height="354"></figure><p>Log in with Ilum&apos;s out-of-the-box credentials to get started (Username: admin, Password: admin).</p><h2 id="manage-the-virtual-machine">Manage the Virtual Machine</h2><p>As long as the virtual machine is running and disk space is reserved for it, it will incur cost. 
You can learn more about this cost&#xA0;<a href="https://cloud.google.com/products/calculator?hl=en&amp;dl=CjhDaVF5WVRBNU9URmtZeTAyWVRsaExUUXpZakl0T0dabE5DMWpPVEZpTm1VMU1HTXpOVE1RQVE9PRAIGiQxNDQ4RUJENy05REE2LTQyQUMtOTFBMC1DRjMxMTI1NjUwRTc&amp;ref=blog.ilum.cloud"><u>here</u></a>.</p><p>It is recommended to&#xA0;<strong>stop</strong>&#xA0;the machine when you are not using Ilum. This will bring cost down significantly, as you will only be paying for persistent storage but not for CPUs and RAM. After having stopped the machine, you can&#xA0;<strong>resume</strong>&#xA0;it when you are experimenting with Ilum again.</p><p>Once you are completely done experimenting with Ilum,&#xA0;delete&#xA0;the machine incl. its attached storage &#x2013; after this, you will not incur any cost.</p><p>All machine management options are available in the menu available via the 3 dots in the far right of the machine table:</p><figure class="kg-card kg-image-card kg-width-wide"><img src="/blog/content/images/2025/10/data-src-image-c52f59be-4d0c-47e5-bc1b-c0c0668a7b19.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="858" height="404" srcset="/blog/content/images/size/w600/2025/10/data-src-image-c52f59be-4d0c-47e5-bc1b-c0c0668a7b19.png 600w,/blog/content/images/2025/10/data-src-image-c52f59be-4d0c-47e5-bc1b-c0c0668a7b19.png 858w"></figure><p><strong>Congratulations! </strong>You now have Ilum up and running on Google Cloud and are ready to experiment, build, and explore. But if you want to rapidly go from &#x201C;it works&#x201D; to production-ready pipelines, Spark microservices, and live dashboards, then the&#xA0;<strong>official Ilum Course</strong>&#xA0;is your fast track. 
In just a few hours you&#x2019;ll build real, deployable components (SQL, Spark jobs, Superset dashboards) and get hands-on guidance from the instructor and the Ilum team.</p><figure class="kg-card kg-image-card"><a href="https://ilumcourse.com/?utm_source=ilum_blog&amp;utm_medium=article&amp;utm_campaign=gcp_setup"><img src="/blog/content/images/2025/10/blog_header_hires.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="2000" height="783" srcset="/blog/content/images/size/w600/2025/10/blog_header_hires.png 600w,/blog/content/images/size/w1000/2025/10/blog_header_hires.png 1000w,/blog/content/images/size/w1600/2025/10/blog_header_hires.png 1600w,/blog/content/images/2025/10/blog_header_hires.png 2000w" sizes="(min-width: 720px) 720px"></a></figure><p>Join now!<br><a href="https://ilumcourse.com/?utm_source=ilum_blog&amp;utm_medium=article&amp;utm_campaign=gcp_setup" rel="noreferrer"><u>Enroll in the Ilum Course &#x2192;</u></a></p><h2 id="from-dev-to-production-hardening-your-data-lakehouse">From Dev to Production: Hardening Your Data Lakehouse</h2>
<p>This single-VM setup is for development only. To run Ilum as a production-grade <strong>data platform</strong> on Google Cloud, plan to:</p>
<ul>
<li>Deploy on a managed Kubernetes cluster (e.g., GKE) with <strong>multi-node</strong> resilience</li>
<li>Add <strong>load balancing</strong>, <strong>TLS/HTTPS</strong>, and <strong>OIDC/SSO</strong></li>
<li>Configure <strong>backup/restore</strong>, <strong>autoscaling</strong>, and <strong>observability</strong></li>
<li>Separate compute and storage; use <strong>object storage</strong> for data durability</li>
<li>Set up <strong>access control</strong> and <strong>network policies</strong></li>
</ul>
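<p>As a rough sketch (the cluster name, zone, node count, and machine type below are illustrative assumptions, not sized recommendations), a managed setup can start from a small autoscaling GKE cluster and reuse the same Helm chart as the single-VM install:</p><pre><code class="language-bash"># Illustrative: create a small zonal GKE cluster with autoscaling
gcloud container clusters create ilum-prod \
  --zone europe-north2-a \
  --num-nodes 3 \
  --machine-type e2-standard-8 \
  --enable-autoscaling --min-nodes 3 --max-nodes 9

# Point kubectl at the new cluster
gcloud container clusters get-credentials ilum-prod --zone europe-north2-a

# Install Ilum into its own namespace via Helm
helm repo add ilum https://charts.ilum.cloud
helm repo update
helm install ilum ilum/ilum --namespace ilum --create-namespace</code></pre><p>From there, TLS, SSO, backups, and network policies are layered on with your usual Kubernetes tooling and the Ilum docs linked below.</p>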
<h2 id="keep-building-with-ilum">Keep Building with Ilum</h2>
<ul>
<li><strong>Architecture Overview</strong> &#x2192; <a href="https://ilum.cloud/docs/architecture/?ref=blog.ilum.cloud">https://ilum.cloud/docs/architecture/</a></li>
<li><strong>Use Cases</strong> (e.g., Transactions) &#x2192; <a href="https://ilum.cloud/docs/use_cases/transaction/?ref=blog.ilum.cloud">https://ilum.cloud/docs/use_cases/transaction/</a></li>
<li><strong>Docs Home</strong> &#x2192; <a href="https://ilum.cloud/docs/?ref=blog.ilum.cloud">https://ilum.cloud/docs/</a></li>
<li><strong>Ilum Course</strong> &#x2192; <a href="https://ilumcourse.com/?utm_source=ilum_blog&amp;utm_medium=article&amp;utm_campaign=gcp_setup">https://IlumCourse.com</a></li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Running Jupyter on Kubernetes with Spark (and Why Ilum Makes It Easy)]]></title><description><![CDATA[End-to-end guide to Data Science on Kubernetes: launch Jupyter with Spark using Ilum&#x2019;s free data lakehouse, Livy API, and production best practices.]]></description><link>https://blog.ilum.cloud/data-science-on-kubernetes/</link><guid isPermaLink="false">634c949fb1575600013b87dd</guid><dc:creator><![CDATA[Ilum]]></dc:creator><pubDate>Fri, 08 Aug 2025 11:41:00 GMT</pubDate><media:content url="https://blog.ilum.cloud/content/images/2022/12/ilum-ferret2-3.png" medium="image"/><content:encoded><![CDATA[<img src="https://blog.ilum.cloud/content/images/2022/12/ilum-ferret2-3.png" alt="Running Jupyter on Kubernetes with Spark (and Why Ilum Makes It Easy)"><p><br>If you&#x2019;ve done data science long enough, you&#x2019;ve met this moment: you open a Jupyter notebook, hit <strong>Shift+Enter</strong>, and the cell just&#x2026; hangs. Somewhere a Spark driver is starved for memory, or a cluster is &#x201C;almost&#x201D; configured, or the library versions don&#x2019;t quite agree. You&#x2019;re here to explore data, not babysit infrastructure.</p><p>That&#x2019;s where <strong>Kubernetes</strong> shines and where <strong>Ilum</strong>, a free, cloud-native data lakehouse focused on <strong>running and monitoring Apache Spark on Kubernetes</strong>, helps you go from &#x201C;why is this stuck?&#x201D; to &#x201C;let&#x2019;s try that model&#x201D; with far less friction.</p><p>In this post, we&#x2019;ll take a narrative walk from a clean laptop to a working <strong>Jupyter</strong> and <strong>Apache Zeppelin</strong> setup on <strong>Kubernetes</strong>, both backed by <strong>Apache Spark</strong> and connected through a <strong>Livy-compatible API</strong>. 
You&#x2019;ll see how Ilum plugs into the picture, why it matters for day-to-day data work, and how to scale from a single demo box to a real cluster without rewriting your notebooks.</p><h2 id="why-kubernetes-for-data-science">Why Kubernetes for Data Science<br></h2><p>Kubernetes (K8s) gives you three things notebooks secretly crave:</p><ul><li><strong>Elasticity</strong>: Executors scale up when a join explodes, and back down when you&#x2019;re idle.</li><li><strong>Isolation</strong>: Each user or team runs in a clean, containerized environment&#x2014;no shared-conda-env roulette.</li><li><strong>Repeatability</strong>: &#x201C;It works on my machine&#x201D; becomes &#x201C;it works because it&#x2019;s declarative.&#x201D;</li></ul><p>Pair K8s with <strong>Spark</strong> and a notebook front-end, and you get an interactive analytics platform that grows with your data. Pair it with <strong>Ilum</strong>, and you also get the plumbing and observability&#x2014;<strong>logs and metrics</strong>&#x2014;that keep you moving when things go sideways.</p><h2 id="the-missing-piece-speaking-livy">The missing piece: speaking Livy<br></h2><p>Both <strong>Jupyter (via Sparkmagic)</strong> and <strong>Zeppelin (via its Livy interpreter)</strong> speak the <strong>Livy REST API</strong> to manage Spark sessions. Ilum implements that API through an embedded <strong><code>ilum-livy-proxy</code></strong>, so your notebooks can create and use Spark sessions while Ilum handles the Spark-on-K8s lifecycle behind the scenes.</p><p>Think of it this way:</p><blockquote><strong>Notebook &#x2192; Livy API &#x2192; Ilum &#x2192; Spark on Kubernetes</strong><br>You write cells. Ilum speaks Kubernetes. Spark does the heavy lifting.</blockquote><p>No special notebook rewrites, no custom glue.</p><h2 id="a-gentle-from-scratch-run-through">A gentle, from-scratch run-through<br></h2><p>Let&#x2019;s build a tiny playground locally with <strong>Minikube</strong>. 
Later, you can point the exact same setup at GKE/EKS/AKS or an on-prem K8s distribution.</p><h3 id="1-start-a-small-kubernetes-cluster">1) Start a small Kubernetes cluster</h3><p>You don&#x2019;t need production horsepower to try this&#x2014;just a few CPUs and memory.</p><pre><code class="language-bash">minikube start --cpus 6 --memory 12288 --addons metrics-server

kubectl get nodes</code></pre><p>You&#x2019;ll see a node and the metrics add-on come up. That&#x2019;s enough to host Spark drivers and executors for a demo.</p><h3 id="2-install-ilum-with-jupyter-zeppelin-and-the-livy-proxy">2) Install Ilum with Jupyter, Zeppelin, and the Livy proxy</h3><p>Ilum ships Helm charts so you don&#x2019;t have to handcraft manifests.</p><pre><code class="language-bash">helm repo add ilum https://charts.ilum.cloud
helm repo update

helm install ilum ilum/ilum \
  --set ilum-zeppelin.enabled=true \
  --set ilum-jupyter.enabled=true \
  --set ilum-livy-proxy.enabled=true
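
# optionally block until all pods report Ready before moving on
# (sketch: assumes the default namespace; adjust -n as needed)
kubectl wait --for=condition=Ready pod --all --timeout=10m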

kubectl get pods -w</code></pre><p>Grab a coffee while pods settle. When the dust clears, you have Ilum&#x2019;s core plus bundled <strong>Jupyter</strong> and <strong>Zeppelin</strong> ready to talk to Spark through the <strong>Livy-compatible proxy</strong>.</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/12/jupyter_zeppelin_pods.png" class="kg-image" alt="Running Jupyter on Kubernetes with Spark (and Why Ilum Makes It Easy)" loading="lazy" width="367" height="196"></figure><h3 id="accessing-the-ilum-ui-after-installation">Accessing the Ilum UI after installation</h3><p><strong>a) Port-forward (quick local access)</strong></p><pre><code class="language-bash">kubectl port-forward svc/ilum-ui 9777:9777</code></pre><p>Open: <code>http://localhost:9777</code></p><p><strong>b) NodePort (stable for testing)</strong><br>For testing, a <strong>NodePort</strong> is enabled by default (avoids port-forward drops).<br>Open: <code>http://&lt;KUBERNETES_NODE_IP&gt;:31777</code><br><em>Tip:</em> Get the node IP with <code>kubectl get nodes -o wide</code> (or <code>minikube ip</code> on Minikube).</p><p><strong>c) Minikube shortcut</strong></p><pre><code class="language-bash">minikube service ilum-ui</code></pre><p>This opens the service in your browser (or prints the URL) using your Minikube IP.</p><p><strong>d) Production</strong><br>Use an <strong>Ingress</strong> (TLS, domain, auth). 
See: <a href="https://ilum.cloud/docs/configuration/ilum-ui/?ref=blog.ilum.cloud#ilum-ui-ingress-parameters" rel="noopener">https://ilum.cloud/docs/configuration/ilum-ui/#ilum-ui-ingress-parameters</a></p><hr><h2 id="first-contact-jupyter-sparkmagic-via-ilum-ui">First contact: Jupyter + Sparkmagic (via Ilum UI)</h2><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/12/jupyter_logo.png" class="kg-image" alt="Running Jupyter on Kubernetes with Spark (and Why Ilum Makes It Easy)" loading="lazy" width="684" height="437" srcset="/blog/content/images/size/w600/2022/12/jupyter_logo.png 600w,/blog/content/images/2022/12/jupyter_logo.png 684w"></figure><p>Open the <strong>Ilum UI</strong> in your browser. From the left navigation bar, go to <strong>Notebooks</strong>.</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/12/jupyter_ui.png" class="kg-image" alt="Running Jupyter on Kubernetes with Spark (and Why Ilum Makes It Easy)" loading="lazy" width="1274" height="654" srcset="/blog/content/images/size/w600/2022/12/jupyter_ui.png 600w,/blog/content/images/size/w1000/2022/12/jupyter_ui.png 1000w,/blog/content/images/2022/12/jupyter_ui.png 1274w" sizes="(min-width: 720px) 720px"></figure><p>In Jupyter:</p><ol><li>Create a <strong>Python 3</strong> notebook.</li><li>Load Sparkmagic:</li></ol><pre><code class="language-python">%load_ext sparkmagic.magics</code></pre><ol start="3"><li>Open the spark session manager</li></ol><pre><code class="language-python">%manage_spark</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="/blog/content/images/2025/08/Screenshot-from-2025-08-09-16-38-06.png" class="kg-image" alt="Running Jupyter on Kubernetes with Spark (and Why Ilum Makes It Easy)" loading="lazy" width="750" height="955" srcset="/blog/content/images/size/w600/2025/08/Screenshot-from-2025-08-09-16-38-06.png 600w,/blog/content/images/2025/08/Screenshot-from-2025-08-09-16-38-06.png 750w" sizes="(min-width: 
720px) 720px"><figcaption><span style="white-space: pre-wrap;">jupyter ilum spark form</span></figcaption></figure><p><strong>Basic Settings</strong></p><ul><li><strong>Endpoint</strong> &#x2013; Select the preconfigured Livy endpoint (usually <code>http://ilum-livy-proxy:8998</code>). This is how Jupyter/Sparkmagic talks to Ilum.</li><li><strong>Cluster</strong> &#x2013; Choose the target cluster (e.g., <code>default</code>). If you manage multiple K8s clusters in Ilum, pick the one you want your driver/executors to run on.</li><li><strong>Session Name</strong> &#x2013; Any short identifier (e.g., <code>eda-january</code>). You&#x2019;ll see this in Sparkmagic session lists.</li><li><strong>Language</strong> &#x2013; <code>python</code> (PySpark) or <code>scala</code>. Most Jupyter users go with Python.</li><li><strong>Spark Image</strong> &#x2013; The container image for your Spark driver/executors (e.g., <code>ilum/spark:3.5.3-delta</code>). Images tagged with <code>-delta</code> already include Delta Lake.</li><li><strong>Extra Packages</strong> &#x2013; Comma-separated extras to pull into the session (e.g., <code>numpy,delta</code>).<ul><li>Tip: if you selected a <code>-delta</code> image, you usually don&#x2019;t need to add <code>delta</code> again.</li></ul></li><li><strong>Enable autopause</strong> &#x2013; When checked, Ilum will automatically pause the session after it&#x2019;s idle for a while to save resources. 
You can resume it from the UI.</li></ul><p><strong>Resource Settings</strong></p><ul><li><strong>Driver Settings</strong><ul><li><strong>Driver Memory</strong> &#x2013; Start with <code>1g</code> for demos; bump to <code>2&#x2013;4g</code> for heavier notebooks.</li><li><strong>Driver Cores</strong> &#x2013; <code>1</code> is fine for exploratory work; increase if your driver does more coordination/collects.</li></ul></li><li><strong>Executor Settings</strong> (collapsed by default)<ul><li>Configure only if you want to override defaults; many users rely on dynamic allocation (below).</li></ul></li></ul><p><strong>More Advanced Options</strong></p><ul><li><strong>Custom Spark Config</strong> &#x2013; JSON map for <code>spark.*</code> keys (e.g., event logs, S3 creds, serializer). Example: <code>{<br>  &quot;spark.eventLog.enabled&quot;: &quot;true&quot;,<br>  &quot;spark.sql.adaptive.enabled&quot;: &quot;true&quot;<br>}</code><br></li><li><strong>SQL Extension</strong> &#x2013; Pre-fills for Delta Lake: <code>io.delta.sql.DeltaSparkSessionExtension</code>. Leave as-is if you plan to read/write Delta tables.</li><li><strong>Driver Extra Java Options</strong> &#x2013; JVM flags for the driver. The defaults (<code>-Divy.cache.dir=/tmp -Divy.home=/tmp</code>) keep Ivy dependency caches inside the container.</li><li><strong>Executor Extra Java Options</strong> &#x2013; Same idea, but for executors. Leave empty unless you need specific JVM flags.</li><li><strong>Dynamic Allocation</strong> &#x2013; Let Spark scale executors automatically.<ul><li><strong>Min Executors</strong> &#x2013; Floor for scaling (e.g., <code>1</code>).</li><li><strong>Initial Executors</strong> &#x2013; Startup size (e.g., <code>2&#x2013;3</code>).</li><li><strong>Max Executors</strong> &#x2013; Ceiling (e.g., <code>10</code> for demos, higher in prod).</li></ul></li><li><strong>Shuffle Partitions</strong> &#x2013; Number of partitions for wide ops (e.g., <code>200</code>). 
Rule of thumb: start near 2&#x2013;3&#xD7; total executor cores, then tune.</li></ul><p>Click <strong>Create Session</strong>. Ilum will start the Spark driver and executors on Kubernetes; the first pull can take a minute if images are new. When the status flips to <strong>available</strong>, you&#x2019;re ready to run cells:</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/12/spark_session_started.png" class="kg-image" alt="Running Jupyter on Kubernetes with Spark (and Why Ilum Makes It Easy)" loading="lazy" width="1117" height="186" srcset="/blog/content/images/size/w600/2022/12/spark_session_started.png 600w,/blog/content/images/size/w1000/2022/12/spark_session_started.png 1000w,/blog/content/images/2022/12/spark_session_started.png 1117w" sizes="(min-width: 720px) 720px"></figure><pre><code class="language-python">%%spark
spark.range(0, 100000).selectExpr(&quot;count(*) as n&quot;).show()</code></pre><pre><code class="language-python">%%spark
from pyspark.sql import Row
rows = [Row(id=1, city=&quot;Warsaw&quot;), Row(id=2, city=&quot;Riyadh&quot;), Row(id=3, city=&quot;Austin&quot;)]
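# (sketch) the same rows as a distributed DataFrame, built on the cluster
df = spark.createDataFrame(rows)
df.show()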
print(rows)</code></pre><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/12/jupyter_ilum.png" class="kg-image" alt="Running Jupyter on Kubernetes with Spark (and Why Ilum Makes It Easy)" loading="lazy" width="1442" height="811" srcset="/blog/content/images/size/w600/2022/12/jupyter_ilum.png 600w,/blog/content/images/size/w1000/2022/12/jupyter_ilum.png 1000w,/blog/content/images/2022/12/jupyter_ilum.png 1442w" sizes="(min-width: 720px) 720px"></figure><p>It&#x2019;s a small example, but the important part is what just happened: <strong>Jupyter spoke Livy, Ilum created a Spark session on Kubernetes, and Spark did the work</strong>. You didn&#x2019;t touch a single YAML by hand.</p><h2 id="or-if-you-prefer-apache-zeppelin">Or if you prefer: Apache Zeppelin</h2><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/12/zeppelin_classic_logo.png" class="kg-image" alt="Running Jupyter on Kubernetes with Spark (and Why Ilum Makes It Easy)" loading="lazy" width="462" height="313"></figure><p>Some teams love Zeppelin for its multi-language paragraphs and shareable notes. That works here too.</p><ol><li>To execute code, we need to create a note:</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/12/zeppelin_step_1.png" class="kg-image" alt="Running Jupyter on Kubernetes with Spark (and Why Ilum Makes It Easy)" loading="lazy" width="968" height="451" srcset="/blog/content/images/size/w600/2022/12/zeppelin_step_1.png 600w,/blog/content/images/2022/12/zeppelin_step_1.png 968w" sizes="(min-width: 720px) 720px"></figure><p>2. 
As the communication with Ilum is handled via livy-proxy, we need to choose livy as the default interpreter.</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/12/create_ilum.png" class="kg-image" alt="Running Jupyter on Kubernetes with Spark (and Why Ilum Makes It Easy)" loading="lazy" width="634" height="391" srcset="/blog/content/images/size/w600/2022/12/create_ilum.png 600w,/blog/content/images/2022/12/create_ilum.png 634w"></figure><p>3. Now let&#x2019;s open the note and put some code into the paragraph:</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/12/zeppelin_step_2.png" class="kg-image" alt="Running Jupyter on Kubernetes with Spark (and Why Ilum Makes It Easy)" loading="lazy" width="901" height="682" srcset="/blog/content/images/size/w600/2022/12/zeppelin_step_2.png 600w,/blog/content/images/2022/12/zeppelin_step_2.png 901w" sizes="(min-width: 720px) 720px"></figure><p><br>Like Jupyter, Zeppelin also comes with a predefined configuration for Ilum. You can customize the settings easily. 
Just open the context menu in the top right corner and click the interpreter button.</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/12/zeppelin_step_3.png" class="kg-image" alt="Running Jupyter on Kubernetes with Spark (and Why Ilum Makes It Easy)" loading="lazy" width="228" height="268"></figure><p>There is a long list of interpreters and their properties that could be customized.</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/12/zeppelin_step_4.png" class="kg-image" alt="Running Jupyter on Kubernetes with Spark (and Why Ilum Makes It Easy)" loading="lazy" width="1426" height="379" srcset="/blog/content/images/size/w600/2022/12/zeppelin_step_4.png 600w,/blog/content/images/size/w1000/2022/12/zeppelin_step_4.png 1000w,/blog/content/images/2022/12/zeppelin_step_4.png 1426w" sizes="(min-width: 720px) 720px"></figure><p>Zeppelin provides 3 different modes to run the interpreter process: shared, scoped, and isolated. You can learn more about the interpreter binding mode <a href="https://zeppelin.apache.org/docs/0.8.0/usage/interpreter/interpreter_binding_mode.html?ref=blog.ilum.cloud">here</a>.</p><p>Same experience: Zeppelin sends code through Livy. Ilum spins up and manages the session on K8s. Spark runs it.</p><hr><h2 id="a-quick-detour-into-production-thinking-without-killing-the-vibe">A quick detour into production thinking (without killing the vibe)</h2><p>Here are the ideas you&#x2019;ll care about when this graduates from demo to team-wide platform:</p><ul><li><strong>Storage</strong>: For a lakehouse, use <strong>object storage</strong> (S3/MinIO/GCS/Azure Blob). 
Keep <strong>Spark event logs</strong> there so the <strong>History Server</strong> can give you post-mortems that aren&#x2019;t just vibes.</li><li><strong>Security</strong>: Put Jupyter/Zeppelin behind SSO (OIDC/Keycloak), scope access with Kubernetes <strong>RBAC</strong>, and keep secrets in a manager, not in a notebook cell.</li><li><strong>Autoscaling</strong>: Let the cluster scale node pools; let <strong>Spark dynamic allocation</strong> manage executors. Your wallet and your patience will thank you.</li><li><strong>Costs</strong>: Spot/preemptible nodes for executors, right-size memory/cores, and avoid tiny files (Parquet/Delta/Iceberg for the win).</li></ul><p>You don&#x2019;t need to implement all of this today. The point is: <strong>you&#x2019;re not stuck</strong>. The notebook you wrote for Minikube is the same notebook you&#x2019;ll run next quarter on EKS.</p><h2 id="field-notes-little-problems-you%E2%80%99ll-actually-hit">Field notes: little problems you&#x2019;ll actually hit</h2><ul><li><strong>&#x201C;Session stuck starting.&#x201D;</strong> Usually resource pressure. Either give Minikube a bit more (<code>--memory 14336</code>) or lower Spark requests/limits for the demo.</li><li><strong><code>ImagePullBackOff</code>.</strong> Your node can&#x2019;t reach the registry, or you need <code>imagePullSecrets</code>. Easy fix, don&#x2019;t overthink it.</li><li><strong>Slow reads on big datasets.</strong> You&#x2019;re paying the <strong>tiny-file tax</strong> or skipping predicate pushdown. Compact to Parquet/Delta and filter early.</li></ul><p>The good news: Ilum&#x2019;s <strong>logs and metrics</strong> make these less mysterious. You&#x2019;ll still debug&#x2014;but with tools, not folklore.</p><h2 id="do-you-actually-need-kubernetes-for-data-science">Do you actually need Kubernetes for data science?</h2><p>Strictly speaking? <strong>No.</strong> Plenty of useful analysis runs on a single machine. 
But as soon as your team grows or your data size stops being cute, <strong>Kubernetes</strong> buys you standardized environments, sane isolation, and predictable scaling. The more people share the same platform, the more those properties matter.</p><p>And the nice part is: with <strong>Ilum</strong>, moving to Spark on K8s doesn&#x2019;t require tearing up your notebooks or learning the entire Kubernetes dictionary on day one. You point <strong>Jupyter/Zeppelin</strong> at a <strong>Livy-compatible</strong> endpoint and keep going.</p><h2 id="where-to-go-next">Where to go next</h2><ul><li>Keep this demo, but try a real dataset: NYC Taxi, clickstream, retail baskets, anything columnar and not tiny.</li><li>Add <strong>Spark event logging</strong> to object storage so the <strong>History Server</strong> can tell you what actually happened.</li><li>If you&#x2019;re on a cloud provider, deploy the same chart to <strong>GKE/EKS/AKS</strong>, add ingress + TLS, and connect SSO.</li></ul><p>If you want a deeper dive, auth, storage classes, GPU pools for deep learning, or an example migration from <strong>YARN to Kubernetes, </strong>say the word and I&#x2019;ll spin up a follow-up with concrete manifests.</p><p>Copy-paste corner</p><pre><code class="language-bash"># Start local Kubernetes
minikube start --cpus 6 --memory 12288 --addons metrics-server

# Install Ilum with Jupyter, Zeppelin, and Livy proxy
helm repo add ilum https://charts.ilum.cloud
helm repo update
helm install ilum ilum/ilum \
  --set ilum-zeppelin.enabled=true \
  --set ilum-jupyter.enabled=true \
  --set ilum-livy-proxy.enabled=true

# Open Jupyter and get the token
kubectl port-forward svc/ilum-jupyter 8888:8888
kubectl logs -l app.kubernetes.io/name=ilum-jupyter

# Open Zeppelin
kubectl port-forward svc/ilum-zeppelin 8080:8080
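# --- optional: speak the Livy REST API directly (what Sparkmagic does for you) ---
# Sketch only. Assumptions: the proxy Service is named ilum-livy-proxy and
# listens on 8998, and session ids start at 0 on a fresh proxy; adjust to your install.
kubectl port-forward svc/ilum-livy-proxy 8998:8998 &

# create a PySpark session (POST /sessions)
curl -s -X POST http://localhost:8998/sessions \
  -H 'Content-Type: application/json' \
  -d '{&quot;kind&quot;: &quot;pyspark&quot;}'

# once the session state is &quot;idle&quot;, submit code (POST /sessions/{id}/statements)
curl -s -X POST http://localhost:8998/sessions/0/statements \
  -H 'Content-Type: application/json' \
  -d '{&quot;code&quot;: &quot;spark.range(100).count()&quot;}'

# fetch the statement result (GET /sessions/{id}/statements/{id})
curl -s http://localhost:8998/sessions/0/statements/0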
</code></pre><pre><code class="language-python"># Jupyter cell to test:
%load_ext sparkmagic.magics
%manage_spark  # choose the predefined Ilum endpoint, then create session

%%spark
spark.range(0, 100000).selectExpr(&quot;count(*) as n&quot;).show()
</code></pre><h2 id="a-gentle-nudge-to-try-ilum">A gentle nudge to try Ilum</h2><p>Ilum is free, <strong>cloud-native</strong>, and built to make <strong>Spark on Kubernetes</strong> practical for actual teams&#x2014;not just demo videos. You get the <strong>Livy-compatible endpoint</strong>, <strong>interactive sessions</strong>, and <strong>logs/metrics</strong> all in one place, so your notebooks feel like notebooks again.</p><ul><li><a href="https://ilum.cloud/resources/getting-started?ref=blog.ilum.cloud">https://ilum.cloud/resources/getting-started</a></li></ul>]]></content:encoded></item><item><title><![CDATA[Data Lakehouse: Transforming Enterprise Data Management]]></title><description><![CDATA[Explore the concept of the data lakehouse, a modern approach to data management that combines the benefits of data lakes and data warehouses.]]></description><link>https://blog.ilum.cloud/understanding-the-data-lakehouse/</link><guid isPermaLink="false">66f4988cebd34a00016645ff</guid><dc:creator><![CDATA[Ilum]]></dc:creator><pubDate>Sat, 23 Nov 2024 21:14:00 GMT</pubDate><media:content url="https://blog.ilum.cloud/content/images/2024/09/data-lakehouse-ferret.webp" medium="image"/><content:encoded><![CDATA[<img src="https://blog.ilum.cloud/content/images/2024/09/data-lakehouse-ferret.webp" alt="Data Lakehouse: Transforming Enterprise Data Management"><p>In recent years,&#xA0;<a href="https://ilum.cloud/data-lakehouse?ref=blog.ilum.cloud">data lakehouses</a>&#xA0;have emerged as an essential component for managing expansive data systems. Acting as the bridge between traditional data warehouses and contemporary data lakes, they bring together the strengths of both. This integration allows us to handle large data volumes efficiently and solve critical challenges faced in the data science landscape.</p><p>By blending the high-performance aspects of data warehouses with the scalability of data lakes, data lakehouses offer a unique solution. 
They address issues relating to data storage, management, and accessibility, making them indispensable in our digital era. As we explore this concept further, we&apos;ll uncover why data lakehouses are superior to the systems we once relied upon and the crucial role they play in ensuring data security and governance.</p><h3 id="key-takeaways">Key Takeaways</h3><ul><li>Data lakehouses combine features of data lakes and data warehouses.</li><li>They address major challenges in data storage and management.</li><li>Effective data governance is essential in data lakehouses.</li></ul>
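<p>To make the &quot;best of both&quot; idea concrete, here is a minimal, hypothetical PySpark sketch of lakehouse-style storage. It assumes a Spark build with the Delta Lake extension available on the classpath; any open table format would illustrate the same point:</p><pre><code class="language-python"># Sketch only: assumes the delta-spark jars are available to this Spark build.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config(&quot;spark.sql.extensions&quot;, &quot;io.delta.sql.DeltaSparkSessionExtension&quot;)
    .config(&quot;spark.sql.catalog.spark_catalog&quot;,
            &quot;org.apache.spark.sql.delta.catalog.DeltaCatalog&quot;)
    .getOrCreate())

# Warehouse-style guarantee: an ACID, schema-enforced write...
spark.range(100).write.format(&quot;delta&quot;).mode(&quot;overwrite&quot;).save(&quot;/tmp/events&quot;)

# ...on lake-style storage: plain files in cheap storage, readable anywhere.
spark.read.format(&quot;delta&quot;).load(&quot;/tmp/events&quot;).count()</code></pre>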
<!--kg-card-begin: html-->
<div id="post-table-of-contents" max-depth="h2"></div>
<!--kg-card-end: html-->
<h2 id="what-is-a-data-lakehouse">What is a Data Lakehouse?</h2><figure class="kg-card kg-embed-card"><iframe width="200" height="113" src="https://www.youtube.com/embed/FAnR4R5JMM8?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen title="What is the Data Lakehouse?"></iframe></figure><h3 id="how-does-a-lakehouse-operate">How Does a Lakehouse Operate?</h3><p>In essence, a lakehouse combines features of data lakes and data warehouses. We gain the scalability and cost advantages of a data lake while benefiting from the management and performance of a warehouse. This design enables us to carry out analytics on both structured and unstructured data within a single framework. By removing isolated data storage, lakehouses facilitate better flow and integration.</p><h3 id="tracing-the-origin-of-relational-databases">Tracing the Origin of Relational Databases</h3><figure class="kg-card kg-image-card"><img src="/blog/content/images/2024/09/The-Rise-of-Relational-Databases.webp" class="kg-image" alt="Data Lakehouse: Transforming Enterprise Data Management" loading="lazy" width="1792" height="1024" srcset="/blog/content/images/size/w600/2024/09/The-Rise-of-Relational-Databases.webp 600w,/blog/content/images/size/w1000/2024/09/The-Rise-of-Relational-Databases.webp 1000w,/blog/content/images/size/w1600/2024/09/The-Rise-of-Relational-Databases.webp 1600w,/blog/content/images/2024/09/The-Rise-of-Relational-Databases.webp 1792w" sizes="(min-width: 720px) 720px"></figure><p>Understanding the significance of a lakehouse requires a look back at the evolution of&#xA0;data management. In the 1980s, as businesses recognized the importance of insights, there emerged a need for systems that could handle extensive data. This transition led to the development of relational databases. 
They revolutionized data management by introducing SQL and ensuring data integrity with ACID properties.</p><h3 id="understanding-transaction-processing">Understanding Transaction Processing</h3><p>At its core, transaction processing manages real-time data alterations. This involves inserting, updating, or removing data swiftly and accurately. Such systems guarantee that changes are executed correctly, or no alterations occur if an error arises. This reliability is vital for critical business applications where data precision must be maintained.</p><h3 id="from-warehouses-to-new-horizons">From Warehouses to New Horizons</h3><p>Initially, data warehouses were tailored for fixed data formats. They excelled at detailed analytics but struggled as diverse data sources emerged. Their rigid structure proved expensive and inefficient for agile&#xA0;<a href="https://ilum.cloud/blog/?ref=blog.ilum.cloud">data analytics</a>&#xA0;needs. As businesses expanded, so did their data requirements, prompting the advent of large-scale data storage solutions.</p><h3 id="the-arrival-of-data-lakes">The Arrival of Data Lakes</h3><figure class="kg-card kg-image-card"><img src="/blog/content/images/2024/09/Introduction-of-Data-Lakes.webp" class="kg-image" alt="Data Lakehouse: Transforming Enterprise Data Management" loading="lazy" width="1792" height="1024" srcset="/blog/content/images/size/w600/2024/09/Introduction-of-Data-Lakes.webp 600w,/blog/content/images/size/w1000/2024/09/Introduction-of-Data-Lakes.webp 1000w,/blog/content/images/size/w1600/2024/09/Introduction-of-Data-Lakes.webp 1600w,/blog/content/images/2024/09/Introduction-of-Data-Lakes.webp 1792w" sizes="(min-width: 720px) 720px"></figure><p>Data lakes transformed how extensive data collections were managed. These solutions allowed organizations to store vast raw data without immediate organization, catering to diverse inputs like web logs and IoT feeds. 
A key advantage was the low cost of storage, although maintaining&#xA0;<a href="https://ilum.cloud/product/features?ref=blog.ilum.cloud">data quality</a>&#xA0;and reliability were challenges that arose.</p><h3 id="what-is-a-data-lake">What is a Data Lake?</h3><p>A data lake serves as a vast repository where raw data is stored until needed. Unlike warehouses requiring pre-organization, data lakes adopt a &quot;schema-on-read&quot; approach. This flexibility is beneficial for data scientists and analysts, allowing examination and interpretation without fixed structures.</p><h3 id="benefits-of-large-data-repositories">Benefits of Large Data Repositories</h3><ul><li><strong>Scalability</strong>: They manage substantial data without significant infrastructure changes.</li><li><strong>Cost Efficiency</strong>: Storage in data lakes is more affordable, reducing operational expenses.</li><li><strong>Diverse Data Support</strong>: They accommodate structured, semi-structured, and unstructured data effectively, making them versatile for various analytics needs.</li></ul><p>By evolving from traditional systems while incorporating the versatility of lakes, the lakehouse concept provides a modern approach to managing and analyzing data, merging the best of both foundational methods.</p><h2 id="recap-from-data-lake-to-data-swamp">Recap: From Data Lake to Data Swamp</h2><figure class="kg-card kg-image-card"><img src="/blog/content/images/2024/09/From-Data-Lake-to-Data-Swamp.webp" class="kg-image" alt="Data Lakehouse: Transforming Enterprise Data Management" loading="lazy" width="1792" height="1024" srcset="/blog/content/images/size/w600/2024/09/From-Data-Lake-to-Data-Swamp.webp 600w,/blog/content/images/size/w1000/2024/09/From-Data-Lake-to-Data-Swamp.webp 1000w,/blog/content/images/size/w1600/2024/09/From-Data-Lake-to-Data-Swamp.webp 1600w,/blog/content/images/2024/09/From-Data-Lake-to-Data-Swamp.webp 1792w" sizes="(min-width: 720px) 720px"></figure><p>Building a good data 
lakehouse definitely has its challenges. In the beginning, businesses were all in on data lakes, thinking they&#x2019;d be the magic solution to all their storage problems. But without proper management, these lakes can turn into data swamps, where it&#x2019;s way harder to dig out anything useful.</p><h3 id="what-exactly-is-a-data-swamp">What Exactly is a Data Swamp?</h3><p>When businesses first embraced data lakes, they hoped for an ideal solution to their storage issues. But without proper structure and oversight, these data lakes can become chaotic data collections, or swamps. In such a state, finding useful information becomes a challenge. Here are some of the problems:</p><ul><li><strong>Duplicate Data</strong>: Copies of data can accumulate, leading to confusion and higher storage costs.</li><li><strong>Poor Data Quality</strong>: Inaccurate data leads to wrong decisions, impacting overall business performance.</li><li><strong>Regulatory Issues</strong>: Mismanaged data can mean failing to meet legal&#xA0;<a href="https://ilum.cloud/resources/support?ref=blog.ilum.cloud">data protection</a>&#xA0;standards.</li></ul><p>Data silos and data staleness often emerge from these disorganized repositories, leading to isolated datasets and outdated information which further hamper our ability to make timely decisions.</p><h3 id="characteristics-of-a-data-lakehouse">Characteristics of a Data Lakehouse</h3><figure class="kg-card kg-image-card"><img src="/blog/content/images/2024/09/The-Significance-of-Data-Lakehouse.webp" class="kg-image" alt="Data Lakehouse: Transforming Enterprise Data Management" loading="lazy" width="1792" height="1024" srcset="/blog/content/images/size/w600/2024/09/The-Significance-of-Data-Lakehouse.webp 600w,/blog/content/images/size/w1000/2024/09/The-Significance-of-Data-Lakehouse.webp 1000w,/blog/content/images/size/w1600/2024/09/The-Significance-of-Data-Lakehouse.webp 1600w,/blog/content/images/2024/09/The-Significance-of-Data-Lakehouse.webp 
1792w" sizes="(min-width: 720px) 720px"></figure><p>To counter these issues, the data lakehouse concept emerged, offering a more balanced approach to data management. This system allows us to store vast amounts of raw data, providing flexibility for analysts and data scientists. Unlike older systems, it aligns with modern data science and machine learning needs, facilitating advanced analytics.</p><p>The data lakehouse combines elements from both data lakes and warehouses. Let&#x2019;s explore its features:</p><ul><li><strong>Reliable Transactions</strong>: Supports transactions, ensuring data is accurate and dependable.</li><li><strong>Structured Data</strong>: Uses schema enforcement to keep data organized and reliable.</li><li><strong>Separate Storage and Processing</strong>: Decouples storage and compute, optimizing efficiency.</li><li><strong>Flexible Formats</strong>: Compatible with open table formats like Delta, Iceberg, and Hudi.</li><li><strong>Versatile Data Handling</strong>: Handles structured, semi-structured, and unstructured data.</li><li><strong>Real-Time Streaming</strong>: Fully supports streaming, enabling up-to-date analytics.</li></ul><p>These features address the limitations of traditional systems, allowing us to work with data more effectively. 
By capitalizing on these strengths, we can position ourselves well in an increasingly data-driven world.</p><h2 id="data-governance-in-data-lakehouses">Data Governance in Data Lakehouses</h2><figure class="kg-card kg-image-card"><img src="/blog/content/images/2024/09/Data-Governance-in-Data-Lakehouses.webp" class="kg-image" alt="Data Lakehouse: Transforming Enterprise Data Management" loading="lazy" width="1792" height="1024" srcset="/blog/content/images/size/w600/2024/09/Data-Governance-in-Data-Lakehouses.webp 600w,/blog/content/images/size/w1000/2024/09/Data-Governance-in-Data-Lakehouses.webp 1000w,/blog/content/images/size/w1600/2024/09/Data-Governance-in-Data-Lakehouses.webp 1600w,/blog/content/images/2024/09/Data-Governance-in-Data-Lakehouses.webp 1792w" sizes="(min-width: 720px) 720px"></figure><p>Data governance in a lakehouse setup is crucial for maintaining accuracy, accessibility, and security, while also complying with regulations. We ensure that our data remains reliable by focusing on several aspects:</p><ul><li><strong>Data Catalog</strong>: We organize all data and metadata, allowing for easy discovery and retrieval.</li><li><strong>Accountability and Quality</strong>: Our&#xA0;<a href="https://ilum.cloud/product/about-us?ref=blog.ilum.cloud">data stewards</a>&#xA0;are responsible for maintaining data quality and consistency.</li><li><strong>Controlled Access</strong>: By implementing role-based access, we make sure only authorized individuals can view sensitive information.</li></ul><p>These practices help us maintain a flexible and interoperable data environment, ensuring privacy and consistency.</p><h2 id="comparing-data-lakehouses-and-data-warehouses">Comparing Data Lakehouses and Data Warehouses</h2><p>The architecture of a data lakehouse offers unique advantages over traditional data warehouses. 
While warehouses are tailored for structured data and excel in analytics, lakehouses provide flexibility by allowing both structured and unstructured data to coexist. This approach gives organizations the ability to leverage diverse data types efficiently.</p><p><strong>Key Differences:</strong></p><ul><li><strong>Data Storage:</strong>&#xA0;Warehouses require data to be structured before storage, while lakehouses can keep raw data, processing it as needed.</li><li><strong>Query Performance:</strong>&#xA0;Warehouses excel in complex structured data queries, whereas lakehouses support varied data types with faster queries using tools like Apache Spark.</li><li><strong>Cost:</strong>&#xA0;Lakehouses often use economical storage, reducing costs compared to the high-performance storage required by warehouses.</li><li><strong>Scalability:</strong>&#xA0;Lakehouses scale easily with additional storage nodes, unlike warehouses that have scalability limits as data sizes increase.</li></ul><h3 id="schema-evolution-in-data-lakehouses">Schema Evolution in Data Lakehouses</h3><figure class="kg-card kg-image-card kg-card-hascaption"><img src="/blog/content/images/2024/09/Schema-Evolution-in-Data-Lakehouses.webp" class="kg-image" alt="Data Lakehouse: Transforming Enterprise Data Management" loading="lazy" width="1792" height="1024" srcset="/blog/content/images/size/w600/2024/09/Schema-Evolution-in-Data-Lakehouses.webp 600w,/blog/content/images/size/w1000/2024/09/Schema-Evolution-in-Data-Lakehouses.webp 1000w,/blog/content/images/size/w1600/2024/09/Schema-Evolution-in-Data-Lakehouses.webp 1600w,/blog/content/images/2024/09/Schema-Evolution-in-Data-Lakehouses.webp 1792w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Ilum - Schema Evolution in Data Lakehouses</span></figcaption></figure><p>Schema evolution is very important because it lets businesses adjust their data setup without messing up their current workflows. 
And honestly, in today&#x2019;s fast-moving data world, that kind of flexibility is a must.</p><h3 id="embracing-new-standards">Embracing New Standards</h3><p>Previously, changing database schemas, such as adding columns or altering structures, was complicated and could lead to downtime. With lakehouses, schema changes are straightforward and built into the system. This enables our teams to adapt quickly to new data requirements, maintaining efficient operations.</p><h3 id="making-the-system-effective">Making the System Effective</h3><ul><li><strong>Version Control:</strong>&#xA0;We track dataset versions to accommodate changes while supporting older formats.</li><li><strong>Automated Schema Recognition:</strong>&#xA0;Employing tools that detect schema alterations ensures our&#xA0;<a href="https://ilum.cloud/hadoop-migration?ref=blog.ilum.cloud">data processing</a>&#xA0;workflows remain fluid.</li><li><strong>Data Scrutiny:</strong>&#xA0;By implementing validation rules, we ensure any incoming data conforms to expected formats, preventing processing issues.</li></ul><p>Using these strategies, we can make our data systems more responsive and robust, handling the evolving demands of data management effectively.</p><h2 id="keeping-your-data-secure-and-ready-why-its-important">Keeping Your Data Secure and Ready: Why It&apos;s Important</h2><figure class="kg-card kg-image-card"><img src="/blog/content/images/2024/10/ilum-data-safety.webp" class="kg-image" alt="Data Lakehouse: Transforming Enterprise Data Management" loading="lazy" width="1792" height="1024" srcset="/blog/content/images/size/w600/2024/10/ilum-data-safety.webp 600w,/blog/content/images/size/w1000/2024/10/ilum-data-safety.webp 1000w,/blog/content/images/size/w1600/2024/10/ilum-data-safety.webp 1600w,/blog/content/images/2024/10/ilum-data-safety.webp 1792w" sizes="(min-width: 720px) 720px"></figure><h3 id="the-role-of-cloud-storage">The Role of Cloud Storage</h3><p>Cloud object storage plays a vital role in 
ensuring our data stays safe and accessible. This type of storage keeps our digital assets&#x2014;whether structured business data or varied media files&#x2014;well-organized and secure. Features such as backups and versioning are essential because they offer peace of mind. If any data becomes corrupted or lost, we can swiftly restore it, helping us avoid potential disruptions.</p><h3 id="flexible-open-data-formats">Flexible Open Data Formats</h3><p>Open data standards are crucial for data flexibility. By using formats like Parquet or ORC, we ensure our data remains adaptable. This way, we&apos;re not tied to a single tool or provider, which means we can adjust our systems as needed. This flexibility is key to making sure our data can be utilized efficiently across different platforms and tools.</p><h3 id="business-benefits-of-reliable-data-management">Business Benefits of Reliable Data Management</h3><p>A well-structured data environment using cloud object storage and open formats is advantageous for any business. It guarantees our business data is both secure and accessible when needed. Whether we manage structured data sets or varied media content, we gain the flexibility and reliability necessary for our operations. As our business evolves or the volume of data grows, having a setup that adapts to these changes is essential. This approach ensures we can keep pace with our data needs and maintain smooth business operations.</p><h2 id="the-future-of-data-lakehouses">The Future of Data Lakehouses</h2><p>Data architecture is continuing to grow and adapt to the increasing demands of data analytics and data science. 
As more companies dive into AI and machine learning, having a solid and flexible data setup is going to be crucial.</p><h3 id="connecting-with-ai-and-machine-learning">Connecting with AI and Machine Learning</h3><figure class="kg-card kg-image-card"><img src="/blog/content/images/2024/09/The-Future-of-Data-Lakehouses.webp" class="kg-image" alt="Data Lakehouse: Transforming Enterprise Data Management" loading="lazy" width="1792" height="1024" srcset="/blog/content/images/size/w600/2024/09/The-Future-of-Data-Lakehouses.webp 600w,/blog/content/images/size/w1000/2024/09/The-Future-of-Data-Lakehouses.webp 1000w,/blog/content/images/size/w1600/2024/09/The-Future-of-Data-Lakehouses.webp 1600w,/blog/content/images/2024/09/The-Future-of-Data-Lakehouses.webp 1792w" sizes="(min-width: 720px) 720px"></figure><p>Data lakehouses provide a strong foundation for tasks like&#xA0;<a href="https://ilum.cloud/get-access?ref=blog.ilum.cloud">machine learning</a>. By merging structured and unstructured data on a single platform, we can streamline the workflow of data scientists. This setup helps in both developing and deploying machine learning models effectively, enhancing our&#xA0;data science&#xA0;capabilities.</p><h3 id="what-lies-ahead">What Lies Ahead?</h3><p>With ongoing tech progress, data lakehouses will continue to evolve. We anticipate enhancements such as automated data governance, improved security measures, and performance-boosting tools. 
These updates will reinforce the role of data lakehouses in&#xA0;<a href="https://www.dataversity.net/data-lakehouses-the-future-of-data-migration/?ref=blog.ilum.cloud">modern data strategies</a>, ensuring they remain integral to our efforts in managing and analyzing data efficiently.</p><figure class="kg-card kg-image-card"><a href="https://ilum.cloud/?ref=blog.ilum.cloud"><img src="/blog/content/images/2024/09/ilum-logo2-2.svg" class="kg-image" alt="Data Lakehouse: Transforming Enterprise Data Management" loading="lazy" width="147" height="58"></a></figure><h2 id="why-ilum-is-a-perfect-example-of-a-well-defined-data-lakehouse">Why Ilum is a Perfect Example of a Well-Defined Data Lakehouse</h2><figure class="kg-card kg-embed-card"><iframe width="200" height="113" src="https://www.youtube.com/embed/6d27js-FNHU?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen title="Ilum - Modular Data Lakehouse for a Cloud Native World"></iframe></figure><p>Ilum embodies what a data lakehouse should be, harmonizing the versatility of data lakes with the comprehensive control of data warehouses. Let&apos;s delve into the reasons why Ilum stands out in this space.</p><ul><li><strong>Unified Multi-Cluster Management</strong><br>Our platform simplifies the management of multiple Spark clusters whether they are cloud-based or on-premise. This feature ensures seamless data handling across different environments.</li><li><strong>Kubernetes and Hadoop Flexibility</strong><br>Ilum supports both Kubernetes and Hadoop Yarn, offering businesses the choice to manage their Spark clusters in a way that suits them best. 
This flexibility empowers companies to transition from traditional Hadoop setups to modern, cloud-native environments, adapting to today&apos;s technology-driven landscape.</li><li><strong>Interactive Spark Sessions and&#xA0;</strong><a href="https://ilum.cloud/docs/api/?ref=blog.ilum.cloud"><strong>REST API</strong></a><br>By utilizing our REST API for Spark jobs, Ilum enhances interactivity, allowing for real-time data operations. This not only elevates the data platform experience but also enables the creation of dynamic applications that respond instantly to user requests&#x2014;an essential feature for advanced data lakehouses.</li><li><strong>Open-Source and Free Accessibility</strong><br>A remarkable trait of Ilum is its&#xA0;<a href="https://ilum.cloud/pricing?ref=blog.ilum.cloud">cost-efficiency</a>, as it is available at no expense. Utilizing open-source tools such as Apache Spark, Jupyter, and Apache Ranger, Ilum avoids vendor lock-in, making it an attractive option for startups and enterprises alike to explore data lakehouse architecture without hefty costs.</li></ul><p>The strengths of Ilum lie in its scalability, flexibility, real-time interactivity, and affordability. It caters to those who seek a well-architected data lakehouse that doesn&apos;t compromise performance or governance. Embracing Ilum&apos;s advanced features empowers us to fully leverage the potential of a modern data lakehouse solution, truly blending the benefits of both data lakes and warehouses.</p><h2 id="frequently-asked-questions">Frequently Asked Questions</h2><h3 id="what-are-the-main-components-of-a-data-lakehouse">What are the Main Components of a Data Lakehouse?</h3><p>Data lakehouses combine elements of both data lakes and data warehouses. 
Key components include a storage layer that handles large volumes of structured and unstructured data, a processing layer for executing data queries and transformations, and a management layer to maintain data organization and governance.</p><h3 id="how-does-data-lakehouse-performance-compare-to-traditional-data-warehouses">How Does Data Lakehouse Performance Compare to Traditional Data Warehouses?</h3><p>Data lakehouses often have enhanced performance due to their capability to handle diverse data types and perform complex queries. They integrate the flexible storage from data lakes with the efficient query performance of data warehouses, offering a balanced approach to data storage and computation.</p><h3 id="what-are-the-advantages-of-using-a-data-lakehouse-for-data-analysis">What are the Advantages of Using a Data Lakehouse for Data Analysis?</h3><p>Using a data lakehouse can streamline data analytics by providing a single platform that supports both storage and analytics. This integration reduces data movement and duplication, enabling faster insights and more efficient data management. Moreover, data lakehouses offer scalability and flexibility, essential for handling large data sets.</p><h3 id="what-tools-and-technologies-are-common-in-building-a-data-lakehouse">What Tools and Technologies Are Common in Building a Data Lakehouse?</h3><p>Common tools include Apache Spark for processing large data sets and Delta Lake for offering reliable data indexing and version control. Technologies like cloud storage services and data governance tools are integral in managing large-scale data lakehouses efficiently.</p><h3 id="how-do-data-lakehouses-manage-data-security-and-governance">How Do Data Lakehouses Manage Data Security and Governance?</h3><p>Data governance and security are managed by implementing robust authentication protocols, encryption techniques, and data masking. 
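As a toy illustration of the masking part only (an invented helper, not how Ilum or governance tools such as Apache Ranger actually apply policies), masking typically hides most of a sensitive value while keeping a recognizable suffix:

```python
def mask(value: str, visible: int = 4, fill: str = "*") -> str:
    """Hide all but the last `visible` characters of a sensitive value.

    Illustrative only: production systems enforce masking through
    governance policies, not ad-hoc helper functions.
    """
    if len(value) <= visible:
        return fill * len(value)  # too short to reveal anything safely
    return fill * (len(value) - visible) + value[-visible:]
```

For example, `mask("4111111111111111")` yields `************1111`, enough to confirm a card number without exposing it.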
This ensures that only authorized users can access sensitive information, safeguarding the data integrity and privacy within the lakehouse environment.</p><h3 id="when-is-a-data-lakehouse-preferred-over-a-data-lake">When is a Data Lakehouse Preferred Over a Data Lake?</h3><p>A data lakehouse is preferred when there is a need to support both analytics workloads and traditional operational query workloads on diverse data types. It is ideal for organizations requiring a unified system that reduces data silos and simplifies data management processes.</p>]]></content:encoded></item><item><title><![CDATA[Deploying PySpark Microservice on Kubernetes: Revolutionizing Data Lakes with Ilum.]]></title><description><![CDATA[<p></p><p>Greetings Ilum enthusiasts and Python fans! We&apos;re thrilled to unveil a new, eagerly expected feature that&apos;s set to empower your data science journey - full Python support in Ilum. For those in the data world, Python and Apache Spark have long been an iconic duo, seamlessly</p>]]></description><link>https://blog.ilum.cloud/deploying-pyspark-microservices-on-kubernetes-revolutionizing-data-lakes/</link><guid isPermaLink="false">64bee5dba5976e0001aaa6f4</guid><category><![CDATA[News]]></category><dc:creator><![CDATA[Ilum]]></dc:creator><pubDate>Thu, 27 Jul 2023 13:00:55 GMT</pubDate><media:content url="https://blog.ilum.cloud/content/images/2023/07/ilum-ferret3.png" medium="image"/><content:encoded><![CDATA[<img src="https://blog.ilum.cloud/content/images/2023/07/ilum-ferret3.png" alt="Deploying PySpark Microservice on Kubernetes: Revolutionizing Data Lakes with Ilum."><p></p><p>Greetings Ilum enthusiasts and Python fans! We&apos;re thrilled to unveil a new, eagerly expected feature that&apos;s set to empower your data science journey - full Python support in Ilum. For those in the data world, Python and Apache Spark have long been an iconic duo, seamlessly handling vast volumes of data and complex computations. 
And now, with Ilum&apos;s latest upgrade, you can harness the power of Python right inside your favourite data lake environment.</p><p>This blog post is your guided tour to exploring this feature. We&apos;ll kick things off with a simple Apache Spark job written in Python, run it on Ilum, and then dive deeper. We&apos;ll transform the initial code to support an interactive mode, offering you direct access to the Spark job via Ilum&apos;s API. By the end of this journey, you&apos;ll have a Python-based microservice responding to API calls, all running smoothly on Ilum.</p><p>So, are you ready to enhance your data game with Python and Ilum? Let&apos;s get started.</p><p>All examples are available in our <a href="https://github.com/ilum-cloud/ilum-python-examples?ref=blog.ilum.cloud">GitHub repository</a>.</p><h2 id="step-1-writing-a-simple-apache-spark-job-in-python">Step 1: Writing a Simple Apache Spark Job in Python.</h2><p>Before we embark on our Python journey with Ilum, we need to ensure our environment is well-equipped. To run a Spark job, you need to have Ilum and PySpark installed. You can use pip, the Python package installer, to set up PySpark. Make sure you&apos;re using Python &gt;=3.9.</p><pre><code class="language-bash">pip install pyspark</code></pre><p>For setting up and accessing Ilum, please follow the guidelines provided <a href="https://ilum.cloud/blog/spark-on-kubernetes/?ref=blog.ilum.cloud">here</a>.</p><h3 id="11-sparkpi-example">1.1 SparkPi example.</h3><p>Now, let&apos;s dive into writing our Spark job. We&apos;ll start with a simple SparkPi example:</p><pre><code class="language-python">import sys
from random import random
from operator import add

from pyspark.sql import SparkSession

if __name__ == &quot;__main__&quot;:
    spark = SparkSession \
        .builder \
        .appName(&quot;PythonPi&quot;) \
        .getOrCreate()

    partitions = int(sys.argv[1]) if len(sys.argv) &gt; 1 else 2
    n = 100000 * partitions

    def f(_: int) -&gt; float:
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 &lt;= 1 else 0

    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print(&quot;Pi is roughly %f&quot; % (4.0 * count / n))

    spark.stop()
</code></pre><p>Save this script as <strong>ilum_python_simple.py</strong></p><p>With our Spark job ready, it&apos;s time to run it on Ilum. Ilum offers the capability to submit jobs using the Ilum UI or through the REST API.</p><p>Let&apos;s start with the UI, using the <strong>single job feature</strong>.</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2023/07/ilum_python_single.png" class="kg-image" alt="Deploying PySpark Microservice on Kubernetes: Revolutionizing Data Lakes with Ilum." loading="lazy" width="490" height="810"></figure><p>We can achieve the same thing with the <a href="https://ilum.cloud/docs/api/?ref=blog.ilum.cloud#tag/Jobs/operation/stop%20job">API</a>, but first, we need to expose the ilum-core API with a port forward.</p><pre><code class="language-bash">kubectl port-forward svc/ilum-core 9888:9888</code></pre><p>With the port exposed, we can make an API call.</p><figure class="kg-card kg-code-card"><pre><code class="language-bash">curl -X POST &apos;localhost:9888/api/v1/job/submit&apos; \
        --form &apos;name=&quot;ilumSimplePythonJob&quot;&apos; \
        --form &apos;clusterName=&quot;default&quot;&apos; \
        --form &apos;jobClass=&quot;ilum_python_simple&quot;&apos; \
        --form &apos;args=&quot;10&quot;&apos; \
        --form &apos;pyFiles=@&quot;/path/to/ilum_python_simple.py&quot;&apos; \
        --form &apos;language=&quot;PYTHON&quot;&apos;
</code></pre><figcaption><p><span style="white-space: pre-wrap;">API call</span></p></figcaption></figure><p>As a result, we will receive the ID of the created job.</p><figure class="kg-card kg-code-card"><pre><code class="language-json">{&quot;jobId&quot;:&quot;20230724-1154-m78f3gmlo5j&quot;}</code></pre><figcaption><p><span style="white-space: pre-wrap;">Result</span></p></figcaption></figure><p>To check the logs of the job, we can make an API call to</p><figure class="kg-card kg-code-card"><pre><code class="language-bash">curl localhost:9888/api/v1/job/20230724-1154-m78f3gmlo5j/logs</code></pre><figcaption><p><span style="white-space: pre-wrap;">API call</span></p></figcaption></figure><p>And that&apos;s it! You&apos;ve written and run a simple Python Spark job on Ilum. Let&apos;s look at a slightly more advanced example that needs additional Python libraries.</p><h3 id="12-job-example-with-numpy">1.2 Job example with numpy.</h3><p>In this section, we&apos;ll go over a practical example of a Spark job written in Python. This job involves reading a dataset, processing it, training a machine learning model on it, and saving the predictions. We&apos;re going to use a <strong>Tel-churn.csv</strong> file, which you can find in our <a href="https://github.com/ilum-cloud/ilum-python-examples/blob/main/Tel-churn.csv?ref=blog.ilum.cloud">GitHub repository</a>. To make things easy, we&apos;ve uploaded this file to a bucket named ilum-files in the built-in instance of MinIO, which is automatically accessible from the Ilum instance. This means you won&apos;t have to worry about configuring any access for this example - Ilum has got it covered. However, if you ever want to fetch data from a different bucket or use Amazon S3 in your own projects, you&apos;ll need to configure access accordingly.</p><p>Now that we&apos;ve got our data ready, let&apos;s get started with writing our Spark job in Python. 
Here is the full code example:</p><pre><code class="language-python">from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

if __name__ == &quot;__main__&quot;:

    spark = SparkSession \
        .builder \
        .appName(&quot;IlumAdvancedPythonExample&quot;) \
        .getOrCreate()
    
    df = spark.read.csv(&apos;s3a://ilum-files/Tel-churn.csv&apos;, header=True, inferSchema=True)

    categoricalColumns = [&apos;gender&apos;, &apos;Partner&apos;, &apos;Dependents&apos;, &apos;PhoneService&apos;, &apos;MultipleLines&apos;, &apos;InternetService&apos;,
                          &apos;OnlineSecurity&apos;, &apos;OnlineBackup&apos;, &apos;DeviceProtection&apos;, &apos;TechSupport&apos;, &apos;StreamingTV&apos;,
                          &apos;StreamingMovies&apos;, &apos;Contract&apos;, &apos;PaperlessBilling&apos;, &apos;PaymentMethod&apos;]

    stages = []

    for categoricalCol in categoricalColumns:
        stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + &quot;Index&quot;)
        stages += [stringIndexer]

    label_stringIdx = StringIndexer(inputCol=&quot;Churn&quot;, outputCol=&quot;label&quot;)
    stages += [label_stringIdx]

    numericCols = [&apos;SeniorCitizen&apos;, &apos;tenure&apos;, &apos;MonthlyCharges&apos;]

    assemblerInputs = [c + &quot;Index&quot; for c in categoricalColumns] + numericCols
    assembler = VectorAssembler(inputCols=assemblerInputs, outputCol=&quot;features&quot;)
    stages += [assembler]

    pipeline = Pipeline(stages=stages)
    pipelineModel = pipeline.fit(df)
    df = pipelineModel.transform(df)

    train, test = df.randomSplit([0.7, 0.3], seed=42)

    lr = LogisticRegression(featuresCol=&quot;features&quot;, labelCol=&quot;label&quot;, maxIter=10)
    lrModel = lr.fit(train)

    predictions = lrModel.transform(test)

    predictions.select(&quot;customerID&quot;, &quot;label&quot;, &quot;prediction&quot;).show(5)
    predictions.select(&quot;customerID&quot;, &quot;label&quot;, &quot;prediction&quot;).write.option(&quot;header&quot;, &quot;true&quot;) \
        .csv(&apos;s3a://ilum-files/predictions&apos;)

    spark.stop()
</code></pre><p>Let&apos;s dive into the code:</p><pre><code class="language-python">from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
</code></pre><p>Here, we&apos;re importing the necessary PySpark modules to create a Spark session, build a machine learning pipeline, preprocess the data, and run a Logistic Regression model.</p><pre><code class="language-python">spark = SparkSession \
    .builder \
    .appName(&quot;IlumAdvancedPythonExample&quot;) \
    .getOrCreate()
</code></pre><p>We initialize a <code>SparkSession</code>, which is the entry point to any functionality in Spark. This is where we set the application name that will appear on the Spark web UI.</p><pre><code class="language-python">df = spark.read.csv(&apos;s3a://ilum-files/Tel-churn.csv&apos;, header=True, inferSchema=True)
</code></pre><p>We&apos;re reading a CSV file stored in a MinIO bucket. The <code>header=True</code> option tells Spark to use the first row of the CSV file as headers, while <code>inferSchema=True</code> makes Spark automatically determine the data type of each column.</p><pre><code class="language-python">categoricalColumns = [&apos;gender&apos;, &apos;Partner&apos;, &apos;Dependents&apos;, &apos;PhoneService&apos;, &apos;MultipleLines&apos;, &apos;InternetService&apos;,
                      &apos;OnlineSecurity&apos;, &apos;OnlineBackup&apos;, &apos;DeviceProtection&apos;, &apos;TechSupport&apos;, &apos;StreamingTV&apos;,
                      &apos;StreamingMovies&apos;, &apos;Contract&apos;, &apos;PaperlessBilling&apos;, &apos;PaymentMethod&apos;]
</code></pre><p>We specify the columns in our data that are categorical. These will be transformed later using a StringIndexer.</p><pre><code class="language-python">stages = []

for categoricalCol in categoricalColumns:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + &quot;Index&quot;)
    stages += [stringIndexer]
</code></pre><p>Here, we&apos;re iterating over our list of categorical columns and creating a StringIndexer for each. StringIndexers encode categorical string columns into a column of indices. The transformed column will be named after the original column, with &quot;Index&quot; appended.</p><pre><code class="language-python">numericCols = [&apos;SeniorCitizen&apos;, &apos;tenure&apos;, &apos;MonthlyCharges&apos;]

assemblerInputs = [c + &quot;Index&quot; for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol=&quot;features&quot;)
stages += [assembler]
</code></pre><p>Here we prepare the data for our machine learning model. We create a VectorAssembler which will take all our feature columns (both categorical and numerical) and assemble them into a single vector column. This is a requirement for most machine learning algorithms in Spark.</p><pre><code class="language-python">train, test = df.randomSplit([0.7, 0.3], seed=42)
</code></pre><p>We split our data into a training set and a test set, with 70% of the data for training and the remaining 30% for testing.</p><pre><code class="language-python">lr = LogisticRegression(featuresCol=&quot;features&quot;, labelCol=&quot;label&quot;, maxIter=10)
lrModel = lr.fit(train)
</code></pre><p>We train a Logistic Regression model on our training data.</p><pre><code class="language-python">predictions = lrModel.transform(test)

predictions.select(&quot;customerID&quot;, &quot;label&quot;, &quot;prediction&quot;).show(5)
predictions.select(&quot;customerID&quot;, &quot;label&quot;, &quot;prediction&quot;).write.option(&quot;header&quot;, &quot;true&quot;) \
    .csv(&apos;s3a://ilum-files/predictions&apos;)</code></pre><p>Lastly, we use our trained model to make predictions on our test set, displaying the first 5 predictions. Then we write these predictions back to our MinIO bucket.</p><p>Save this script as <strong>ilum_python_advanced.py</strong></p><p>pyspark.ml uses numpy as a dependency, which is not installed by default, so we need to specify it as a requirement.</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2023/07/ilum-advanced-python-single.png" class="kg-image" alt="Deploying PySpark Microservice on Kubernetes: Revolutionizing Data Lakes with Ilum." loading="lazy" width="488" height="806"></figure><p>And the same thing can be done through the API.</p><figure class="kg-card kg-code-card"><pre><code class="language-bash">curl -X POST &apos;localhost:9888/api/v1/job/submit&apos; \
        --form &apos;name=&quot;IlumAdvancedPythonExample&quot;&apos; \
        --form &apos;clusterName=&quot;default&quot;&apos; \
        --form &apos;jobClass=&quot;ilum_python_advanced&quot;&apos; \
        --form &apos;pyRequirements=&quot;numpy&quot;&apos; \
        --form &apos;pyFiles=@&quot;/path/to/ilum_python_advanced.py&quot;&apos; \
        --form &apos;language=&quot;PYTHON&quot;&apos;
</code></pre><figcaption><p><span style="white-space: pre-wrap;">API call</span></p></figcaption></figure><p>In the next sections, we&apos;ll transform both Python scripts into interactive Spark jobs, taking full advantage of Ilum&apos;s capabilities.</p><h2 id="step-2-transitioning-to-interactive-mode">Step 2: Transitioning to Interactive Mode</h2><p>Interactive mode is an exciting feature that makes Spark development more dynamic, giving you the capability to run, interact with, and control your Spark jobs in real time. It&apos;s designed for those who seek more direct control over their Spark applications.</p><p>Think of Interactive mode as having a direct conversation with your Spark job. You can feed in data, request transformations, and fetch results - all in real time. This drastically enhances the agility and capability of your data processing pipeline, making it more adaptable and responsive to changing requirements.</p><p>Now that we&apos;re familiar with creating a basic Spark job in Python, let&apos;s take things a step further by transforming our job into an interactive one that can take advantage of Ilum&apos;s real-time capabilities.</p><h3 id="21-sparkpi-example">2.1 SparkPi example.</h3><p>To illustrate how to transition our job to Interactive mode, we will adjust our earlier <strong>ilum_python_simple.py</strong> script.</p><pre><code class="language-python">from random import random
from operator import add

from ilum.api import IlumJob


class SparkPiInteractiveExample(IlumJob):

    def run(self, spark, config):
        partitions = int(config.get(&apos;partitions&apos;, &apos;5&apos;))
        n = 100000 * partitions

        def f(_: int) -&gt; float:
            x = random() * 2 - 1
            y = random() * 2 - 1
            return 1 if x ** 2 + y ** 2 &lt;= 1 else 0

        count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)

        return &quot;Pi is roughly %f&quot; % (4.0 * count / n)
</code></pre><p>Save this as <strong>ilum_python_simple_interactive.py</strong></p><p>There are just a few differences from the original SparkPi.</p><p>1.<strong>	Ilum package</strong></p><p>To start off, we import the <code>IlumJob</code> class from the ilum package, which serves as a base class for our interactive job.</p><p>We can install the ilum package with:</p><pre><code class="language-bash">pip install ilum</code></pre><p>2.	<strong>Spark job in a class</strong></p><p>The Spark job logic is encapsulated in a class that extends <code>IlumJob</code>, particularly within its <code>run</code> method.</p><pre><code class="language-python">class SparkPiInteractiveExample(IlumJob):
    def run(self, spark, config):
        # Job logic here
</code></pre><p>Wrapping the job logic in a class is essential for the Ilum framework to handle the job and its resources. This also makes the job stateless and reusable.</p><p>3.<strong>	Parameters are handled differently:</strong></p><p>We take all arguments from the config dictionary:</p><pre><code class="language-python">partitions = int(config.get(&apos;partitions&apos;, &apos;5&apos;))
</code></pre><p>This shift allows for more dynamic parameter passing and integrates with Ilum&apos;s configuration handling.</p><p>4.	<strong>The result is returned instead of printed:</strong></p><p>The result is returned from the <code>run</code> method.</p><pre><code class="language-python">return &quot;Pi is roughly %f&quot; % (4.0 * count / n)
</code></pre><p>By returning the result, Ilum can handle it in a more flexible way. For instance, Ilum could serialize the result and make it accessible via an API call.</p><p>5.	<strong>No need to manually manage the Spark session</strong></p><p>Ilum manages the Spark session for us. It&apos;s automatically injected into the <code>run</code> method, and we don&apos;t need to stop it manually.</p><pre><code class="language-python">def run(self, spark, config):
</code></pre><p>These changes highlight the transition from a standalone Spark job to an interactive Ilum job. The goal is to improve the flexibility and reusability of the job, making it more suited for dynamic, interactive, and on-the-fly computations.</p><p>Adding an interactive Spark job is handled with the &apos;new group&apos; function.</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2023/07/SparkPiInteractiveExample.png" class="kg-image" alt="Deploying PySpark Microservice on Kubernetes: Revolutionizing Data Lakes with Ilum." loading="lazy" width="490" height="863"></figure><p>The job is then executed with the interactive job function in the UI.<br>The class name should be specified in the form <code>pythonFileName.PythonClassImplementingIlumJob</code>.</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2023/07/interactive-pi.png" class="kg-image" alt="Deploying PySpark Microservice on Kubernetes: Revolutionizing Data Lakes with Ilum." loading="lazy" width="1123" height="826" srcset="/blog/content/images/size/w600/2023/07/interactive-pi.png 600w,/blog/content/images/size/w1000/2023/07/interactive-pi.png 1000w,/blog/content/images/2023/07/interactive-pi.png 1123w" sizes="(min-width: 720px) 720px"></figure><p>We can achieve the same thing with the <a href="https://ilum.cloud/docs/api/?ref=blog.ilum.cloud#tag/Jobs/operation/stop%20job">API</a>.<br><br>1.	Creating a group</p><figure class="kg-card kg-code-card"><pre><code class="language-bash">curl -X POST &apos;localhost:9888/api/v1/group&apos; \
        --form &apos;name=&quot;SparkPiInteractiveExample&quot;&apos; \
        --form &apos;kind=&quot;JOB&quot;&apos; \
        --form &apos;clusterName=&quot;default&quot;&apos; \
        --form &apos;pyFiles=@&quot;/path/to/ilum_python_simple_interactive.py&quot;&apos; \
        --form &apos;language=&quot;PYTHON&quot;&apos;
</code></pre><figcaption><p><span style="white-space: pre-wrap;">API call</span></p></figcaption></figure><figure class="kg-card kg-code-card"><pre><code class="language-json">{&quot;groupId&quot;:&quot;20230726-1638-mjrw3&quot;}</code></pre><figcaption><p><span style="white-space: pre-wrap;">Result</span></p></figcaption></figure><p>2.	Job execution</p><figure class="kg-card kg-code-card"><pre><code class="language-bash">curl -X POST &apos;localhost:9888/api/v1/group/20230726-1638-mjrw3/job/execute&apos; \
	-H &apos;Content-Type: application/json&apos; \
	-d &apos;{ &quot;jobClass&quot;:&quot;ilum_python_simple_interactive.SparkPiInteractiveExample&quot;, &quot;jobConfig&quot;: {&quot;partitions&quot;:&quot;10&quot;}, &quot;type&quot;:&quot;interactive_job_execute&quot;}&apos;</code></pre><figcaption><p><span style="white-space: pre-wrap;">API call</span></p></figcaption></figure><figure class="kg-card kg-code-card"><pre><code class="language-json">{
   &quot;jobInstanceId&quot;:&quot;20230726-1638-mjrw3-a1srahhu&quot;,
   &quot;jobId&quot;:&quot;20230726-1638-mjrw3-wwt5a&quot;,
   &quot;groupId&quot;:&quot;20230726-1638-mjrw3&quot;,
   &quot;startTime&quot;:1690390323154,
   &quot;endTime&quot;:1690390325200,
   &quot;jobClass&quot;:&quot;ilum_python_simple_interactive.SparkPiInteractiveExample&quot;,
   &quot;jobConfig&quot;:{
      &quot;partitions&quot;:&quot;10&quot;
   },
   &quot;result&quot;:&quot;Pi is roughly 3.149400&quot;,
   &quot;error&quot;:null
}</code></pre><figcaption><p><span style="white-space: pre-wrap;">Result</span></p></figcaption></figure><h3 id="22-job-example-with-numpy">2.2 Job example with numpy.</h3><p>Let&apos;s look at our second example.</p><pre><code class="language-python">from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

from ilum.api import IlumJob


class LogisticRegressionJobExample(IlumJob):

    def run(self, spark_session: SparkSession, config: dict) -&gt; str:
        df = spark_session.read.csv(config.get(&apos;inputFilePath&apos;, &apos;s3a://ilum-files/Tel-churn.csv&apos;), header=True,
                                    inferSchema=True)

        categoricalColumns = [&apos;gender&apos;, &apos;Partner&apos;, &apos;Dependents&apos;, &apos;PhoneService&apos;, &apos;MultipleLines&apos;, &apos;InternetService&apos;,
                              &apos;OnlineSecurity&apos;, &apos;OnlineBackup&apos;, &apos;DeviceProtection&apos;, &apos;TechSupport&apos;, &apos;StreamingTV&apos;,
                              &apos;StreamingMovies&apos;, &apos;Contract&apos;, &apos;PaperlessBilling&apos;, &apos;PaymentMethod&apos;]

        stages = []

        for categoricalCol in categoricalColumns:
            stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + &quot;Index&quot;)
            stages += [stringIndexer]

        label_stringIdx = StringIndexer(inputCol=&quot;Churn&quot;, outputCol=&quot;label&quot;)
        stages += [label_stringIdx]

        numericCols = [&apos;SeniorCitizen&apos;, &apos;tenure&apos;, &apos;MonthlyCharges&apos;]

        assemblerInputs = [c + &quot;Index&quot; for c in categoricalColumns] + numericCols
        assembler = VectorAssembler(inputCols=assemblerInputs, outputCol=&quot;features&quot;)
        stages += [assembler]

        pipeline = Pipeline(stages=stages)
        pipelineModel = pipeline.fit(df)
        df = pipelineModel.transform(df)

        train, test = df.randomSplit([float(config.get(&apos;splitX&apos;, &apos;0.7&apos;)), float(config.get(&apos;splitY&apos;, &apos;0.3&apos;))],
                                     seed=int(config.get(&apos;seed&apos;, &apos;42&apos;)))

        lr = LogisticRegression(featuresCol=&quot;features&quot;, labelCol=&quot;label&quot;, maxIter=int(config.get(&apos;maxIter&apos;, &apos;5&apos;)))
        lrModel = lr.fit(train)

        predictions = lrModel.transform(test)

        return &apos;{}&apos;.format(predictions.select(&quot;customerID&quot;, &quot;label&quot;, &quot;prediction&quot;).limit(
            int(config.get(&apos;rowLimit&apos;, &apos;5&apos;))).toJSON().collect())
</code></pre><p>1.	<strong>We wrap the job in a class, just like in the previous example:</strong></p><pre><code class="language-python">class LogisticRegressionJobExample(IlumJob):
    def run(self, spark_session: SparkSession, config: dict) -&gt; str:
        # Job logic here
</code></pre><p>Again, the job logic is encapsulated in the <code>run</code> method of a class extending <code>IlumJob</code>, helping Ilum to handle the job efficiently.</p><p>2.	<strong>All parameters, including those for the data pipeline (like file paths and Logistic Regression hyperparameters), are obtained from the <code>config</code> dictionary:</strong></p><pre><code class="language-python">df = spark_session.read.csv(config.get(&apos;inputFilePath&apos;, &apos;s3a://ilum-files/Tel-churn.csv&apos;), header=True, inferSchema=True)
train, test = df.randomSplit([float(config.get(&apos;splitX&apos;, &apos;0.7&apos;)), float(config.get(&apos;splitY&apos;, &apos;0.3&apos;))], seed=int(config.get(&apos;seed&apos;, &apos;42&apos;)))
lr = LogisticRegression(featuresCol=&quot;features&quot;, labelCol=&quot;label&quot;, maxIter=int(config.get(&apos;maxIter&apos;, &apos;5&apos;)))
</code></pre><p>By centralizing all parameters in one place, Ilum provides a uniform, consistent way of configuring and tuning the job.</p><p>The result of the job, rather than being written to a specific location, is returned as a JSON string:</p><pre><code class="language-python">return &apos;{}&apos;.format(predictions.select(&quot;customerID&quot;, &quot;label&quot;, &quot;prediction&quot;).limit(int(config.get(&apos;rowLimit&apos;, &apos;5&apos;))).toJSON().collect())
</code></pre><p>This allows for more dynamic and flexible handling of the job result, which could then be processed further or exposed via an API, depending on the needs of the application.</p><p>This code perfectly showcases how we can seamlessly integrate PySpark jobs with Ilum to enable interactive, API-driven data processing pipelines. From simple examples like Pi approximation to more complex cases like Logistic Regression, Ilum&apos;s interactive jobs are versatile, adaptable, and efficient.</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2023/07/LogisticRegressionJobExample.png" class="kg-image" alt="Deploying PySpark Microservice on Kubernetes: Revolutionizing Data Lakes with Ilum." loading="lazy" width="1126" height="827" srcset="/blog/content/images/size/w600/2023/07/LogisticRegressionJobExample.png 600w,/blog/content/images/size/w1000/2023/07/LogisticRegressionJobExample.png 1000w,/blog/content/images/2023/07/LogisticRegressionJobExample.png 1126w" sizes="(min-width: 720px) 720px"></figure><h2 id="step-3-making-your-spark-job-a-microservice"><br>Step 3: Making Your Spark Job a Microservice</h2><p>Microservices bring in a paradigm shift from the traditional monolithic application structure to a more modular and agile approach. By breaking down a complex application into small, loosely coupled services, it becomes easier to build, maintain, and scale each service independently based on specific requirements. When applied to our Spark job, this means we could create a robust data processing service that could be scaled, managed, and updated without affecting other parts of our application stack.</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2023/07/ilum_spark_microservice.gif" class="kg-image" alt="Deploying PySpark Microservice on Kubernetes: Revolutionizing Data Lakes with Ilum." 
loading="lazy" width="1200" height="560" srcset="/blog/content/images/size/w600/2023/07/ilum_spark_microservice.gif 600w,/blog/content/images/size/w1000/2023/07/ilum_spark_microservice.gif 1000w,/blog/content/images/2023/07/ilum_spark_microservice.gif 1200w" sizes="(min-width: 720px) 720px"></figure><p>The power of turning your Spark job into a microservice lies in its versatility, scalability, and real-time interaction capabilities. A microservice is an independently deployable component of an application that runs as a separate process. It communicates with other components via well-defined APIs, giving you the freedom to design, develop, deploy, and scale each microservice independently.</p><p>In the context of Ilum, an interactive Spark job can be treated as a microservice. The job&apos;s &apos;run&apos; method acts as an API endpoint. Each time you call this method via Ilum&apos;s API, you&apos;re making a request to this microservice. This opens up the potential for real-time interactions with your Spark job.</p><p>You can make requests to your microservice from various applications or scripts, fetching data, and processing results on the fly. Moreover, it opens up an opportunity to build more complex, service-oriented architectures around your data processing pipelines.</p><p>One key advantage of this setup is scalability. Through the Ilum UI or API, you can scale your job (microservice) up or down based on the load or the computational complexity. You don&apos;t need to worry about manually managing resources or load balancing. Ilum&#x2019;s internal load balancer will distribute API calls between instances of your Spark job, ensuring efficient resource utilization.</p><figure class="kg-card kg-image-card kg-width-wide"><img src="/blog/content/images/2023/07/ilum_scale_interactive_group-1.png" class="kg-image" alt="Deploying PySpark Microservice on Kubernetes: Revolutionizing Data Lakes with Ilum." 
loading="lazy" width="1770" height="541" srcset="/blog/content/images/size/w600/2023/07/ilum_scale_interactive_group-1.png 600w,/blog/content/images/size/w1000/2023/07/ilum_scale_interactive_group-1.png 1000w,/blog/content/images/size/w1600/2023/07/ilum_scale_interactive_group-1.png 1600w,/blog/content/images/2023/07/ilum_scale_interactive_group-1.png 1770w" sizes="(min-width: 1200px) 1200px"></figure><p>Keep in mind that the actual processing time of the job depends on the complexity of the Spark job and the resources allocated to it. However, with the scalability provided by Kubernetes, you can easily scale up your resources as your job&apos;s requirements grow.</p><p>This combination of Ilum, Apache Spark, and microservices brings about a new, agile way to process your data - efficiently, scalably, and responsively!</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2023/07/ilum-ferret-5-1.png" class="kg-image" alt="Deploying PySpark Microservice on Kubernetes: Revolutionizing Data Lakes with Ilum." loading="lazy" width="500" height="491"></figure><h2 id="the-game-changer-in-data-microservice-architecture">The Game-Changer in Data Microservice Architecture</h2><p>We&apos;ve come a long way since we started this journey of transforming a simple Python Apache Spark job into a full-blown microservice using Ilum. We saw how easy it was to write a Spark job, adapt it to work in interactive mode, and ultimately expose it as a microservice with the help of Ilum&apos;s robust API. Along the way, we leveraged the power of Python, the capabilities of Apache Spark, and the flexibility and scalability of Ilum. This combination has not only transformed our data processing capabilities but also changed the way we think about data architecture.</p><p>The journey doesn&apos;t stop here. With full Python support on Ilum, a new world of possibilities opens up for data processing and analytics. 
As we continue to build and improve on Ilum, we&apos;re excited about the future possibilities that Python brings to our platform. We believe that with Python and Ilum together, we&apos;re just at the beginning of redefining what&apos;s possible in the world of data microservice architecture.</p><p>Join us on this exciting journey, and let&apos;s shape the future of data processing together!</p>]]></content:encoded></item><item><title><![CDATA[How to optimize your Spark Cluster with Interactive Spark Jobs]]></title><description><![CDATA[This post describes how to decrease your Apache Spark job execution time. It covers the Ilum key feature of building real-time job interaction.]]></description><link>https://blog.ilum.cloud/how-to-optimize-your-spark-cluster-with-interactive-spark-jobs/</link><guid isPermaLink="false">631c7cecb1575600013b8617</guid><category><![CDATA[News]]></category><dc:creator><![CDATA[Ilum]]></dc:creator><pubDate>Wed, 14 Sep 2022 12:24:13 GMT</pubDate><media:content url="https://blog.ilum.cloud/content/images/2022/09/ilum_spark_on_kubernetes.png" medium="image"/><content:encoded><![CDATA[<img src="https://blog.ilum.cloud/content/images/2022/09/ilum_spark_on_kubernetes.png" alt="How to optimize your Spark Cluster with Interactive Spark Jobs"><p>In this article, you will learn:</p><ul><li>How to decrease your spark job execution time</li><li>What is an interactive job in Ilum</li><li>How to run an interactive spark job</li><li>Differences between running a spark job using Ilum API and Spark API</li></ul><h3 id="ilum-job-types">Ilum job types</h3><p>There are three types of jobs you can run in Ilum: <strong>single job</strong>, <strong>interactive job</strong> and <strong>interactive code</strong>. In this article, we&apos;ll focus on the <strong>interactive job</strong> type. 
However, it&apos;s important to know the differences between the three types of jobs, so let&apos;s take a quick overview of each one.</p><p>With <strong>single jobs</strong>, you submit pre-compiled programs. They allow you to run a Spark application on the cluster without any interaction during runtime. In this mode, you have to send a compiled jar to Ilum, which is used to launch a single job. You can either send it directly, or you can use AWS credentials to get it from an S3 bucket. A typical example of a single job would be a data preparation task.</p><p>Ilum also provides an <strong>interactive</strong> <strong>code mode</strong>, which allows you to submit commands at runtime. This is useful for tasks where you need to interact with the data, such as exploratory data analysis.</p><h3 id="interactive-job">Interactive job</h3><p>Interactive jobs have long-running sessions, where you can send job instance data to be executed right away. The killer feature of this mode is that you don&#x2019;t have to wait for the Spark context to be initialized. Users pointing to the same job ID interact with the same Spark context. 
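<p>In practice, this means you can send repeated calculation requests to one warm group over Ilum&apos;s REST API. The following is a minimal Python sketch of that pattern, assuming the API is port-forwarded on <code>localhost:9888</code> and using the <code>/api/v1/group/{groupId}/job/execute</code> endpoint from our PySpark tutorial; the helper names (<code>build_execute_payload</code>, <code>execute_interactive</code>) are ours, not part of Ilum:</p>

```python
import json
from urllib import request

ILUM_URL = "http://localhost:9888"  # assumes ilum-ui/API is port-forwarded here


def build_execute_payload(job_class: str, job_config: dict) -> dict:
    """Request body for the interactive execute endpoint."""
    return {
        "jobClass": job_class,
        "jobConfig": job_config,
        "type": "interactive_job_execute",
    }


def execute_interactive(group_id: str, job_class: str, job_config: dict) -> dict:
    """POST one calculation request to an already-running interactive group."""
    req = request.Request(
        f"{ILUM_URL}/api/v1/group/{group_id}/job/execute",
        data=json.dumps(build_execute_payload(job_class, job_config)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())


# Repeated calls against the same group ID hit the same warm Spark context,
# so only the first request pays the context start-up cost, e.g.:
# for slices in ("5", "10", "20"):
#     result = execute_interactive("20230726-1638-mjrw3",
#                                  "interactive.job.example.InteractiveJobExample",
#                                  {"slices": slices})
#     print(result["result"])
```

<p>Each call returns the same kind of JSON document the curl example produces, including the <code>result</code> and <code>error</code> fields.</p>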
Ilum wraps Spark application logic into a long-running Spark job which is able to handle calculation requests immediately, without the need to wait for Spark context initialization.</p><figure class="kg-card kg-image-card kg-width-wide"><img src="/blog/content/images/2022/11/spark-job-metrics.png" class="kg-image" alt="How to optimize your Spark Cluster with Interactive Spark Jobs" loading="lazy" width="1763" height="957" srcset="/blog/content/images/size/w600/2022/11/spark-job-metrics.png 600w,/blog/content/images/size/w1000/2022/11/spark-job-metrics.png 1000w,/blog/content/images/size/w1600/2022/11/spark-job-metrics.png 1600w,/blog/content/images/2022/11/spark-job-metrics.png 1763w" sizes="(min-width: 1200px) 1200px"></figure><h3 id="starting-an-interactive-job">Starting an interactive job</h3><p>Let&#x2019;s take a look at how Ilum&#x2019;s interactive session can be started. The first thing we have to do is to set up Ilum. You can do it easily with minikube. A tutorial covering Ilum installation is available at this <a href="https://ilum.cloud/blog/spark-on-kubernetes/?ref=blog.ilum.cloud">link</a>. In the next step, we have to create a jar file which contains an implementation of Ilum&apos;s job interface. To use the Ilum job API, we have to add it to the project with a dependency manager such as Maven or Gradle. In this example, we will use Scala code, built with Gradle, to calculate Pi. </p><p><u>The full example is available on our </u><a href="https://github.com/ilum-cloud/interactive-job-example?ref=blog.ilum.cloud"><u>GitHub</u></a><u>.</u></p><p><u>If you prefer not to build it yourself, you can find the compiled jar file </u><a href="https://ilum.cloud/release/latest/ilum-interactive-spark-pi.jar?ref=blog.ilum.cloud" rel="noreferrer"><u>here</u></a><u>.</u></p><p>The first step is to create a folder for our project and change the directory into it.</p><pre><code>$ mkdir interactive-job-example
$ cd interactive-job-example</code></pre><p>If you don&#x2019;t have the newest version of Gradle installed on your computer, you can check how to do it <a href="https://docs.gradle.org/current/userguide/installation.html?ref=blog.ilum.cloud">here</a>. Then run the following command in a terminal from inside the project directory:</p><pre><code>$ gradle init</code></pre><p>Choose a Scala application with Groovy as DSL. The output should look like this:<br></p><pre><code>Starting a Gradle Daemon (subsequent builds will be faster)

Select type of project to generate:
  1: basic
  2: application
  3: library
  4: Gradle plugin
Enter selection (default: basic) [1..4] 2

Select implementation language:
  1: C++
  2: Groovy
  3: Java
  4: Kotlin
  5: Scala
  6: Swift
Enter selection (default: Java) [1..6] 5

Split functionality across multiple subprojects?:
  1: no - only one application project
  2: yes - application and library projects
Enter selection (default: no - only one application project) [1..2] 1

Select build script DSL:
  1: Groovy
  2: Kotlin
Enter selection (default: Groovy) [1..2] 1

Generate build using new APIs and behavior (some features may change in the next minor release)? (default: no) [yes, no] no                           
Project name (default: interactive-job-example): 
Source package (default: interactive.job.example): 

&gt; Task :init
Get more help with your project: https://docs.gradle.org/7.5.1/samples/sample_building_scala_applications_multi_project.html

BUILD SUCCESSFUL in 30s
2 actionable tasks: 2 executed</code></pre><p>Now we have to add the Ilum repository and necessary dependencies into your <strong>build.gradle</strong> file. In this tutorial, we will use Scala 2.12.</p><pre><code>
dependencies {
    implementation &apos;org.scala-lang:scala-library:2.12.16&apos;
    implementation &apos;cloud.ilum:ilum-job-api:5.0.1&apos;
    compileOnly &apos;org.apache.spark:spark-sql_2.12:3.1.2&apos;
}</code></pre><p>Now we can create a Scala class that extends Ilum&#x2019;s Job and which calculates PI:</p><pre><code class="language-Scala">package interactive.job.example

import cloud.ilum.job.Job
import org.apache.spark.sql.SparkSession
import scala.math.random

class InteractiveJobExample extends Job {

  override def run(sparkSession: SparkSession, config: Map[String, Any]): Option[String] = {

    val slices = config.getOrElse(&quot;slices&quot;, &quot;2&quot;).toString.toInt
    val n = math.min(100000L * slices, Int.MaxValue).toInt
    val count = sparkSession.sparkContext.parallelize(1 until n, slices).map { i =&gt;
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x * x + y * y &lt;= 1) 1 else 0
    }.reduce(_ + _)
    Some(s&quot;Pi is roughly ${4.0 * count / (n - 1)}&quot;)
  }
}
</code></pre><p>If Gradle has generated some main or test classes, just remove them from the project and run the build:</p><pre><code>$ gradle build</code></pre><p>The generated jar file should be in &apos;<strong>./interactive-job-example/app/build/libs/app.jar</strong>&apos;. We can then switch back to Ilum. Once all pods are running, set up a port forward for ilum-ui:</p><pre><code>kubectl port-forward svc/ilum-ui 9777:9777</code></pre><p>Open the Ilum UI in your browser and create a new group:</p><figure class="kg-card kg-image-card kg-width-wide"><img src="/blog/content/images/2022/12/ilum_spark_ui_5_0.png" class="kg-image" alt="How to optimize your Spark Cluster with Interactive Spark Jobs" loading="lazy" width="1898" height="763" srcset="/blog/content/images/size/w600/2022/12/ilum_spark_ui_5_0.png 600w,/blog/content/images/size/w1000/2022/12/ilum_spark_ui_5_0.png 1000w,/blog/content/images/size/w1600/2022/12/ilum_spark_ui_5_0.png 1600w,/blog/content/images/2022/12/ilum_spark_ui_5_0.png 1898w" sizes="(min-width: 1200px) 1200px"></figure><p>Enter a name for the group, choose or create a cluster, upload your jar file, and apply the changes:</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/11/ilum-create-interactive-spark-job.png" class="kg-image" alt="How to optimize your Spark Cluster with Interactive Spark Jobs" loading="lazy" width="490" height="733"></figure><p>Ilum will create a Spark driver pod, and you can control the number of Spark executor pods by scaling them. 
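<p>Before executing on the cluster, the Monte Carlo logic inside the Scala class above can be sanity-checked locally. This plain-Python mirror of it is only an illustrative sketch (the function name and point count are ours, and nothing Ilum requires):</p>

```python
import random


def estimate_pi(slices: int, points_per_slice: int = 100_000) -> float:
    """Mirror of the Scala job's logic: sample random points in the unit
    square and count the fraction falling inside the unit circle."""
    n = slices * points_per_slice
    count = sum(
        1
        for _ in range(n)
        if random.uniform(-1, 1) ** 2 + random.uniform(-1, 1) ** 2 <= 1
    )
    # The hit ratio approximates (circle area) / (square area) = pi / 4.
    return 4.0 * count / n


print(estimate_pi(10))  # prints a value close to 3.14
```

<p>More slices means more sampled points, which is exactly why passing a larger <code>slices</code> value to the interactive job tightens the estimate.</p>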
After the Spark container is ready, let&#x2019;s execute the job:</p><figure class="kg-card kg-image-card kg-width-wide"><img src="/blog/content/images/2022/12/ilum-interactive-spark-execute-job.png" class="kg-image" alt="How to optimize your Spark Cluster with Interactive Spark Jobs" loading="lazy" width="1788" height="563" srcset="/blog/content/images/size/w600/2022/12/ilum-interactive-spark-execute-job.png 600w,/blog/content/images/size/w1000/2022/12/ilum-interactive-spark-execute-job.png 1000w,/blog/content/images/size/w1600/2022/12/ilum-interactive-spark-execute-job.png 1600w,/blog/content/images/2022/12/ilum-interactive-spark-execute-job.png 1788w" sizes="(min-width: 1200px) 1200px"></figure><p>Now we have to put the canonical name of our Scala class</p><pre><code>interactive.job.example.InteractiveJobExample</code></pre><p> and define the slices parameter in JSON format:</p><pre><code>{
  &quot;config&quot;: {
    &quot;slices&quot;: &quot;10&quot;
  }
}</code></pre><p>You should see the outcome right after the job starts.</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/11/ilum-spark-interactive-job-result.png" class="kg-image" alt="How to optimize your Spark Cluster with Interactive Spark Jobs" loading="lazy" width="1124" height="749" srcset="/blog/content/images/size/w600/2022/11/ilum-spark-interactive-job-result.png 600w,/blog/content/images/size/w1000/2022/11/ilum-spark-interactive-job-result.png 1000w,/blog/content/images/2022/11/ilum-spark-interactive-job-result.png 1124w" sizes="(min-width: 720px) 720px"></figure><p>You can change the parameters and rerun the job, and your calculations will occur on the spot.</p><h3 id="interactive-and-single-job-comparison">Interactive and single job comparison</h3><p>In Ilum you can also run a single job. The most important difference compared to interactive mode is that you don&#x2019;t have to implement the Job API. We can use the SparkPi jar from Spark examples:</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/11/ilum-spark-ui-simple-job.png" class="kg-image" alt="How to optimize your Spark Cluster with Interactive Spark Jobs" loading="lazy" width="480" height="649"></figure><p>Running a job like this is also quick, but interactive jobs are <strong>20 times faster (4s vs 200ms)</strong>. If you would like to start a similar job with other parameters, you will have to prepare a new job and upload the jar again.<br></p><h3 id="ilum-and-plain-apache-spark-comparison">Ilum and plain Apache Spark comparison</h3><p><br>I&apos;ve set up Apache Spark locally with a <a href="https://hub.docker.com/r/bitnami/spark?ref=blog.ilum.cloud">bitnami/spark</a> docker image. If you would also like to run Spark on your machine, you can use docker-compose:</p><pre><code>$ curl -LO https://raw.githubusercontent.com/bitnami/containers/main/bitnami/spark/docker-compose.yml
$ docker-compose up</code></pre><p>Once Spark is running, you should be able to go to localhost:8080 and see the admin UI. We need to get the Spark URL from the browser:</p><figure class="kg-card kg-image-card"><img src="https://lh3.googleusercontent.com/mNjqIQKqLj9y5aFGmRpWGCBq-61UjUPnXqlySHTvxqUbLkAfNGuDES1wTdZ05rcQ4wX6Bvsq5ZxOkkgspyM-Ibx0ps59u8OliNT15coAtYRZwnd4hnlldcKpGh377vQQ2dGYvL_CgympKf1lYzhCOOBMHxznAVfoM2u4zAgIW3TMerDAB0ixhSE-1g" class="kg-image" alt="How to optimize your Spark Cluster with Interactive Spark Jobs" loading="lazy" width="642" height="58"></figure><p>Then, we have to open the Spark container in interactive mode:</p><pre><code>$ docker exec -it &lt;containerid&gt; bash</code></pre><figure class="kg-card kg-image-card"><img src="https://lh4.googleusercontent.com/bsDA4-3VFNJ_gn_KcgPC7kZ1LwpugblM-I7JDeLjciQFSzdzUrRntgLidVK9uIADUxH-bgz9puxGdRKA-0BwNpqlsD7iVljrw-4BIoZNUBrRO-0Nw-8kWX7XIwDzsZPHg9NRhbiYqS9Be12fli5e2hQy8Wqd6wNUOYKlGR-GxUhqwohZzp7hWq4P2g" class="kg-image" alt="How to optimize your Spark Cluster with Interactive Spark Jobs" loading="lazy" width="432" height="17"></figure><p>Now, inside the container, we can submit the SparkPi job. In this case, we will use SparkPi from the examples jar and, as the master parameter, put the URL from the browser:</p><pre><code>$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi\
  --master spark://78c84485d233:7077 \
  /opt/bitnami/spark/examples/jars/spark-examples_2.12-3.3.0.jar\
  10</code></pre><h3 id="summary">Summary</h3><p>As you can see in the example above, you can avoid the complicated configuration and installation of your Spark client by using Ilum. Ilum takes over the work and provides you with a simple and convenient interface. Moreover, it allows you to overcome the limitations of Apache Spark, which can take a very long time to initialize. If you have to do many job executions with similar logic but different parameters and would like to have calculations done immediately, you should definitely use interactive job mode.</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/09/ilum-spark-ferret-1.png" class="kg-image" alt="How to optimize your Spark Cluster with Interactive Spark Jobs" loading="lazy" width="1024" height="1007" srcset="/blog/content/images/size/w600/2022/09/ilum-spark-ferret-1.png 600w,/blog/content/images/size/w1000/2022/09/ilum-spark-ferret-1.png 1000w,/blog/content/images/2022/09/ilum-spark-ferret-1.png 1024w" sizes="(min-width: 720px) 720px"></figure><h3 id="similarities-with-apache-livy">Similarities with Apache Livy</h3><p>Ilum is a cloud-native tool for managing Apache Spark deployments on Kubernetes. It is similar to Apache Livy in terms of functionality - it can control a Spark Session over REST API and build a real-time interaction with a Spark Cluster. However, Ilum is designed specifically for modern, cloud-native environments.</p><p>We used Apache Livy in the past, but we have reached the point where Livy was just not suitable for modern environments. <strong>Livy is obsolete</strong> compared to Ilum. In 2018, we started moving all our environments to Kubernetes, and we had to find a way to deploy, monitor and maintain Apache Spark on Kubernetes. 
This was the perfect occasion to build Ilum.</p><div class="kg-card kg-button-card kg-align-center"><a href="https://ilum.cloud/?ref=blog.ilum.cloud" class="kg-btn kg-btn-accent">Try it, it&apos;s free</a></div><p><br></p>]]></content:encoded></item></channel></rss>