
Apache Spark Connect on Ilum: Configuration and Connection Guide

What is Spark Connect?

Spark Connect is a modern client-server interface for Apache Spark that enables remote execution of Spark workloads from lightweight clients such as Python, Java, Scala, R, and SQL-based tools. Introduced in Spark 3.4, Spark Connect decouples the Spark client from the Spark runtime, allowing developers to build interactive data applications, notebooks, and dashboards without deploying the full Spark engine locally.

Spark Connect leverages gRPC-based communication to interact with a remote Spark server, offering flexibility, improved security, and simplified infrastructure for data engineering, data science, and analytics workflows.

This model closely mirrors Ilum's approach to Spark microservices, where Spark components are containerized and exposed as dynamic services. It follows the same design described in Deploying PySpark Microservice on Kubernetes: both enable scalable, stateless, and secure access to Spark without a full cluster setup on the client side.

Why use Spark Connect on Kubernetes?

Traditional Spark submission often requires complex local setups (Java, Hadoop binaries, exact Spark versions). Spark Connect eliminates this "dependency hell."

| Feature | Traditional Spark Submission (spark-submit) | Spark Connect |
|---|---|---|
| Architecture | Monolithic (Driver runs on client or cluster edge) | Decoupled (Client is separate from Server) |
| Client Requirements | Heavy (Requires Java, Spark binaries, Hadoop configs) | Lightweight (Only Python/Go/Scala library required) |
| Network Protocol | Custom RPC (Sensitive to version mismatch) | gRPC (Standard, version-agnostic, firewall-friendly) |
| Iteration Speed | Slow (Build & Deploy jars) | Fast (Interactive, REPL-style development) |
| Language Support | Limited to JVM/Python | Polyglot (Python, Scala, Go, Rust, etc.) |
Feature Overview

For a deeper dive into how Ilum leverages this for multi-tenancy, see our Architecture Documentation.

In Ilum, Spark Connect aligns naturally with our microservice-based Spark architecture. You can deploy a Spark Connect server as a standard job and access it through various connection methods, using pod name, pod IP, or a service exposed via Kubernetes.


Prepare Your Client Environment

Before connecting, you need a lightweight client library. Unlike traditional Spark, you do not need a local JVM or Hadoop installation.

Python (PySpark)

Install PySpark with Connect support (Spark 4 / Python 3.10+). The quotes keep shells such as zsh from interpreting the square brackets:
pip install "pyspark[connect]==4.0.1" grpcio-status

Scala (sbt)

For Scala applications, add the Spark Connect client dependency:

libraryDependencies += "org.apache.spark" %% "spark-connect-client-jvm" % "4.0.1"

Spark SQL CLI

You can also use the generic Spark SQL CLI to connect remotely:

/path/to/spark/bin/spark-sql --remote "sc://<ilum-cluster-address>:15002"

Note: Always match your client library version with the Spark version running on your Ilum cluster (e.g., 4.0.1, or 3.5.7 if the cluster still runs Spark 3.5).
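The version-matching rule above can be checked programmatically before you open a session. The following is a minimal sketch, not part of Ilum or PySpark: the helper names `major_minor` and `versions_compatible` are hypothetical, and treating "same major.minor" as compatible is an assumption made for illustration.

```python
import re


def major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a version string such as '4.0.1'."""
    match = re.match(r"(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def versions_compatible(client_version: str, server_version: str) -> bool:
    """Assumed policy: client and server agree when major.minor are equal."""
    return major_minor(client_version) == major_minor(server_version)
```

With a live session you could pass `pyspark.__version__` as the client version and `spark.version` as the server version.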


Creating Apache Spark Connect Instance via Ilum UI

Follow these steps to launch a Spark Connect server as a job on your Ilum cluster using the web UI:

  1. Start a New Spark Job: Log in to the Ilum UI and navigate to the Jobs section. Click on New Job to create a new Spark job.

  2. Job Name: Enter a recognizable name for the job (e.g., Spark Connect Server) to identify it later in the UI.

  3. Main Class: Set the job's main class to:

    org.apache.spark.sql.connect.service.SparkConnectServer

    This is the built-in Spark class that starts the Spark Connect server process, enabling remote connectivity to Spark clusters.

  4. Spark Configuration: Go to the Configuration tab/section for the job. Add the following Spark property to ensure the Spark Connect server code is available:

Key: spark.jars.packages

Value: org.apache.spark:spark-connect_2.13:4.0.1

This configuration instructs Spark to fetch the Spark Connect library from Maven when the job starts.

  5. (Optional) Label the Pod: If you plan to expose this Spark Connect server via a Kubernetes Service, add a label to the Spark driver pod:

    • Key: spark.kubernetes.driver.label.type
    • Value: sparkconnect

    This will tag the Spark Connect server's pod with a label type=sparkconnect for easy service selection.

  6. Submit the Job: Click Submit. Ilum will deploy the Spark job to the cluster. After a short time, you should see the job in the running jobs list.

  7. Verify the Server is Running: Wait for the job status to become "Running". You can check the job's logs for a message indicating Spark Connect has started (e.g., a log line mentioning port 15002). Once running, the Spark Connect server is listening for client connections on the default port 15002.
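If you can reach the driver pod from where you are working, a quick TCP probe is one way to confirm that something is listening on port 15002. This is a minimal sketch using only the Python standard library; `port_is_open` is a hypothetical helper name. A successful probe only shows that the port accepts connections, not that the Spark Connect service behind it is healthy.

```python
import socket


def port_is_open(host: str, port: int = 15002, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and opens a TCP socket
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures
        return False
```

For example, `port_is_open("localhost")` checks the default Spark Connect port on a forwarded local address.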

Job form with Spark Connect option

Spark Connect server link

Missing Spark Connect Dependency?

If your job fails immediately, ensure you added spark.jars.packages with the correct version.
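Because the package coordinate embeds both the Scala and Spark versions, it is easy to mistype. The sketch below derives the properties used in this guide from the version numbers; the function names `connect_package` and `connect_server_conf` are hypothetical, and the default Scala suffix of 2.13 is taken from the value shown above (check the Scala version of your cluster image before relying on it).

```python
from typing import Optional


def connect_package(spark_version: str, scala_version: str = "2.13") -> str:
    """Build the Maven coordinate for the Spark Connect server library."""
    return f"org.apache.spark:spark-connect_{scala_version}:{spark_version}"


def connect_server_conf(
    spark_version: str,
    scala_version: str = "2.13",
    pod_label: Optional[str] = None,
) -> dict:
    """Return the Spark properties described in the steps above."""
    conf = {"spark.jars.packages": connect_package(spark_version, scala_version)}
    if pod_label:
        # Optional driver pod label, used later for Service selection
        conf["spark.kubernetes.driver.label.type"] = pod_label
    return conf
```

Calling `connect_server_conf("4.0.1", pod_label="sparkconnect")` yields the two key/value pairs you would enter in the job's Configuration section.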


Connecting to the Spark Connect Server

Once the Spark Connect server is running, you can connect to it from a Spark client (e.g., PySpark, Spark shell, sparklyr, etc.) using the Spark Connect URL (sc://...). Below are different connection methods depending on your network setup:

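If your environment allows DNS resolution of pod names (for example, your client is within the cluster or can resolve the cluster's internal DNS), you can connect using the pod's DNS name. This method addresses the pod as <pod-name>.<namespace>.pod.cluster.local, which must resolve to the pod's IP address inside the cluster. Name-based pod records depend on how the cluster's DNS is configured (stock Kubernetes DNS keys pod records by the dashed IP address, such as 10-0-0-5.default.pod.cluster.local), so confirm that the name resolves before relying on it.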

Steps:

  • Find the Pod Name: In the Ilum UI, locate the Spark Connect job you started. Note the driver pod name (Ilum may show it in the job details or logs). It will be something like job-xxxxxx-driver (the exact format may vary).
  • Construct the URL: Use the pod's fully qualified DNS name. For example, if the pod name is job-abc123-driver in the default namespace, the address would be:
    sc://job-abc123-driver.default.pod.cluster.local:15002
  • Connect via Spark Client: Use this URL in your SparkSession builder or Spark shell. For example, in PySpark you can do:
    notebook.ipynb
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.remote(
        "sc://job-abc123-driver.default.pod.cluster.local:15002"
    ).getOrCreate()

This will create a Spark session that connects remotely to the Spark Connect server at the given DNS address. Ensure that your environment's DNS can resolve .pod.cluster.local addresses (usually only true if running within the cluster or via VPN to the cluster network).
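The URL construction described in the steps above can be sketched as a small helper, which avoids typos when pod names change between job runs. `build_connect_url` is a hypothetical name, and the `cluster.local` default assumes your cluster uses the standard cluster domain.

```python
def build_connect_url(
    pod_name: str,
    namespace: str = "default",
    port: int = 15002,
    cluster_domain: str = "cluster.local",
) -> str:
    """Build an sc:// URL from the driver pod name and namespace."""
    if not pod_name or not namespace:
        raise ValueError("pod_name and namespace must be non-empty")
    return f"sc://{pod_name}.{namespace}.pod.{cluster_domain}:{port}"


# Usage with PySpark (requires a reachable Spark Connect server):
# from pyspark.sql import SparkSession
# url = build_connect_url("job-abc123-driver")
# spark = SparkSession.builder.remote(url).getOrCreate()
```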

Connection via Pod DNS

Tip: If your client is running inside the same namespace in the cluster, you might not need the full domain. For instance, sc://job-abc123-driver:15002 could work thanks to the Kubernetes DNS search path. However, the fully qualified pod.cluster.local address, including the namespace, is the most explicit and reliable approach.


Cleanup Tasks

After you are done with your Spark Connect session(s), perform the following cleanup steps to free resources and avoid orphaned connections:

  1. Stop the Spark Connect Job: In the Ilum UI, navigate to the running Spark Connect job and click Stop or Terminate. This will shut down the Spark Connect server process on the cluster. Confirm that the job's status changes to stopped/finished. (If you forget this step, the Spark Connect server keeps running and holds cluster resources that your other Spark applications could use.)

  2. Terminate Port-Forward Sessions: If you used kubectl port-forward, go to the terminal where it's running and press Ctrl+C to end the port-forwarding. This closes the tunnel and frees up your local port. If you ran port-forward in the background, make sure to kill that process.

  3. Delete Kubernetes Service (if created): If you exposed a Service for Spark Connect, remove it when it's no longer needed. You can delete it with:

    kubectl delete service spark-connect-service -n default

    Replace spark-connect-service and namespace as appropriate. This ensures you don't leave an open network endpoint in the cluster. (If you set up a LoadBalancer, deleting the Service will also release the external IP/port. If you used a NodePort, it frees that port on the nodes for other uses.)

By cleaning up, you ensure that no stray processes or open ports related to your Spark Connect usage are left behind, which frees resources on your Spark cluster.
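On the client side, one way to avoid leaving sessions open is to wrap session creation in a context manager that always calls `stop()`. This is a minimal sketch under the assumption that the session object exposes a `stop()` method, as PySpark's `SparkSession` does; `managed_session` is a hypothetical helper. It ends the client session only and does not stop the server job in Ilum.

```python
from contextlib import contextmanager


@contextmanager
def managed_session(create_session):
    """Yield a session from create_session() and always call stop() on exit."""
    session = create_session()
    try:
        yield session
    finally:
        # Runs on normal exit and when the body raises
        session.stop()


# Usage with PySpark (requires a reachable Spark Connect server):
# from pyspark.sql import SparkSession
# factory = lambda: SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
# with managed_session(factory) as spark:
#     spark.range(5).show()
```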


Troubleshooting Spark Connect Issues

Here are solutions to the most common errors when connecting to Spark on Kubernetes.

How to fix "Connection Refused" on port 15002?

If your client fails with ConnectionRefusedError or UNAVAILABLE:

Cause: The client cannot reach the Spark Driver pod. This is usually a networking issue, not a Spark issue.

Solution:

  1. Check Job Status: Is the job actually RUNNING in the Ilum UI?
  2. Check Network Access:
    • If you are outside the cluster (e.g., local laptop), you cannot use the Pod IP directly. You must use kubectl port-forward (Method 3) or a NodePort/LoadBalancer Service (Method 4).
  3. Verify Port: Ensure you are connecting to 15002 (Spark Connect), not 4040 (Spark UI).
  4. Test Connection: Run nc -vz localhost 15002 (if using port-forward).

How to resolve "Name or service not known" (DNS Error)?

Cause: Your local machine doesn't know how to resolve Kubernetes internal DNS names like job-xyz.default.pod.cluster.local.

Solution:

  • Option A: Use kubectl port-forward and connect to sc://localhost:15002.
  • Option B: Connect using the Pod IP directly (only works if you are on the same VPN/VPC).
  • Option C: Configure your local /etc/hosts to point the DNS name to 127.0.0.1 (combined with port forwarding).

How to fix "Pod not found" during port-forwarding?

Cause: Spark Driver pods are ephemeral. If you restart the job, the pod name changes (e.g., from job-abc-driver to job-xyz-driver).

Solution:

  • Always check the current driver pod name in the Ilum UI or via kubectl get pods -l spark-role=driver.
  • Use a Service (Method 4) to get a stable hostname that doesn't change between restarts.
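Looking up the current driver pod can also be scripted. The sketch below parses the JSON that `kubectl get pods -l spark-role=driver -o json` returns and picks the most recently created pod in the Running phase. The function names are hypothetical, and the selector matches every Spark driver in the namespace, so you may want to narrow it (for example with the type=sparkconnect label, if you set it) when several jobs run side by side.

```python
import json
import subprocess
from typing import Optional


def newest_running_driver(pod_list: dict) -> Optional[str]:
    """Return the name of the newest Running pod in a kubectl JSON pod list."""
    running = [
        pod for pod in pod_list.get("items", [])
        if pod.get("status", {}).get("phase") == "Running"
    ]
    if not running:
        return None
    # creationTimestamp is an RFC 3339 UTC string, so string order is time order
    newest = max(running, key=lambda pod: pod["metadata"]["creationTimestamp"])
    return newest["metadata"]["name"]


def current_driver_pod(namespace: str = "default") -> Optional[str]:
    """Query kubectl for driver pods and return the newest running one."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace,
         "-l", "spark-role=driver", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return newest_running_driver(json.loads(result.stdout))
```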

Error: "Client version mismatch" or "Unsupported Protocol"

Cause: The client and server run different Spark versions, for example a Spark 3.5 client connecting to a Spark 4.0 server (or vice versa).

Solution: Check your client version:

pip show pyspark

It must match the Ilum cluster version (e.g., both must be 4.0.x).

Error: "ModuleNotFoundError: No module named 'grpc'"

Cause: The gRPC Python packages (grpcio and grpcio-status) are missing. PySpark ships them only as optional extras, but the Spark Connect client cannot run without them.

Solution:

pip install grpcio-status
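To see which gRPC packages are missing before a connection attempt fails, you can probe for their import names. This is a minimal sketch using the standard library; `missing_packages` and `CONNECT_MODULES` are hypothetical names, and the mapping lists only the two gRPC packages discussed here, not every dependency of the Spark Connect client.

```python
import importlib.util

# Import name -> pip package name for the gRPC dependencies discussed above
CONNECT_MODULES = {"grpc": "grpcio", "grpc_status": "grpcio-status"}


def missing_packages(modules=CONNECT_MODULES):
    """Return pip package names whose import name cannot be found."""
    return [
        package
        for module, package in modules.items()
        if importlib.util.find_spec(module) is None
    ]
```

An empty list from `missing_packages()` means both modules are importable in the current environment.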

By following this guide, you should be able to configure a Spark Connect server on Ilum and connect to it through various methods. The Ilum UI makes it easy to deploy the Spark Connect instance, and with the above techniques, you can access it whether you are inside the Kubernetes cluster or working remotely. Happy connecting!