
Spark Connect Server

🔗 What is Spark Connect?

Spark Connect is a client-server architecture for Apache Spark that allows you to run Spark jobs remotely. Introduced in Spark 3.4, it decouples the client application from the Spark cluster, enabling you to interact with Spark from lightweight applications written in Python, Java, Scala, R, or even from SQL-based tools.

With Spark Connect, developers can build interactive data applications, notebooks, and dashboards without needing a full Spark installation on their local machines. It uses a gRPC-based protocol for efficient and modern communication between the client and the remote Spark server.
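For example, the Python client and its gRPC dependencies can be installed with pip alone, with no local Spark cluster required. This is a minimal sketch; the exact package extras vary between Spark releases:

pip install "pyspark[connect]"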

This approach offers several key advantages:

  • Stability: The client and server are isolated. A memory-intensive client application won't crash the Spark driver.
  • Upgradability: You can upgrade the Spark cluster independently of your client applications.
  • Security: It provides a secure endpoint for accessing a remote cluster.
  • Reach: You can connect to Spark from anywhere, including IDEs, notebooks, and applications running outside the cluster's network.

This architecture is very similar to Ilum’s approach to Spark microservices, where Spark components are containerized and exposed as dynamic services. As described in Deploying PySpark Microservice on Kubernetes, both designs enable scalable, stateless, and secure access to Spark without requiring a full cluster setup on the client side.

🧱 Architecture Overview

Spark Connect introduces a decoupled architecture that separates the client from the Spark driver and cluster. Here’s how it works:

  • The Client: Your application (e.g., a Python script or a Jupyter notebook) uses the Spark Connect client library. When you write DataFrame operations, the client translates them into unresolved logical query plans. These plans are then sent to the Spark Connect server using gRPC.
  • The Server: The server is a component running within the Spark driver on the cluster. It receives the logical plans, uses Spark’s Catalyst optimizer to create an optimized physical execution plan, and then executes the job on the cluster. The results are streamed back to the client.
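A minimal sketch of this flow from the client's point of view (the server URL and data below are placeholders): transformations only build a logical plan on the client, and nothing reaches the server until an action such as show() triggers execution.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Connect to a remote Spark Connect server (placeholder URL)
spark = SparkSession.builder.remote("sc://spark-connect-host:15002").getOrCreate()

# These calls only build an unresolved logical plan on the client side
df = spark.range(1_000_000).withColumn("squared", F.col("id") * F.col("id"))
even = df.where(F.col("id") % 2 == 0)

# The plan is sent over gRPC, optimized by Catalyst, and executed on the cluster
# only when an action is called; results are streamed back to the client
even.show(5)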

This separation provides better control, security, and scalability for distributed data processing, especially in modern cloud-native and multi-tenant environments.

In Ilum, Spark Connect aligns naturally with our microservice-based architecture. You can deploy a Spark Connect server as a standard Ilum job and access it through various connection methods, such as the pod name, pod IP, or a Kubernetes service.


Creating a Spark Connect Job in Ilum

To create a Spark Connect job in Ilum, select the Spark Connect Job option from the New Job form.

[Screenshot: Job form with the Spark Connect option] Choosing the Spark Connect job type automatically populates the required configuration.

Missing Spark Connect Dependency?

While Ilum pre-fills the necessary job configuration, it does not verify that your Docker image contains the Spark Connect server.

If your job fails with an error similar to this, it means the Spark Connect dependency is missing from your Spark distribution:

25/08/07 15:41:12 ERROR SparkApplicationCreator$: Failed to load class org.apache.spark.sql.connect.service.SparkConnectServer: org.apache.spark.sql.connect.service.SparkConnectServer
25/08/07 15:41:12 ERROR SingleEntrypoint: Exception occurred during job execution
org.apache.spark.SparkUserAppException: User application exited with 101

If your cluster has an internet connection, you can resolve this by adding the Spark Connect package via Spark configuration. In the Parameters section of the job form, add:

spark.jars.packages: org.apache.spark:spark-connect_<scala-version>:<spark-version>

Be sure to replace <scala-version> (e.g., 2.12) and <spark-version> (e.g., 3.5.6) with the versions that match your environment.
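For example, assuming a Spark 3.5.6 distribution built with Scala 2.12, the entry would be:

spark.jars.packages: org.apache.spark:spark-connect_2.12:3.5.6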

The server has started successfully once the following line appears in the driver logs:

25/08/07 16:00:03 INFO SparkConnectServer: Spark Connect server started at: 0:0:0:0:0:0:0:0%0:15002

[Screenshot: Spark Connect server link] After the job starts, you can find the Spark Connect server URL on the job details page.

Once the Spark Connect server is running, you can connect to it from any Spark client using the URL provided by Ilum.

Connecting from Within the Kubernetes Cluster

If your client application is running in the same Kubernetes cluster, you can use the provided URL directly.

For example, to start a PySpark shell:

pyspark --remote <your-URL>

Or, to connect from a Python script:

from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("<your-URL>").getOrCreate()

# Now you can use Spark as usual
df = spark.createDataFrame([("Alice", 1), ("Bob", 2)], ["name", "id"])
df.show()

Connecting from Outside the Cluster

For local development, you can connect to the Spark Connect server from your local machine. The easiest way to achieve this is by forwarding the server’s port using kubectl.

First, you need the name of the driver pod. The pod name is typically the hostname part of the Spark Connect URL, but without the -svc suffix. For example, if the URL is sc://job-20250807-1557-ablr2a52vxd-e5282f9885429661-driver-svc:15002, the driver pod name is likely job-20250807-1557-ablr2a52vxd-e5282f9885429661-driver.

You can confirm the exact pod name by navigating to the Logs tab in the Ilum UI.

[Screenshot: Driver pod name in logs] The driver pod name highlighted in the Logs tab.
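Alternatively, assuming you have kubectl access to the namespace where Ilum runs its jobs, you can list the driver pods directly (the namespace is a placeholder):

kubectl get pods -n <ilum-namespace> | grep driver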

With the driver pod name, run the following command in your terminal to forward port 15002:

kubectl port-forward <driver-pod-name> 15002:15002

This command forwards traffic from localhost:15002 on your machine to the Spark Connect server port inside the cluster. You should be able to connect to the Spark instance using the local URL sc://localhost:15002.
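With the port-forward active, a local Spark client can connect through the forwarded port. A minimal sketch reusing the earlier example:

from pyspark.sql import SparkSession

# Connect through the locally forwarded port
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

df = spark.createDataFrame([("Alice", 1), ("Bob", 2)], ["name", "id"])
df.show()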