
Handling Spark Dependencies in Ilum

Ilum provides three methods to handle dependencies, each suited for different use cases.

1. Building a Custom Docker Image (Recommended for Production)

This method involves creating a custom Docker image that includes all required dependencies. It ensures consistency across environments and is the best approach for production workloads.

Steps to Create a Custom Spark Image

  1. Start with the official Ilum Spark base image.
  2. Add necessary JARs for any Java-based dependencies.
  3. Install required Python packages.
  4. Build and push the image to a private or public registry.
  5. Configure Ilum to use this new image.

Example: Adding Apache Iceberg Support

Below is an example Dockerfile that builds on the Ilum Spark base image and adds support for Apache Iceberg:

FROM ilum/spark:3.5.3

USER root

# Add JARs for Iceberg support
ADD --chmod=644 https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.8.0/iceberg-spark-runtime-3.5_2.12-1.8.0.jar $SPARK_HOME/jars

# Install Python dependencies
RUN python3 -m pip install pandas pyiceberg[hive,s3fs,pandas,snappy,gcsfs,adlfs]

USER ${spark_uid}

Build and Push the Image

After writing the Dockerfile (for example, saved as Dockerfile in the current directory), build and push the image:

docker build -t my-private-repo/spark:3.5.3-iceberg .
docker push my-private-repo/spark:3.5.3-iceberg
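
If you want to double-check that the JAR actually landed in the image before configuring Ilum, a small Python snippet run inside a container started from the image (or inside any job that uses it) will list it. SPARK_HOME is read from the environment; the /opt/spark fallback below is only an assumption, not a documented Ilum path.

# List the Iceberg runtime JARs bundled into the image.
# The fallback path /opt/spark is an assumption; the real value comes from $SPARK_HOME.
import glob
import os

spark_home = os.environ.get("SPARK_HOME", "/opt/spark")
print(glob.glob(os.path.join(spark_home, "jars", "iceberg-spark-runtime-*.jar")))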

Configuring Ilum to Use the Custom Image

Once the image is available in a container registry, update Ilum to use this custom Spark image:

  • Edit cluster settings: Set the Spark version in the UI, either in the Spark cluster settings:

    [Screenshot: setting the Spark version in the Spark cluster settings]

    or in the Spark cluster properties section:

    [Screenshot: setting the Spark version in the Spark cluster properties section]

  • Setting up the default cluster:
    a) During the installation process: Add this flag to your helm install command:

    --set ilum-core.kubernetes.defaultCluster.config.spark\\.kubernetes\\.container\\.image="my-private-repo/spark:3.5.3-iceberg"

    b) On a preinstalled instance: In the cluster configuration in the Ilum UI, set the default Spark image:

    spark.kubernetes.container.image: my-private-repo/spark:3.5.3-iceberg
  • Per-Job Setting: When submitting a Spark job, specify the image by setting this parameter:

    spark.kubernetes.container.image: my-private-repo/spark:3.5.3-iceberg
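
Whichever of these options you use, a short PySpark job run on the new image can confirm that Iceberg actually works. The sketch below follows the standard Iceberg quickstart settings; the catalog name local and the /tmp warehouse path are placeholders, not Ilum defaults, and the same configuration keys can just as well be passed as Spark parameters in Ilum instead of in code.

# Minimal Iceberg smoke test, assuming a job running on the custom image built above.
# The catalog name "local" and the warehouse path are illustrative placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-smoke-test")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Create, populate, and read back a tiny Iceberg table.
spark.sql("CREATE TABLE IF NOT EXISTS local.db.smoke (id INT) USING iceberg")
spark.sql("INSERT INTO local.db.smoke VALUES (1)")
spark.sql("SELECT * FROM local.db.smoke").show()

If the runtime JAR is missing from the image, the first statement typically fails with an error about the Iceberg catalog class not being found.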

Best Practices

  • Keep all dependency versions aligned with the Spark version used.
  • Regularly update the custom image to include security patches and the latest dependency versions.
  • Store images in a reliable and accessible container registry.
  • Use a versioning scheme for your images (e.g., include Spark and feature versions in the tag).

Troubleshooting

  • Dependency mismatch: Ensure all JARs and Python packages are compatible with the Spark version in use.
  • Image not found: Verify the image name and that it was pushed to the correct registry (and that Ilum has access to that registry).
  • Job fails due to missing dependencies: Double-check that the Spark job is using the intended custom image (check the image configuration in Ilum or the spark-submit command); a quick runtime check is sketched below.
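
For the last point in particular, a quick way to see which image a running job actually received is to read the property back from the active Spark configuration, as in this minimal sketch (it assumes nothing beyond a running PySpark session):

# Print the container image the current Spark application was configured with.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
image = spark.sparkContext.getConf().get("spark.kubernetes.container.image", "not set")
print(f"spark.kubernetes.container.image = {image}")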

2. Using spark.jars.packages and pyRequirements (Good for Testing & PoC)

For rapid development and testing, you can add dependencies dynamically using Spark’s configuration. This approach fetches JARs and installs Python packages at startup time.

Adding Java JARs

Specify Maven coordinates for Java dependencies using the spark.jars.packages configuration. For example, to include the Apache Iceberg runtime JAR:

a) During the installation process: Add this flag to your helm install command:

  --set ilum-core.kubernetes.defaultCluster.config.spark\\.jars\\.packages="org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.0"

b) At runtime: Set the JARs in the cluster configuration, or during service/job creation under Configuration/Parameters in the Ilum UI.

  spark.jars.packages: org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.0

Spark will automatically download the specified package (and its dependencies) from Maven Central or the configured repository when the job starts.
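
If you build the SparkSession yourself (for example in a script launched directly with python rather than through spark-submit), the same property can also be set in code. The sketch below reuses the Iceberg coordinates from above; note that spark.jars.packages only takes effect when the builder is what launches the driver JVM, which is why the cluster- or job-level settings shown above are usually the more reliable place for it in Ilum.

# Resolve the Iceberg runtime from Maven at session start-up.
# Only effective when this builder is what launches the driver JVM.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("packages-example")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.0")
    .getOrCreate()
)

print(spark.conf.get("spark.jars.packages"))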

Installing Python Dependencies in Ilum

Ilum provides multiple ways to install Python dependencies for Spark jobs and Jupyter sessions. Depending on your use case, you can choose between:

  1. Using the Ilum UI for Spark Jobs
  2. Defining pyRequirements in Jupyter Sessions
  3. Setting Default Python Packages via Helm Configuration

1. Installing Python Packages via Ilum UI

Ilum makes it easy to add Python dependencies when creating Spark jobs directly from the UI.

[Screenshot: setting Python dependencies for a Spark job]

  • Navigate to the Add Job + section in the Ilum UI.
  • Locate the Requirements field under the Resources tab.
  • Enter the required Python dependencies.
  • Ilum will install these dependencies at runtime, before executing the Spark job (a quick in-job check is sketched below).
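
Inside the job code itself, a small check such as the sketch below can confirm that the requested packages were installed before your logic runs; pandas and numpy here are just example package names, not Ilum defaults.

# Verify that the packages requested in the Requirements field are present.
# "pandas" and "numpy" are example package names.
import importlib.metadata

for pkg in ("pandas", "numpy"):
    try:
        print(pkg, importlib.metadata.version(pkg))
    except importlib.metadata.PackageNotFoundError:
        print(pkg, "is NOT installed")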

2. Setting Python Dependencies in Jupyter Sessions (pyRequirements)

For Jupyter users, Ilum provides a way to specify required packages when starting a session:

[Screenshot: setting Spark job Python dependencies in Jupyter]

  • In the Jupyter Notebook session creation form (%manage_spark), locate the pyRequirements field.
  • Enter the packages as a semicolon-separated list:
    pandas;numpy;openai
  • When the Jupyter session starts, Ilum will automatically install these dependencies (a quick verification cell is sketched below).
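
Once the session is up, a simple %%spark cell can confirm that the packages are importable; the sketch below assumes the example list pandas;numpy;openai from above.

  %%spark

# Confirm that the pyRequirements packages (example list from above) are importable.
import pandas, numpy, openai

for mod in (pandas, numpy, openai):
    print(mod.__name__, getattr(mod, "__version__", "unknown"))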

Starting with version 6.3.1, the session form has a dedicated section for additional Python libraries.

[Screenshot: the new sparkmagic Spark session form]


3. Configuring Default Python Packages via Helm (For Persistent Settings)

To define default packages for all Jupyter Spark sessions, you can:

a) During the installation process: Add this flag to your helm install command:

  --set ilum-jupyter.sparkmagic.config.sessionConfigs.conf='{"pyRequirements":"pandas;numpy;openai"}'

b) After the installation process: Modify the ilum-jupyter-config ConfigMap to include the required packages:

data:
  config.json: |
    ...
    {
      "session_configs": {
        "conf": {
          "pyRequirements": "pandas;numpy;openai",
          "cluster": "default",
          "autoPause": "false",
          "spark.example.config": "You can change the default configuration in ilum-jupyter-config k8s configmap"
        },
        "driverMemory": "1000M",
        "executorCores": 2
      }
    }

After applying this configuration, every Jupyter session launched will automatically include these dependencies without requiring manual installation.

Each approach ensures your Spark jobs and Jupyter sessions have the necessary dependencies installed, so you can focus on data engineering and analysis instead of managing environments. 🚀

Best Practices

  • Use this method for testing or proof-of-concept jobs; avoid it for production due to the overhead of downloading dependencies on each run.
  • Always specify exact versions for packages to ensure reproducibility.
  • Combine this approach with custom Docker images for better consistency (e.g., use Docker for core dependencies and spark.jars.packages for a few transient ones if needed).
  • Be mindful of network access and performance, as downloading packages can slow down startup times.

Troubleshooting

  • JAR not found: Ensure the Maven coordinates (groupId, artifactId, version) are correct and the package exists in the repository.
  • Startup performance overhead: Remember that this method fetches dependencies at runtime. If startup is slow, or if you encounter an OOM error or a timeout, consider baking dependencies into a Docker image for faster startups.

3. Installing Libraries in Jupyter Notebooks with pip install

For quick interactive experiments, you can install libraries within a Jupyter notebook using pip. This is a fast way to test something in an ad-hoc manner, but it is not recommended for anything beyond temporary exploration.

Example

If you are running a Spark session in an Ilum Jupyter notebook and need a new Python package, you can install it like so:

  %%spark

import subprocess

result = subprocess.check_output(["pip", "install", "geopandas"])
print(result.decode())

This installs the package in the environment of the running Spark session, so you can use it immediately within that session.

To check which packages are available in the session, you can list them the same way:

  %%spark

import subprocess

result = subprocess.check_output(["pip", "list"])
print(result.decode())

Keep in mind:

  • Packages installed this way are only available in the current Spark session.
  • The environment does not persist across session restarts or new sessions.
  • It can lead to inconsistencies between your development environment and the production Spark runtime.

Best Practices

  • Use this approach only for quick, throwaway prototyping.
  • If you find yourself relying on a pip-installed library, add it to a requirements file or Docker image for permanence.
  • Document any packages you had to install in the notebook so you can update your environment properly later.

Troubleshooting

  • Package not found: Make sure you spelled the package name correctly and that it’s available on PyPI (the Python Package Index).
  • Module not found after installation: If you installed a package but cannot import it, try restarting the notebook kernel to reload the environment with the new package.

Final Recommendations

  • Production workloads: Use a custom Docker image with all dependencies pre-installed. This yields a stable and reproducible environment with faster startup times.
  • Testing or prototyping: Use spark.jars.packages and the pyRequirements setting for flexibility. This allows you to experiment quickly without building a new image, though it may incur startup overhead.
  • Interactive experiments: Installing via Jupyter notebooks is convenient for short-lived experiments, but always transition to a more robust solution (Docker image or requirements file) for anything that needs to be saved or run again.

By following these practices, you can efficiently manage Spark dependencies in Ilum while minimizing compatibility issues and runtime errors.