Handling Spark Dependencies in Ilum
Ilum provides three methods to handle dependencies, each suited for different use cases.
1. Dedicated Docker Image (Recommended for Production)
This method involves creating a custom Docker image that includes all required dependencies. It ensures consistency across environments and is the best approach for production workloads.
Steps to Create a Custom Spark Image
- Start with the official Ilum Spark base image.
- Add necessary JARs for any Java-based dependencies.
- Install required Python packages.
- Build and push the image to a private or public registry.
- Configure Ilum to use this new image.
Example: Adding Apache Iceberg Support
Below is an example Dockerfile that builds on the Ilum Spark base image and adds support for Apache Iceberg:
FROM ilum/spark:3.5.3
USER root
# Add JARs for Iceberg support
ADD --chmod=644 https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.8.0/iceberg-spark-runtime-3.5_2.12-1.8.0.jar $SPARK_HOME/jars
# Install Python dependencies
RUN python3 -m pip install pandas pyiceberg[hive,s3fs,pandas,snappy,gcsfs,adlfs]
USER ${spark_uid}
Build and Push the Image
After writing the Dockerfile (for example, saved as Dockerfile in the current directory), build and push the image:
docker build -t myPrivateRepo/spark:3.5.3-iceberg .
docker push myPrivateRepo/spark:3.5.3-iceberg
Configuring Ilum to Use the Custom Image
Once the image is available in a container registry, update Ilum to use this custom Spark image:
- Edit cluster settings: set the Spark version (image) for the cluster in the Ilum UI.
- Set up the default cluster:
  a) During the installation process: add this flag to your helm install command:
  --set ilum-core.kubernetes.defaultCluster.config.spark\\.kubernetes\\.container\\.image="myPrivateRepo/spark:3.5.3-iceberg"
  b) On a preinstalled instance: in the cluster configuration in the UI, set the default Spark image:
  spark.kubernetes.container.image: myPrivateRepo/spark:3.5.3-iceberg
- Per-job setting: when submitting a Spark job, specify the image by setting this parameter (a smoke test for the resulting setup is sketched after this list):
  spark.kubernetes.container.image: myPrivateRepo/spark:3.5.3-iceberg
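Once Ilum is pointed at the custom image, you can verify that the baked-in Iceberg runtime actually works with a small smoke test. The sketch below assumes you create the SparkSession yourself (for example in a standalone PySpark script running on the custom image); the catalog name local, its hadoop type, and the warehouse path are illustrative placeholders rather than Ilum defaults:

from pyspark.sql import SparkSession

# Minimal Iceberg smoke test. The catalog name ("local"), its type, and the
# warehouse path are illustrative; adjust them to your environment.
spark = (
    SparkSession.builder
    .appName("iceberg-smoke-test")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

spark.sql("CREATE TABLE IF NOT EXISTS local.db.demo (id INT) USING iceberg")
spark.sql("INSERT INTO local.db.demo VALUES (1)")
spark.sql("SELECT * FROM local.db.demo").show()

If the Iceberg JAR is not actually present in the image, the CREATE TABLE statement will typically fail because Spark cannot find the iceberg data source, which is a quick signal that the custom image was not picked up.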
Best Practices
- Keep all dependency versions aligned with the Spark version used.
- Regularly update the custom image to include security patches and the latest dependency versions.
- Store images in a reliable and accessible container registry.
- Use a versioning scheme for your images (e.g., include Spark and feature versions in the tag).
Troubleshooting
Issue | Solution |
---|---|
Dependency mismatch | Ensure all JARs and Python packages are compatible with the Spark version in use. |
Image not found | Verify the image name and that it was pushed to the correct registry (and that Ilum has access to that registry). |
Job fails due to missing dependencies | Double-check that the Spark job is using the intended custom image (check the image configuration in Ilum or the spark-submit command). |
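For the last row in particular, a quick sanity check is to read the image setting back from a running application, assuming an active SparkSession is available (for example as the spark variable in an Ilum Jupyter session):

# Print the container image the current Spark application was configured with;
# the second argument is a fallback if the key was never set.
print(spark.conf.get("spark.kubernetes.container.image", "not set"))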
2. Using spark.jars.packages and pyRequirements (Good for Testing & PoC)
For rapid development and testing, you can add dependencies dynamically using Spark’s configuration. This approach fetches JARs and installs Python packages at startup time.
Adding Java JARs
Specify Maven coordinates for Java dependencies using the spark.jars.packages configuration. For example, to include the Apache Iceberg JAR:
a) During the installation process: Add this flag to your helm install command:
--set ilum-core.kubernetes.defaultCluster.config.spark\\.jars\\.packages="org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.0"
b) At runtime: set the JARs in the cluster configuration, or during service/job creation under Configuration/Parameters in the UI.
spark.jars.packages: org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.0
Spark will automatically download the specified package (and its dependencies) from Maven Central or the configured repository when the job starts.
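The same setting can also be supplied when the SparkSession is created, which is handy for local tests outside Ilum. A minimal sketch (the application name is arbitrary; the configuration must be in place before the session exists, since packages are resolved at startup):

from pyspark.sql import SparkSession

# spark.jars.packages is resolved from Maven Central (or the configured
# repository) when the session starts, so set it before getOrCreate().
spark = (
    SparkSession.builder
    .appName("iceberg-packages-test")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.0")
    .getOrCreate()
)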
Installing Python Dependencies in Ilum
Ilum provides multiple ways to install Python dependencies for Spark jobs and Jupyter sessions. Depending on your use case, you can choose between:
- Using the Ilum UI for Spark Jobs
- Defining pyRequirements in Jupyter Sessions
- Setting Default Python Packages via Helm Configuration
1. Installing Python Packages via Ilum UI
Ilum makes it easy to add Python dependencies when creating Spark jobs directly from the UI.
- Navigate to the Add Job + section in the Ilum UI.
- Locate the Requirements field under the Resources tab.
- Enter the required Python dependencies.
- Ilum will install these dependencies at runtime before executing the Spark job.
2. Setting Python Dependencies in Jupyter Sessions (pyRequirements)
For Jupyter users, Ilum provides a way to specify required packages when starting a session:
- In the Jupyter Notebook session creation form (%manage_spark), locate the pyRequirements field.
- Enter the packages as a semicolon-separated list:
pandas;numpy;openai
- When the Jupyter session starts, Ilum will automatically install these dependencies.
Starting with version 6.3.1, there is a dedicated section for additional Python libraries.
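Once the session is running, you can confirm that the packages landed in the Spark session's environment by importing them from a %%spark cell, for example:

%%spark
# The packages requested via pyRequirements above should now be importable.
import pandas, numpy, openai
print(pandas.__version__, numpy.__version__, openai.__version__)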
3. Configuring Default Python Packages via Helm (For Persistent Settings)
To define default packages for all Jupyter Spark sessions, you can:
a) During the installation process: Add this flag to your helm install command:
--set ilum-jupyter.sparkmagic.config.sessionConfigs.conf='{"pyRequirements":"pandas;numpy;openai"}'
b) After the installation process: Modify the ilum-jupyter-config ConfigMap to include the required packages:
data:
  config.json: |
    ...
    {
      "session_configs": {
        "conf": {
          "pyRequirements": "pandas;numpy;openai",
          "cluster": "default",
          "autoPause": "false",
          "spark.example.config": "You can change the default configuration in ilum-jupyter-config k8s configmap"
        },
        "driverMemory": "1000M",
        "executorCores": 2
      }
    }
After applying this configuration, every Jupyter session launched will automatically include these dependencies without requiring manual installation.
Each approach ensures your Spark jobs and Jupyter sessions have the necessary dependencies installed, so you can focus on data engineering and analysis instead of managing environments. 🚀
Best Practices
- Use this method for testing or proof-of-concept jobs; avoid it for production due to the overhead of downloading dependencies on each run.
- Always specify exact versions for packages to ensure reproducibility.
- Combine this approach with custom Docker images for better consistency (e.g., use Docker for core dependencies and spark.jars.packages for a few transient ones if needed).
- Be mindful of network access and performance, as downloading packages can slow down startup times.
Troubleshooting
Issue | Solution |
---|---|
JAR not found | Ensure the Maven coordinates (groupId, artifactId, version) are correct and the package exists in the repository. |
Startup Performance overhead | Remember that this method fetches dependencies at runtime. If the startup is slow, or if you encounter an OOM error or a timeout, consider baking dependencies into a Docker image for faster startups. |
3. Installing Libraries in Jupyter Notebooks with pip install
For quick interactive experiments, you can install libraries within a Jupyter notebook using pip. This is a fast way to test something in an ad-hoc manner, but it is not recommended for anything beyond temporary exploration.
Example
If you are running a Spark session in an Ilum Jupyter notebook and need a new Python package, you can install it like so:
%%spark
# The %%spark magic runs this cell inside the notebook's Spark session,
# so pip installs the package into that session's Python environment.
import subprocess
result = subprocess.check_output(["pip", "install", "geopandas"])
print(result.decode())
This will install the package in the notebook’s environment so you can use it immediately.
%%spark
import subprocess
result = subprocess.check_output(["pip", "list"])
print(result.decode())
Why It’s Not Recommended
- Packages installed this way are only available in the current Spark session.
- The environment does not persist across session restarts or new sessions.
- It can lead to inconsistencies between your development environment and the production Spark runtime.
Best Practices
- Use this approach only for quick, throwaway prototyping.
- If you find yourself relying on a pip-installed library, add it to a requirements file or Docker image for permanence.
- Document any packages you had to install in the notebook so you can update your environment properly later.
Troubleshooting
Issue | Solution |
---|---|
Package not found | Make sure you spelled the package name correctly and that it’s available on PyPI (the Python Package Index). |
Module not found after installation | If you installed a package but cannot import it, try restarting the notebook kernel to reload the environment with the new package. |
Final Recommendations
- Production workloads: Use a custom Docker image with all dependencies pre-installed. This yields a stable and reproducible environment with faster startup times.
- Testing or prototyping: Use spark.jars.packages and pyRequirements for flexibility. This allows you to experiment quickly without building a new image, though it may incur startup overhead.
- Interactive experiments: Installing via Jupyter notebooks is convenient for short-lived experiments, but always transition to a more robust solution (Docker image or requirements file) for anything that needs to be saved or run again.
By following these practices, you can efficiently manage Spark dependencies in Ilum while minimizing compatibility issues and runtime errors.