Run Apache Spark Jobs via Ilum UI
Running an Apache Spark job on Kubernetes with Ilum operates just like one submitted via spark-submit, but with additional enhancements for ease of use, configuration, and integration with external tools.
You can use the Spark examples JAR from one of these builds:

- Spark 4 / Scala 2.13: `spark-examples_2.13-4.1.1.jar` (default)
- Spark 3 / Scala 2.12: `spark-examples_2.12-3.5.7.jar`
Interactive Spark Job Submission Guide
Here's a step-by-step guide to setting up a simple Spark job using Ilum. This guide will walk you through configuring, executing, and monitoring a basic job named MiniReadWriteTest within the Ilum platform.
Step-by-Step Tutorial: Running Your First Spark Job
1. Navigate to the Jobs Section: This area allows you to manage all your data processing tasks.

2. Create a New Job: Click the 'New Job +' button to start the setup process.

3. Fill Out Job Details:
   - General Tab:
     - Name: `MiniReadWriteTest`
     - Job Type: `Spark Job`
     - Class: `org.apache.spark.examples.MiniReadWriteTest`
     - Language: `Scala`
   - Configuration Tab:
     - Arguments: `/opt/spark/examples/src/main/resources/kv1.txt`
       This path specifies a local file to be distributed to the executors; it is a test file available in every Spark environment.
   - Resources Tab:
     - Jars: Upload the examples JAR that matches your Spark version:
       - Spark 4 / Scala 2.13: `spark-examples_2.13-4.1.1.jar` (default)
       - Spark 3 / Scala 2.12: `spark-examples_2.12-3.5.7.jar`
   - Memory Tab:
     - Leave all settings at their default values for this example.

4. Submit and Monitor the Job:
   - Submit the job.
   - Navigate to the logs section to review logs from each executor.
   - You should see log output showing the job execution, including:
     - Spark initialization messages (`SparkContext: Running Spark version 3.5.7`)
     - File reading and word count operations (`Performing local word count from /opt/spark/examples/src/main/resources/kv1.txt`)
     - Task execution across executors (`Starting task 0.0 in stage 0.0`)
     - The final success message (`Success! Local Word Count 500 and D Word Count 500 agree.`)

5. Review Job Execution:
   - Once the job has started, check its status in the job overview section.
   - Monitor memory usage and other performance metrics in the executors section.
   - Observe the progress of your job through each stage on the timeline.

6. Completion and Review:
   - Upon completion, the job details and results are logged to the Spark history server.
   - Visit the history server section to see your completed job and review its detailed execution stages.

7. Final Step: Congratulations! You have successfully set up and run your MiniReadWriteTest job in Ilum. For further information or support, contact [email protected].
To submit jobs programmatically instead of using the UI, see the Run Spark Job via REST API guide.
By following these steps, you'll be able to efficiently set up, run, and monitor a basic Spark job within the Ilum platform, gaining familiarity with its functionalities and preparing you for more complex data processing tasks.
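For comparison, the same example job could be launched from the command line with a plain `spark-submit`. The sketch below is illustrative, not something Ilum requires: the Kubernetes master URL, namespace, container image, and in-image JAR path are placeholders you would have to adapt to your cluster.

```shell
# Hypothetical spark-submit equivalent of the UI job configured above.
# Master URL, namespace, image, and JAR path are placeholders.
spark-submit \
  --master k8s://https://<kubernetes-api-host>:6443 \
  --deploy-mode cluster \
  --name mini-read-write-test \
  --class org.apache.spark.examples.MiniReadWriteTest \
  --conf spark.kubernetes.namespace=<namespace> \
  --conf spark.kubernetes.container.image=<spark-image> \
  local:///opt/spark/examples/jars/spark-examples_2.13-4.1.1.jar \
  /opt/spark/examples/src/main/resources/kv1.txt
```

Everything Ilum collected through the form (class, jar, argument, cluster settings) appears here as flags, which is exactly the bookkeeping the UI takes off your hands.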
Here's a consolidated explanation of how Ilum facilitates Spark job submissions, blending the traditional features of spark-submit with Ilum's advanced management capabilities:
Loading example job
Ilum provides an example job to help new users get started quickly.
Example job loading is enabled by default. However, you can disable it with `--set ilum-core.examples.job=false`.
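Since `--set` is a Helm flag, this toggle is applied when installing or upgrading the chart. A minimal sketch, assuming the common release name `ilum` and chart reference `ilum/ilum` (adjust both to your setup):

```shell
# Disable loading of the example job at install/upgrade time.
# "ilum" (release name) and "ilum/ilum" (chart) are assumed defaults.
helm upgrade --install ilum ilum/ilum \
  --set ilum-core.examples.job=false
```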
Why Ilum is a Better Alternative to spark-submit
- Universal Compatibility: Ilum enables the submission of any Spark job, akin to using `spark-submit`. It supports the programming languages commonly used with Spark, including Scala, Python, and R, and covers all typical Spark workloads such as batch processing, streaming jobs, and interactive queries.
- Simplified Command Execution: While `spark-submit` often involves complex command-line inputs for library dependencies, job parameters, and cluster configurations, Ilum abstracts these into an intuitive user interface. This minimizes the risk of errors and simplifies operations, which is especially beneficial for those less familiar with command-line intricacies.
- Direct Code Deployment: Users can upload their JAR files, Python scripts, or notebooks directly into Ilum, similar to specifying resources in a `spark-submit` command. Ilum goes further by allowing these resources to be configured for scheduled or event-triggered execution, providing greater operational flexibility.
- Automated Environment Handling: Unlike the manual setup required with `spark-submit`, Ilum manages all dependencies and configurations automatically, guaranteeing that the execution environment is consistently prepared, whether on local clusters, in the cloud, or in hybrid setups.
- Integrated Monitoring and Tooling: Ilum comes with built-in integration for monitoring and logging tools, which in a `spark-submit` workflow would require additional setup. This gives users ready-to-use solutions for tracking job performance, managing logs, and connecting with other data services.
Enhanced Job Submission Experience
Ilum not only matches the capabilities of spark-submit but extends them by reducing the overhead associated with job configuration and environmental setup. It offers an all-encompassing platform that simplifies the deployment, management, and scaling of Spark jobs, making it an ideal solution for organizations aiming to enhance their data processing workflows without compromising the power and flexibility of Apache Spark.
Job Configuration Reference
General

| Parameter | Description |
|---|---|
| Name | A unique identifier for the job. This name is used in the dashboard and logs to track the job's execution and history. |
| Job Type | The category of the job to be created. Select `Spark Job` for standard batch processing or `Spark Connect Job` for client-server Spark applications. |
| Cluster | The target cluster where the job will be executed. Choose a cluster that has the necessary resources and data access for your job. |
| Class | The fully qualified class name of the application (e.g., `org.apache.spark.examples.SparkPi`) or the filename for Python scripts. This tells Spark which code to execute as the entry point. |
| Language | The programming language used for the job. Select `Scala` or `Python` to match your application code. |
| Max Retries | The maximum number of times Ilum will attempt to restart the job if it fails. Setting this helps ensure job completion in case of transient errors. |
Configuration

| Parameter | Description |
|---|---|
| Parameters | Key-value pairs for configuring Spark properties (e.g., `spark.executor.instances`). These settings allow you to fine-tune the Spark environment for this specific job. |
| Arguments | Command-line arguments passed directly to the job's main method. Use these to provide dynamic inputs or configuration flags to your application logic. |
| Tags | Custom labels used to categorize and filter jobs in the UI. Tags are helpful for organizing jobs by project, team, or purpose (e.g., `production`, `etl`). |
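For readers used to `spark-submit`, these two fields map directly onto its command line: Parameters become `--conf key=value` flags and Arguments are appended after the application JAR. A sketch with illustrative values:

```shell
# Parameters (Spark properties) map to --conf flags;
# Arguments follow the application JAR, in order.
spark-submit \
  --class org.apache.spark.examples.MiniReadWriteTest \
  --conf spark.executor.instances=2 \
  spark-examples_2.13-4.1.1.jar \
  /opt/spark/examples/src/main/resources/kv1.txt
```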
Resources

| Parameter | Description |
|---|---|
| Jars | Additional JAR files to be included in the classpath of the driver and executors. These are necessary if your job relies on external libraries not present in the base image. |
| Files | Auxiliary files to be placed in the working directory of each executor. These are often used for configuration files or small datasets required by the job. |
| PyFiles | Python dependencies such as `.zip`, `.egg`, or `.py` files. These are added to the `PYTHONPATH` to ensure Python jobs have access to required modules. |
| Requirements | A list of additional Python packages to install on the nodes before execution. This ensures the runtime environment matches your development environment. |
| Spark Packages | Maven coordinates for Spark JAR packages to be downloaded and included. This is a convenient way to include libraries from Maven Central without manually uploading JARs. |
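These fields correspond to standard `spark-submit` resource flags. A rough sketch of the mapping, where the file names and the Maven coordinate are illustrative only:

```shell
# Jars -> --jars, Files -> --files, PyFiles -> --py-files,
# Spark Packages -> --packages. All names below are illustrative.
spark-submit \
  --jars extra-lib.jar \
  --files app.conf \
  --py-files deps.zip \
  --packages org.apache.spark:spark-avro_2.13:4.1.1 \
  my_job.py
```

The Requirements field has no single `spark-submit` flag; it reflects Ilum preparing the Python environment on the nodes for you.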
Memory

| Parameter | Description |
|---|---|
| Executors | The number of executor instances to launch for this job. Increasing this number allows for greater parallelism and faster processing of large datasets. |
| Driver Cores | The number of CPU cores allocated to the driver process. More cores can help the driver manage task scheduling and result collection more efficiently. |
| Executor Cores | The number of CPU cores allocated to each executor. This determines the number of concurrent tasks each executor can handle. |
| Driver Memory | The amount of RAM allocated to the driver (e.g., `2g`). Sufficient memory is required for the driver to maintain application state and handle large results. |
| Executor Memory | The amount of RAM allocated to each executor (e.g., `4g`). This directly affects how much data can be cached and processed in memory on each node. |
| Dynamic Allocation | Enables automatic scaling of the number of executors based on the current workload. This helps optimize resource usage by requesting more executors when needed and releasing them when idle. |
| Initial Executors | The initial number of executors to start with when dynamic allocation is enabled. This provides a baseline capacity when the job starts. |
| Minimal number of executors | The lower bound for the number of executors when dynamic allocation is enabled. The job will never scale below this number. |
| Maximal number of executors | The upper bound for the number of executors when dynamic allocation is enabled. This prevents the job from consuming too many cluster resources. |
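Under the hood these fields correspond to standard Spark properties. A sketch in `spark-defaults.conf` form, with example values only (tune them for your workload):

```properties
# Illustrative values only; tune for your workload.
spark.executor.instances                 2
spark.driver.cores                       1
spark.executor.cores                     2
spark.driver.memory                      2g
spark.executor.memory                    4g
# Dynamic allocation and its bounds:
spark.dynamicAllocation.enabled          true
spark.dynamicAllocation.initialExecutors 2
spark.dynamicAllocation.minExecutors     1
spark.dynamicAllocation.maxExecutors     8
```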