Overview
Ilum enables you to build complex, modern data architectures ten times faster, even with a limited budget. It integrates multiple tools for data organization, visualization, job orchestration, monitoring, and more. This section offers a brief overview of the tools Ilum provides and guides you through the documentation to learn more about them.
Ilum Jobs
Ilum Jobs and Ilum Groups are features designed to simplify the management and creation of all your Spark applications on the cluster.
These features achieve this by wrapping your Apache Spark applications, written in Python or Scala, into Ilum Applications. These applications communicate with the server, Ilum Core, using gRPC or Apache Kafka (depending on your configuration). In turn, the server interacts with the cluster API.
This setup offers several benefits:
- Effortless Job Management: Easily create, delete, clone, stop, and resume jobs with a single click from Ilum UI. The server automatically generates the appropriate spark-submit commands and launches the provided applications.
- Comprehensive Monitoring: Track Spark application logs, CPU and memory usage, and view the stages and task structures of your Spark jobs.
- Automated Configuration: Eliminate the need for manual setup, as Ilum Jobs integrates seamlessly with all the tools you enable.
- Session Reusability with Ilum Groups: Avoid repeatedly recreating Spark sessions. Ilum Groups allow multiple applications to share the same Spark session, improving efficiency and streamlining workflows.
- Orchestration: Use Schedule to launch your jobs on a recurring timetable, or use the Ilum Core API for more advanced job orchestration.
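To make the "automatically generates the appropriate spark-submit commands" point concrete, here is a purely illustrative Python sketch of how a server can assemble such a command from a job description. The function name, paths, and options are this example's own, not Ilum Core's actual internals.

```python
# Illustrative sketch of spark-submit command assembly.
# Everything here (names, paths, options) is an example, not Ilum internals.

def build_spark_submit(app_path, main_class=None, conf=None, args=None):
    """Assemble a spark-submit command line from a job description."""
    cmd = ["spark-submit"]
    if main_class:                       # Scala/Java apps need an entry class
        cmd += ["--class", main_class]
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app_path)                 # the packaged application itself
    cmd += list(args or [])              # arguments passed to the application
    return cmd

cmd = build_spark_submit(
    "s3://bucket/jobs/etl.py",
    conf={"spark.executor.memory": "2g"},
    args=["--date", "2024-01-01"],
)
print(" ".join(cmd))
```

In practice the server also injects the configuration for every tool you have enabled (lineage, metastore, monitoring), which is what makes the "Automated Configuration" point above possible.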
Additionally, with Ilum Code Groups, you can write and launch your code directly from a web interface, while still accessing all the integrated tools.
To learn more about Ilum Jobs, Ilum Groups, and Ilum Code Groups, visit the corresponding user guides.
If you are interested in job orchestration, you might want to explore Ilum Schedule, which enables you to create schedules for launching your Spark applications: Schedule
Additionally, for more advanced orchestration, you can refer to the Ilum Core API Reference and API pages.
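As a sketch of what programmatic orchestration against a REST API could look like, the snippet below builds a hypothetical job-submission request. The base URL, endpoint path, and payload fields are invented for illustration and are not the real Ilum Core API schema; consult the API Reference for the actual endpoints.

```python
import json

# Hypothetical job-trigger request. The base URL, path, and payload
# fields are illustrative placeholders, not the real Ilum Core schema.
ILUM_API = "http://ilum-core.example.com/api/v1"

def make_submit_request(group_id, job_class, params):
    """Build the URL and JSON body for a job-submission call."""
    url = f"{ILUM_API}/group/{group_id}/job/submit"
    body = json.dumps({"jobClass": job_class, "jobConfig": params})
    return url, body

url, body = make_submit_request(
    "etl-group", "com.example.DailyEtl", {"date": "2024-01-01"}
)
# An HTTP client (requests, urllib) would then POST `body` to `url`.
```

An external scheduler (Airflow, cron, CI) could call such an endpoint to chain jobs, which is the kind of orchestration the API enables beyond Schedule's fixed timetables.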
Notebooks
Jupyter and Zeppelin are advanced development environments widely used by data scientists today.
Ilum integrates both technologies, enabling you to execute Jupyter code through Ilum Jobs and Ilum Groups using Spark Magic.
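Spark Magic (the sparkmagic package) reads the endpoint it talks to from `~/.sparkmagic/config.json`. A minimal sketch is shown below; the URL is a placeholder, since the actual Livy-compatible endpoint depends on your Ilum deployment.

```json
{
  "kernel_python_credentials": {
    "username": "",
    "password": "",
    "url": "http://localhost:8998",
    "auth": "None"
  }
}
```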
To learn more about using notebooks in Ilum and their integration, visit the Notebooks Documentation page.
Centralized Storage and Cluster Management
As your architecture grows, you may encounter situations where multiple storage solutions and clusters need to be integrated. In such cases, you might face repetitive tasks, such as setting up all the tools again on new clusters or linking individual storage systems to your Spark applications. Additionally, managing multiple clusters can become challenging, as it requires sharing access certificates with each user for every cluster.
Ilum addresses these challenges by enabling you to manage all your clusters from a single, centralized control plane. All you need to do is grant Ilum access to the cluster once and establish networking between Ilum Tools and the remote cluster.
To learn more about Ilum’s capabilities for cluster and storage management, visit Clusters and Storages.
For step-by-step guidance on adding storage solutions and clusters to Ilum, refer to these user guides:
Data organization
Storing data in plain CSV files can lead to significant risks and inefficiencies. To address these challenges, Ilum simplifies the integration of advanced data formats like Delta, Hudi, and Iceberg, which offer:
- ACID compliance for reliable transaction management.
- Support for update, delete, and merge operations.
- Schema evolution, allowing for seamless table alterations.
- Versioning, enabling time travel to access previous versions of datasets.
- Enhanced optimization compared to traditional file formats.
- Integration of these features for both streaming and batch processing.
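The update/merge and versioning capabilities listed above are typically exercised through SQL. The statements below use Delta-style syntax with illustrative table and column names; Iceberg and Hudi offer the same operations with slightly different spellings.

```sql
-- ACID upsert: apply a batch of changes to a target table in one transaction.
MERGE INTO events AS t
USING daily_updates AS s
  ON t.event_id = s.event_id
WHEN MATCHED THEN UPDATE SET t.payload = s.payload
WHEN NOT MATCHED THEN INSERT (event_id, payload) VALUES (s.event_id, s.payload);

-- Time travel: read the table as it looked at an earlier version.
SELECT * FROM events VERSION AS OF 42;
```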
Additionally, Ilum introduces its own data format, Ilum Tables, which allows you to leverage the capabilities of Delta, Hudi, and Iceberg using the same code. To learn more about Ilum Tables, visit the documentation page.
Ilum integrates seamlessly with Hive Metastore, a crucial component for organizing and managing metadata in modern data infrastructure. By using Hive Metastore, you can structure your raw and processed data into SQL-like tables backed by long-term storage, making it easier to query and interact with datasets through Spark SQL. To learn more about how Hive Metastore is integrated and used, visit the Table Explorer page.
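As a sketch, registering processed output as a metastore-backed table makes it queryable by name from any Spark SQL client; all table and column names here are illustrative.

```sql
-- Persist an aggregate as a managed table registered in Hive Metastore.
CREATE TABLE analytics.daily_revenue
USING parquet
AS SELECT order_date, SUM(amount) AS revenue
   FROM raw.orders
   GROUP BY order_date;

-- Any Spark SQL client can now query it by name.
SELECT * FROM analytics.daily_revenue ORDER BY order_date DESC LIMIT 10;
```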
Data visualization and sampling
Ilum supplies you with multiple tools to visualize your data.
To track how your Ilum Jobs interact with datasets, you can use Lineage for a visual representation of data flows and transformations. Data lineage provides a clear, graphical view of the entire data lifecycle, from ingestion to processing and storage, allowing you to trace how data moves and is transformed throughout your system. This helps with debugging, optimizing workflows, and ensuring data integrity. To learn more about data lineage and how to leverage it in Ilum, visit the Lineage page.
For data organized with Hive Metastore, you can run SQL queries directly within the Ilum UI using Ilum SQL to easily retrieve small portions of that data. This provides a streamlined way to interact with your datasets without the need for complex setup or coding. To learn more about Ilum SQL, read this page.
Additionally, using Table Explorer, you can explore all the tables in your environment. Table Explorer provides an advanced data exploration tool, allowing you to visualize your data, build charts, and apply mathematical functions for in-depth analysis. This intuitive interface helps you quickly gain insights from your data, making it easier to perform tasks like aggregations, filtering, and transformations directly within the UI. To learn more about Table Explorer, read this documentation page.
Data monitoring
As your architecture grows, it becomes increasingly difficult to monitor each individual Spark application separately. This is where the need for a centralized monitoring tool becomes crucial.
Ilum addresses this by launching the Ilum History Server, which provides comprehensive monitoring of all Spark job details, including CPU and memory usage, stage progression, job and task schemas, timing, and more. This allows you to track the performance of your Spark applications at a granular level.
All this information is displayed centrally in the Ilum UI, making it easy to monitor and analyze your applications from a single dashboard. This centralized monitoring approach is also fully supported in a multi-cluster architecture, enabling you to track and manage Spark jobs across multiple clusters seamlessly.
If you require more advanced monitoring, Ilum provides a preconfigured Kube Prometheus stack, which includes Prometheus and Grafana. This stack is designed to monitor all Ilum Jobs metrics, giving you deep insights into your Spark jobs and their performance. Ilum also supports Graphite, a tool similar to Prometheus but with a push-based architecture, which makes it better suited to multi-cluster environments and offers another layer of flexibility in large-scale deployments.
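As one illustration, the Kube Prometheus stack scrapes standard container metrics, so a PromQL query along these lines could chart the CPU usage of a job's pods; the pod-name pattern is an assumption about how your pods happen to be named.

```promql
# Per-pod CPU usage rate over the last 5 minutes for one Spark job
# (the pod-name regex is illustrative).
rate(container_cpu_usage_seconds_total{pod=~"my-spark-job-.*"}[5m])
```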
For log aggregation, Ilum also integrates Loki with Promtail, allowing you to gather and query logs from your Ilum Jobs in a way that suits your needs. This enables efficient log management and troubleshooting across your entire infrastructure.
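With Loki in place, logs can be filtered with LogQL queries such as the following; the label names are assumptions, since the actual labels depend on how Promtail is configured in your deployment.

```logql
# Error lines from one job's driver and executor pods (labels illustrative).
{namespace="ilum", pod=~"my-spark-job-.*"} |= "ERROR"
```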
To learn more about advanced monitoring in Ilum, visit the documentation page.
Use Cases
To see how you can use Ilum to solve real-world problems, visit the Use Cases section of the documentation.
Deployment and Security
Thanks to our Helm charts, each feature can be enabled with a simple Helm command. To learn how to enable and disable features, visit the Production page.
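Feature toggles are expressed as Helm values. A hypothetical override file might look like the following, but the key names below are placeholders, not the Ilum chart's actual schema; check the Production page for the real flag names.

```yaml
# Illustrative values override -- key names are placeholders,
# not the Ilum chart's actual schema.
jupyter:
  enabled: true
graphite:
  enabled: false
```

Such a file would then be applied with something like `helm upgrade ilum <chart> -f values.yaml` (chart reference illustrative).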
For managing Ilum security, refer to the Security documentation page.
If you are upgrading Ilum, make sure to check the Upgrade Notes and Migration pages.