Production

For production environments, it's recommended to deploy all dependencies in separate namespaces.

Kubernetes Prerequisites

Ilum has been extensively tested across all leading Kubernetes environments, ensuring compatibility with a variety of deployment scenarios. This includes lightweight Kubernetes distributions such as k3s and Rancher, as well as bare-metal Kubernetes clusters. Additionally, Ilum is fully compatible with major managed Kubernetes services in the cloud, including Google Kubernetes Engine (GKE), Amazon Elastic Kubernetes Service (EKS), and Azure Kubernetes Service (AKS).

Air-gapped (offline) environments

Guide is here

Minikube for Testing

Throughout our documentation, we use Minikube for demonstration and testing purposes. Minikube provides an easy-to-set-up environment that allows users to quickly try out Ilum's features on a local machine. However, it is important to note that Minikube is not suitable for production deployments due to its limitations in scalability, resource management, and high availability.

For production use, we strongly recommend deploying Ilum on a robust Kubernetes setup that aligns with your infrastructure needs, ensuring optimal performance and reliability.

Prerequisites

The table below provides necessary prerequisites and related instructions.

Prerequisite	Instruction
MongoDB	Refer to `https://bitnami.com/stack/mongodb/helm`
Kafka	Refer to `https://bitnami.com/stack/kafka/helm`
Object storage	Refer to `https://min.io/docs/minio/kubernetes/upstream/operations/installation.html`

ilum-core

helm install ilum-core --create-namespace -n ilum \
    --set mongo.instances=<mongo uri> \
    --set kafka.address=<kafka broker address> \
    --set s3a.host=<s3 host> \
    --set s3a.port=<s3 port> \
    ilum/ilum-core
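Helm's --set parser treats a comma as the separator between assignments, so a value that itself contains commas (for example, an address listing several hosts) needs each comma escaped as \, to be read as one string. The sketch below shows one way to do that. The helper name helm_escape and the host names are made up for illustration, and whether your mongo.instances value contains commas depends on your MongoDB setup.

```shell
#!/bin/sh
# Hypothetical helper: escape commas so that Helm's --set parser
# keeps the value as one string instead of splitting it.
helm_escape() {
  printf '%s' "$1" | sed 's/,/\\,/g'
}

# Example value with two made-up hosts; replace with your own address.
MONGO_INSTANCES='mongo-0.mongo:27017,mongo-1.mongo:27017'

# Print the flag as it would be passed to helm install.
printf '%s\n' "--set mongo.instances=$(helm_escape "$MONGO_INSTANCES")"
```

The printed flag can be pasted into the installation command in place of the plain --set mongo.instances=<mongo uri> argument.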

ilum-ui

helm install ilum-ui --create-namespace -n ilum ilum/ilum-ui

MongoDB

Ilum employs MongoDB as its storage layer, preserving all data required between restarts within the MongoDB database. Ilum automatically creates all necessary databases and collections during the startup process.

Apache Kafka

Apache Kafka serves as Ilum's communication layer, facilitating interaction between Ilum-Core and Spark jobs, as well as between different Ilum-Core instances when scaled. It is critical to ensure Apache Kafka brokers are accessible by both Ilum-Core and Spark jobs, especially when Spark jobs are launched on a different Kubernetes cluster.

Ilum utilizes Kafka to carry out communication using several topics, all created during Ilum's startup. Therefore, users don't need to manage these topics manually.

MinIO

Ilum uses MinIO as the storage layer for Spark application components. All files (including jars, configurations, data files) needed for the operation of Spark components (driver, executors) are stored and made available for download via MinIO.

MinIO implements the S3 interface, which also enables it to store input/output data.

Security keys

This application uses JSON Web Tokens (JWT) for authentication purposes. By default, the application employs an RSA key pair, which is randomly generated at runtime, to sign these tokens.

In its standard configuration, the application creates a fresh RSA key pair each time it starts. This approach simplifies local development and testing by automatically handling the key generation process. However, it must be emphasized that this approach is not suitable for a production environment.

The primary issue with using randomly generated keys in a production environment is the lack of persistence. Each time the application restarts, it generates a new RSA key pair, invalidating all previously issued tokens. This could lead to an abrupt and unanticipated logout for all users, disrupting user experience and potentially leading to data loss.

Generate private key

For a production environment, a stable and secure key pair should be manually generated and used consistently. This ensures that tokens remain valid across multiple application restarts, thus providing a consistent user experience.

You can generate an RSA key pair manually using tools like OpenSSL. A common command to generate a 2048-bit RSA private key is as follows:

openssl genpkey -algorithm RSA \
    -pkeyopt rsa_keygen_bits:2048 \
    -pkeyopt rsa_keygen_pubexp:65537 | \
  openssl pkcs8 -topk8 -nocrypt -outform pem > private-key.p8

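The private key file is PEM-encoded: a base64 body enclosed between a -----BEGIN PRIVATE KEY----- header line and an -----END PRIVATE KEY----- footer line.

To use the private key as the value of the security.jwt.privateKey setting, remove the header and footer lines from the key.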

Generate public key

To generate the corresponding public key, use:

openssl pkey -pubout -inform pem -outform pem -in private-key.p8 -out public-key.spki

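The public key file is PEM-encoded: a base64 body enclosed between a -----BEGIN PUBLIC KEY----- header line and an -----END PUBLIC KEY----- footer line.

To use the public key as the value of the security.jwt.publicKey setting, remove the header and footer lines from the key.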
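Removing the header and footer by hand is error-prone, so it can be scripted. The following is a minimal sketch, assuming the key files are in PEM format as produced by the OpenSSL commands above. The helper name pem_body is hypothetical, and joining the base64 body into a single line is an assumption about what the setting accepts.

```shell
#!/bin/sh
# Hypothetical helper: print the base64 body of a PEM file on one line,
# dropping the "-----BEGIN ...-----" and "-----END ...-----" lines.
pem_body() {
  grep -v '^-----' "$1" | tr -d '\r\n'
}

# Use the files generated above, if they exist in the current directory.
if [ -f private-key.p8 ] && [ -f public-key.spki ]; then
  PRIVATE_KEY="$(pem_body private-key.p8)"
  PUBLIC_KEY="$(pem_body public-key.spki)"
  echo "Extracted ${#PRIVATE_KEY} private-key and ${#PUBLIC_KEY} public-key characters"
fi
```

The resulting variables can then be supplied as the values of security.jwt.privateKey and security.jwt.publicKey. Avoid echoing the private key itself into logs or shell history.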

Modules

Ilum provides several modules that are integrated and preconfigured and will be useful in your data infrastructure.

Ilum-Livy-proxy

Ilum-Livy-proxy is our implementation of the Livy API. It integrates Spark code with Ilum Groups in services such as Jupyter, Zeppelin, and Airflow.

Ilum-Livy-proxy is enabled in Ilum by default.

To enable or disable Ilum-Livy-proxy, use the ilum-livy-proxy.enabled Helm value. For example, --set ilum-livy-proxy.enabled=false disables it.

Jupyter

Jupyter is a notebook: a development environment that lets you combine code, charts, explanations, and more in one executable document.

Jupyter is enabled in Ilum by default.

To control whether it is enabled, use the ilum-jupyter.enabled Helm value. For example, add --set ilum-jupyter.enabled=false to your installation command to disable it.

Note that Jupyter uses Ilum-Livy-proxy to integrate with Ilum Groups, so Ilum-Livy-proxy must be enabled as well: --set ilum-livy-proxy.enabled=true

If you want to access the Jupyter UI, you can do it by:

using Ilum UI: go to Modules > Jupyter
configuring an ingress
using the port-forward command kubectl port-forward svc/ilum-jupyter 8888:8888

Apache Zeppelin

Zeppelin is a notebook: a development environment that lets you combine code, charts, explanations, and more in one executable document.

Note that Zeppelin is not bundled in the Ilum package by default. To run this service, add --set ilum-zeppelin.enabled=true to your installation command.

Zeppelin uses Ilum-Livy-proxy to integrate with Ilum Groups, so Ilum-Livy-proxy must be enabled as well: --set ilum-livy-proxy.enabled=true

To access the Zeppelin UI, configure an ingress or use the port-forward command kubectl port-forward svc/ilum-zeppelin 8080:8080

Hive Metastore

Note: Hive Metastore is not enabled in Ilum by default.

Hive Metastore is a metadata store that keeps your Spark catalogs (Spark tables, databases, views, and more) in a database instead of runtime memory. You can view these schemas later on the Table Explorer page.

To enable the Hive Metastore bundled instance, add --set ilum-hive-metastore.enabled=true to your installation command. You'll also need to include --set ilum-core.hiveMetastore.enabled=true to link it with the Table Explorer.

Note that Hive Metastore uses a PostgreSQL database to store metadata. See the PostgreSQL section below.
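The two Hive Metastore flags can be passed together in one installation command. The sketch below only prints the command so that it can be reviewed before it is run. The release name ilum, the chart ilum/ilum, the namespace, and the helper enable_flags are assumptions made for illustration, so substitute the names used in your own installation.

```shell
#!/bin/sh
# Hypothetical helper: turn each Helm value key into " --set <key>=true".
enable_flags() {
  for key in "$@"; do
    printf ' --set %s=true' "$key"
  done
}

# Release name, chart, and namespace are assumptions; adjust as needed.
printf 'helm install ilum ilum/ilum --create-namespace -n ilum%s\n' \
  "$(enable_flags ilum-hive-metastore.enabled ilum-core.hiveMetastore.enabled)"
```

The same helper works for any other module that is switched on with an enabled value, for example ilum-sql.enabled together with ilum-core.sql.enabled.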

Ilum SQL

Note: Ilum SQL is not enabled in Ilum by default.

To enable it, add --set ilum-sql.enabled=true (the SQL execution host) and --set ilum-core.sql.enabled=true (the SQL viewer inside Ilum itself) to your installation command.

Ilum SQL lets you execute SQL queries on your data in the UI. Read more about it on the SQL Viewer page.

Trino

Note: Trino is not enabled in Ilum by default. To enable the built-in Trino distribution, add --set trino.enabled=true to your installation command.

Trino is a distributed SQL query engine designed for fast queries. It is generally better suited than Spark to interactive queries.

Ilum uses Trino to run SQL queries on your data in the UI. Read more about it on the SQL Viewer page.

n8n

Note: n8n is not enabled in Ilum by default. To enable it, add --set ilum-n8n.enabled=true to enable a built-in n8n distribution.

n8n is a fair-code workflow automation platform with native AI capabilities.

Apache Airflow

Airflow is a tool for managing data pipelines.

Note that Airflow is not bundled in the Ilum package by default. To run this service, add --set airflow.enabled=true to your installation command.

Airflow can use Ilum-Livy-proxy to create jobs integrated with Ilum Groups. If you want to use Ilum-Livy-proxy and it is disabled, enable it with the ilum-livy-proxy.enabled Helm value.

To access the Airflow UI, configure an ingress or use the port-forward command kubectl port-forward svc/ilum-webserver 8080:8080

Marquez

Marquez is an open-source metadata management tool that focuses on capturing, aggregating, and visualizing the lineage of data assets within an organization’s data ecosystem. It tracks how datasets are produced and consumed by different jobs and provides a central view of these dependencies.

Note that Marquez is not bundled in the Ilum package by default. To run this service, add --set global.lineage.enabled=true to your installation command.

Marquez uses a PostgreSQL database to store its metadata. See the PostgreSQL section below.

Additionally, if you wish to use Marquez’s web client instead of Ilum’s UI, enable the default web client with --set ilum-marquez.web.enabled=true and set up one of the access methods:

use the port-forward command kubectl port-forward svc/ilum-marquez-web 9444:9444
configure an ingress

Read more about Marquez and Ilum Lineage here

Kestra

Kestra is an open-source data orchestration platform designed for orchestrating and automating data pipelines and business workflows. You can read about it here.

Kestra is not enabled in Ilum by default. To enable it, add --set kestra.enabled=true to your installation command.

Kestra uses a PostgreSQL database to store data about jobs and tasks, and MinIO for general file storage.

PostgreSQL

PostgreSQL database is used by services such as Marquez, Hive Metastore, Airflow in order to store metadata.

PostgreSQL databases are enabled in Ilum by default.

If you want to control whether PostgreSQL is enabled or not, you can use helm value postgresql.enabled. For example, to disable it, you can add --set postgresql.enabled=false to your installation command.

Kube Prometheus Stack

Kube Prometheus Stack includes Prometheus, Grafana, and other tools for monitoring your data infrastructure.

Note that Kube Prometheus Stack is not bundled in the Ilum package by default. To run this service, add --set kube-prometheus-stack.enabled=true to your installation command.

If you are upgrading an existing Ilum Helm chart that previously did not have the Kube Prometheus Stack enabled, you must first install the required Prometheus Custom Resource Definitions (CRDs) before proceeding with the upgrade. To do this, run the following commands:

kubectl apply --server-side -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.80.0/example/prometheus-operator-crd/monitoring.coreos.com_alertmanagerconfigs.yaml
kubectl apply --server-side -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.80.0/example/prometheus-operator-crd/monitoring.coreos.com_alertmanagers.yaml
kubectl apply --server-side -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.80.0/example/prometheus-operator-crd/monitoring.coreos.com_podmonitors.yaml
kubectl apply --server-side -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.80.0/example/prometheus-operator-crd/monitoring.coreos.com_probes.yaml
kubectl apply --server-side -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.80.0/example/prometheus-operator-crd/monitoring.coreos.com_prometheusagents.yaml
kubectl apply --server-side -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.80.0/example/prometheus-operator-crd/monitoring.coreos.com_prometheuses.yaml
kubectl apply --server-side -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.80.0/example/prometheus-operator-crd/monitoring.coreos.com_prometheusrules.yaml
kubectl apply --server-side -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.80.0/example/prometheus-operator-crd/monitoring.coreos.com_scrapeconfigs.yaml
kubectl apply --server-side -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.80.0/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml
kubectl apply --server-side -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.80.0/example/prometheus-operator-crd/monitoring.coreos.com_thanosrulers.yaml
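The ten commands above differ only in the CRD name, so they can also be generated in a loop. This is a minimal sketch using the same version and file names as above. The variable names and the crd_urls helper are illustrative, and the script prints the commands instead of running them so they can be checked first.

```shell
#!/bin/sh
# Same prometheus-operator version and CRD names as the commands above.
VERSION="v0.80.0"
BASE="https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/${VERSION}/example/prometheus-operator-crd"
CRDS="alertmanagerconfigs alertmanagers podmonitors probes prometheusagents prometheuses prometheusrules scrapeconfigs servicemonitors thanosrulers"

# Print one manifest URL per CRD.
crd_urls() {
  for crd in $CRDS; do
    printf '%s/monitoring.coreos.com_%s.yaml\n' "$BASE" "$crd"
  done
}

# Print the kubectl commands; pipe this output to sh to apply them.
crd_urls | while read -r url; do
  printf 'kubectl apply --server-side -f %s\n' "$url"
done
```

Keeping the version in one variable makes it easier to change if a different chart release requires other CRD versions.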

If you want to access the Prometheus UI, the best way to do it is by configuring an ingress or using the port-forward command kubectl port-forward svc/prometheus-operated 9090:9090

If you want to access the Grafana UI, the best way to do it is by configuring an ingress or using the port-forward command kubectl port-forward svc/ilum-grafana 8080:80

Loki and Promtail

Loki gathers and manages the logs of your data infrastructure. Promtail is the agent that pushes logs into Loki.

Note that Loki is not enabled in Ilum by default. To run this service, add --set global.logAggregation.loki.enabled=true to your installation command.

Promtail is not enabled in Ilum by default either. To enable it, add --set global.logAggregation.promtail.enabled=true to your installation command.

To access Loki and run Loki queries, configure an ingress or use the port-forward command kubectl port-forward svc/ilum-loki-read 3100:3100 for read queries and kubectl port-forward svc/ilum-loki-write 3100:3100 for write queries. You can also use the ilum-loki-gateway service to link Grafana to Loki.

Graphite

Note that Graphite is not bundled in the Ilum package by default. To run this service, add --set graphite-exporter.graphite.enabled=true to your installation command.

Ilum as Identity Provider

Ilum can be deployed as an Identity Provider. With this feature, you can manage users exclusively within Ilum and authenticate them across other microservices such as Airflow, Superset, Grafana, Gitea, and MinIO.

To enable Ilum's Identity Provider, add the following flags: --set global.security.hydra.enabled=true and --set global.security.hydra.uiUrl=<your-ilum-ui-domain>.

To learn more about Identity Provider configuration, visit this page

Troubleshooting

Image Pulling Errors

During the installation of Ilum on your cluster, Helm will pull Docker images, which may be as large as 10 GB, depending on the additional modules you enable. Consequently, with a slow internet connection, you might encounter Image Pull Timeout errors if the image download time exceeds the configured timeout. To resolve this issue, you can:

Pull the Docker image manually by running:

minikube ssh docker pull <image>
# for example
minikube ssh docker pull ilum/core:6.1.3

Change the image pull timeout in your Kubernetes configuration like this:

minikube start --extra-config=kubelet.runtime-request-timeout=5m

or like this:

minikube start --extra-config=kubelet.image-pull-progress-deadline=5m

Default Passwords / Credentials

Ilum comes with predefined credentials for various modules to simplify initial setup and testing. However, for production deployments, it is critical to change these default credentials to ensure security and prevent unauthorized access.

Default Credentials

Application	Default Username	Default Password
Ilum UI	admin	admin
MinIO Console	minioadmin	minioadmin
Airflow Web UI	admin	admin
Superset UI	admin	admin
Gitea UI	ilum	ilum
Grafana	admin	admin

Database Credentials (For Internal Use)

Database	Default Username	Default Password
PostgreSQL	postgres	CHANGEMEPLEASE
Marquez	postgres	CHANGEMEPLEASE
Hive Metastore	postgres	CHANGEMEPLEASE
