Production
For production environments, it's recommended to deploy all dependencies in separate namespaces.
Kubernetes Prerequisites
Ilum has been extensively tested across all leading Kubernetes environments, ensuring compatibility with a variety of deployment scenarios. This includes lightweight Kubernetes distributions such as k3s and Rancher, as well as bare-metal Kubernetes clusters. Additionally, Ilum is fully compatible with major managed Kubernetes services in the cloud, including Google Kubernetes Engine (GKE), Amazon Elastic Kubernetes Service (EKS), and Azure Kubernetes Service (AKS).
Minikube for Testing
Throughout our documentation, we use Minikube for demonstration and testing purposes. Minikube provides an easy-to-set-up environment that allows users to quickly try out Ilum's features on a local machine. However, it is important to note that Minikube is not suitable for production deployments due to its limitations in scalability, resource management, and high availability.
For production use, we strongly recommend deploying Ilum on a robust Kubernetes setup that aligns with your infrastructure needs, ensuring optimal performance and reliability.
Prerequisites
The table below provides necessary prerequisites and related instructions.
Prerequisite | Instruction |
---|---|
MongoDB | Refer to https://bitnami.com/stack/mongodb/helm |
Kafka | Refer to https://bitnami.com/stack/kafka/helm |
Object storage | Refer to https://min.io/docs/minio/kubernetes/upstream/operations/installation.html |
ilum-core
helm install ilum-core --create-namespace -n ilum \
--set mongo.instances=<mongo uri> \
--set kafka.address=<kafka broker address> \
--set s3a.host=<s3 host> \
--set s3a.port=<s3 port> \
ilum/ilum-core
ilum-ui
helm install ilum-ui --create-namespace -n ilum ilum/ilum-ui
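The --set flags in the ilum-core command above can also be kept in a values file, which is easier to review and version. The following is a minimal sketch, not an official procedure: the file name ilum-core-values.yaml is hypothetical, the placeholders are the same ones used above, and the exact format each value expects is not described on this page. The script prints the install command instead of running it.

```shell
#!/usr/bin/env bash
# Hypothetical file name; override with VALUES_FILE if needed.
values_file="${VALUES_FILE:-ilum-core-values.yaml}"

# Same keys as the --set flags above, written as nested YAML.
# Replace every <placeholder> with the value for your environment.
cat > "$values_file" <<'EOF'
mongo:
  instances: "<mongo uri>"
kafka:
  address: "<kafka broker address>"
s3a:
  host: "<s3 host>"
  port: "<s3 port>"
EOF

# Print the command for review rather than executing it.
echo "helm install ilum-core --create-namespace -n ilum -f $values_file ilum/ilum-core"
```

Helm merges values passed with -f and --set, so individual settings can still be overridden on the command line.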
MongoDB
Ilum employs MongoDB as its storage layer, preserving all data required between restarts within the MongoDB database. Ilum automatically creates all necessary databases and collections during the startup process.
Apache Kafka
Apache Kafka serves as Ilum's communication layer, facilitating interaction between Ilum-Core and Spark jobs, as well as between different Ilum-Core instances when scaled. It is critical to ensure Apache Kafka brokers are accessible by both Ilum-Core and Spark jobs, especially when Spark jobs are launched on a different Kubernetes cluster.
Ilum utilizes Kafka to carry out communication using several topics, all created during Ilum's startup. Therefore, users don't need to manage these topics manually.
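Because the brokers must be reachable from both Ilum-Core and the Spark jobs, a quick TCP check can catch network problems early. This is a rough sketch under several assumptions: the broker address kafka.example.internal:9092 is hypothetical, the script relies on bash's /dev/tcp feature and the coreutils timeout command, and a successful TCP connection only shows that the port is open from the machine or pod where the script runs. To be meaningful, it would have to be run from inside each cluster involved.

```shell
#!/usr/bin/env bash
# Split a "host:port" address into its parts.
broker_host() { echo "${1%:*}"; }
broker_port() { echo "${1##*:}"; }

# Return 0 if a TCP connection to the broker opens within 3 seconds.
check_broker() {
  local host port
  host=$(broker_host "$1")
  port=$(broker_port "$1")
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Hypothetical address; use the value you pass to kafka.address.
broker="${KAFKA_BROKER:-kafka.example.internal:9092}"
if check_broker "$broker"; then
  echo "reachable: $broker"
else
  echo "NOT reachable: $broker"
fi
```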
MinIO
Ilum uses MinIO as the storage layer for Spark application components. All files (including jars, configurations, data files) needed for the operation of Spark components (driver, executors) are stored and made available for download via MinIO.
MinIO implements the S3 interface, which also enables it to store input/output data.
Security keys
This application uses JSON Web Tokens (JWT) for authentication purposes. By default, the application employs an RSA key pair, which is randomly generated at runtime, to sign these tokens.
In its standard configuration, the application creates a fresh RSA key pair each time it starts. This approach simplifies local development and testing by automatically handling the key generation process. However, it must be emphasized that this approach is not suitable for a production environment.
The primary issue with using randomly generated keys in a production environment is the lack of persistence. Each time the application restarts, it generates a new RSA key pair, invalidating all previously issued tokens. This could lead to an abrupt and unanticipated logout for all users, disrupting user experience and potentially leading to data loss.
Generate private key
For a production environment, a stable and secure key pair should be manually generated and used consistently. This ensures that tokens remain valid across multiple application restarts, thus providing a consistent user experience.
You can generate an RSA key pair manually using tools like OpenSSL. A common command to generate a 2048-bit RSA private key is as follows:
openssl genpkey -algorithm RSA \
-pkeyopt rsa_keygen_bits:2048 \
-pkeyopt rsa_keygen_pubexp:65537 | \
openssl pkcs8 -topk8 -nocrypt -outform pem > private-key.p8
The contents of the private key should look like the following:
-----BEGIN PRIVATE KEY-----
MIIEvQIBADANBgkqhkiG9w0BAQEFAASCBKcwggSjAgEAAoIBAQCsRnE83rm6BJya
nTyzVqX0SG+D4zBjkyWsOmGG+CoDdgQ6Z8AaocmnjP1SbRykQsQSMf6SeW+fdpH+
ccmzuHe7pZIa2o2Mg8xbk/UszJDaPztwoQbUt/2gHi/rZP8cIVkquzhnN/yxrMls
...
-----END PRIVATE KEY-----
To use the private key as the security.jwt.privateKey setting, remove the header and footer lines from the key.
Generate public key
To generate the corresponding public key, use:
openssl pkey -pubout -inform pem -outform pem -in private-key.p8 -out public-key.spki
The contents of the public key should look like the following:
-----BEGIN PUBLIC KEY-----
MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEArEZxPN65ugScmp08s1al
9Ehvg+MwY5MlrDphhvgqA3YEOmfAGqHJp4z9Um0cpELEEjH+knlvn3aR/nHJs7h3
u6WSGtqNjIPMW5P1LMyQ2j87cKEG1Lf9oB4v62T/HCFZKrs4Zzf8sazJbMN3E/mJ
...
-----END PUBLIC KEY-----
To use the public key as the security.jwt.publicKey setting, remove the header and footer lines from the key.
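Removing the header and footer by hand is error-prone, so it may be worth scripting. The sketch below assumes the file names from the commands above (private-key.p8 and public-key.spki). The helper name pem_body is hypothetical. Joining the remaining lines into a single line is also an assumption, made because a single line is easier to pass on a command line; keep the line breaks if your configuration expects them.

```shell
#!/usr/bin/env bash
# Print the base64 body of a PEM file:
# drop the -----BEGIN/END----- lines and join the remaining lines.
pem_body() {
  grep -v '^-----' "$1" | tr -d '\r\n'
}

# Report the body length for each key file that exists in the current directory.
for f in private-key.p8 public-key.spki; do
  if [ -f "$f" ]; then
    echo "$f: $(pem_body "$f" | wc -c) base64 characters"
  fi
done
```

The stripped values can then be captured in variables, for example private_key=$(pem_body private-key.p8), and supplied to the two security.jwt settings.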
Modules
Ilum provides several integrated, preconfigured modules that can be useful in your data infrastructure.
Ilum-Livy-proxy
Ilum-Livy-proxy is our implementation of the Livy API. It integrates Spark code with Ilum Groups in services such as Jupyter, Zeppelin, and Airflow.
Ilum-Livy-proxy is enabled in Ilum by default.
To enable or disable it, use the ilum-livy-proxy.enabled Helm value. For example, add --set ilum-livy-proxy.enabled=false to your installation command to disable it.
Read more about Ilum-Livy-proxy here
Jupyter
Jupyter is a notebook: a development environment that lets you combine code, charts, explanations, and more in one executable document.
Jupyter is enabled in Ilum by default.
To control whether it is enabled, use the ilum-jupyter.enabled Helm value. For example, add
--set ilum-jupyter.enabled=false
to your installation command to disable it.
Be aware that Jupyter uses Ilum-Livy-proxy to integrate with Ilum Groups, so Ilum-Livy-proxy should be enabled as well:
--set ilum-livy-proxy.enabled=true
If you want to access the Jupyter UI, you can do it by:
- using Ilum UI: go to Modules > Jupyter
- configuring an ingress
- using the port-forward command
kubectl port-forward svc/ilum-jupyter 8888:8888
Read more about Jupyter here
Apache Zeppelin
Zeppelin is a notebook: a development environment that lets you combine code, charts, explanations, and more in one executable document.
Zeppelin is not included in the Ilum package by default. To run this service, add --set ilum-zeppelin.enabled=true to your installation command.
Be aware that Zeppelin uses Ilum-Livy-proxy to integrate with Ilum Groups, so Ilum-Livy-proxy should be enabled as well:
--set ilum-livy-proxy.enabled=true
To access the Zeppelin UI, configure an ingress or use the port-forward command:
kubectl port-forward svc/ilum-zeppelin 8080:8080
Read more about Zeppelin here
Hive Metastore
Note: Hive Metastore is not enabled in Ilum by default.
Hive Metastore is a metadata storage used to store your Spark catalogs (Spark tables, databases, views, and more) in a database instead of runtime memory. You can view these schemas later on the Table Explorer page.
To enable the bundled Hive Metastore instance, add --set ilum-hive-metastore.enabled=true to your installation command.
You also need to include --set ilum-core.hiveMetastore.enabled=true to link it with the Table Explorer.
Note that Hive Metastore uses a PostgreSQL database to store metadata. You can read about it below.
Ilum SQL
Note: Ilum SQL is not enabled in Ilum by default.
Ilum SQL lets you execute SQL queries on your data from the UI. To use it, add --set ilum-sql.enabled=true to enable the SQL execution host and --set ilum-core.sql.enabled=true to enable the SQL viewer inside Ilum itself.
Read more about it on the SQL Viewer page.
Apache Airflow
Airflow is a tool for managing data pipelines.
Airflow is not included in the Ilum package by default. To run this service, add --set airflow.enabled=true to your installation command.
Airflow can use Ilum-Livy-proxy to create jobs integrated with Ilum Groups. If you want to use Ilum-Livy-proxy and it is disabled, enable it with the ilum-livy-proxy.enabled Helm value.
To access the Airflow UI, configure an ingress or use the port-forward command:
kubectl port-forward svc/ilum-webserver 8080:8080
Marquez
Marquez stores metadata about data flow in your data infrastructure. Ilum Lineage uses this metadata to present the relationships between jobs and datasets as a graph.
Marquez is not included in the Ilum package by default. To run this service, add --set global.lineage.enabled=true to your installation command, and --set ilum-marquez.web.enabled=true for the web client.
Note that Marquez uses a PostgreSQL database to store the metadata. You can read about it below.
If you want to access the Marquez UI, you can do it by:
- configuring an ingress
- using the port-forward command
kubectl port-forward svc/ilum-marquez-web 9444:9444
Read more about Marquez and Ilum Lineage here
PostgreSQL
PostgreSQL is used by services such as Marquez, Hive Metastore, and Airflow to store metadata.
PostgreSQL is enabled in Ilum by default.
To control whether PostgreSQL is enabled, use the postgresql.enabled Helm value. For example, to disable it, add --set postgresql.enabled=false to your installation command.
Kube Prometheus Stack
Kube Prometheus Stack includes Prometheus, Grafana, and other tools for monitoring your data infrastructure.
Kube Prometheus Stack is not included in the Ilum package by default. To run this service, add --set kube-prometheus-stack.enabled=true to your installation command.
To access the Prometheus UI, configure an ingress or use the port-forward command:
kubectl port-forward svc/prometheus-operated 9090:9090
To access the Grafana UI, configure an ingress or use the port-forward command:
kubectl port-forward svc/ilum-grafana 8080:80
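The port-forward commands for the module UIs mentioned so far can be collected in one small helper. This is only a convenience sketch: the function name ui_forward and the ILUM_NAMESPACE variable are hypothetical, and the -n flag assumes the services run in the ilum namespace used by the installation commands earlier on this page. The helper prints the command rather than running it.

```shell
#!/usr/bin/env bash
# Print the port-forward command for a module UI,
# using the service names and ports listed on this page.
ui_forward() {
  local ns="${ILUM_NAMESPACE:-ilum}" target
  case "$1" in
    jupyter)    target="svc/ilum-jupyter 8888:8888" ;;
    zeppelin)   target="svc/ilum-zeppelin 8080:8080" ;;
    airflow)    target="svc/ilum-webserver 8080:8080" ;;
    marquez)    target="svc/ilum-marquez-web 9444:9444" ;;
    prometheus) target="svc/prometheus-operated 9090:9090" ;;
    grafana)    target="svc/ilum-grafana 8080:80" ;;
    *) echo "unknown UI: $1" >&2; return 1 ;;
  esac
  echo "kubectl port-forward -n $ns $target"
}

ui_forward grafana
```

Zeppelin, Airflow, and Grafana all map to local port 8080 in the commands above, so only one of them can be forwarded at a time unless you change the local port (the number before the colon).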
Loki and Promtail
Loki gathers and manages the logs of your data infrastructure. Promtail is the agent that pushes logs into Loki.
Loki is not enabled in Ilum by default. To run this service, add --set global.logAggregation.loki.enabled=true to your installation command.
Promtail is also not enabled in Ilum by default. To enable it, add --set global.logAggregation.promtail.enabled=true to your installation command.
To access Loki and run Loki queries, configure an ingress or use the port-forward command kubectl port-forward svc/ilum-loki-read 3100:3100 for read queries and kubectl port-forward svc/ilum-loki-write 3100:3100 for write queries. You can also use the ilum-loki-gateway service to link Grafana to Loki.
Graphite
Graphite is not included in the Ilum package by default. To run this service, add --set graphite-exporter.graphite.enabled=true to your installation command.
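Several modules need more than one flag, so it can help to assemble the installation command from module names. The sketch below uses only flags listed in the module sections above. The release name ilum and the chart name ilum/ilum are assumptions: this page only shows the ilum/ilum-core and ilum/ilum-ui charts, and the ilum-core. prefix on some values suggests they belong to a parent chart. The function names are hypothetical, and the command is printed, not executed.

```shell
#!/usr/bin/env bash
# Map a module name to the --set flags described on this page.
module_flags() {
  case "$1" in
    zeppelin)   echo "--set ilum-zeppelin.enabled=true --set ilum-livy-proxy.enabled=true" ;;
    hive)       echo "--set ilum-hive-metastore.enabled=true --set ilum-core.hiveMetastore.enabled=true" ;;
    sql)        echo "--set ilum-sql.enabled=true --set ilum-core.sql.enabled=true" ;;
    airflow)    echo "--set airflow.enabled=true" ;;
    lineage)    echo "--set global.lineage.enabled=true --set ilum-marquez.web.enabled=true" ;;
    monitoring) echo "--set kube-prometheus-stack.enabled=true" ;;
    *) echo "unknown module: $1" >&2; return 1 ;;
  esac
}

# Build (but do not run) an install command for the given modules.
# Release name "ilum" and chart "ilum/ilum" are assumptions.
build_install_command() {
  local cmd="helm install ilum --create-namespace -n ilum ilum/ilum"
  local m flags
  for m in "$@"; do
    flags=$(module_flags "$m") || return 1
    cmd="$cmd $flags"
  done
  echo "$cmd"
}

build_install_command hive sql
```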
Troubleshooting
Image Pulling Errors
When you install Ilum on your cluster, Kubernetes pulls the Docker images referenced by the Helm chart. Together these may be as large as 10 GB, depending on the additional modules you enable. With a slow internet connection, you might therefore encounter Image Pull Timeout errors if an image download takes longer than the configured timeout. To resolve this issue, you can:
- Pull the Docker image manually by running:
minikube ssh docker pull <image>
# for example
minikube ssh docker pull ilum/core:6.1.3
- Change the image pull timeout in your Kubernetes configuration like this:
minikube start --extra-config=kubelet.runtime-request-timeout=5m
or like this:
minikube start --extra-config=kubelet.image-pull-progress-deadline=5m
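When several images need to be pre-pulled, the manual pull from the first option can be wrapped in a loop. This is a sketch for Minikube with the Docker runtime only. The function name prepull_images and the DRY_RUN switch are hypothetical, and the image list is just the example from this section written in name:tag form. With DRY_RUN left at its default of 1, the commands are printed instead of executed.

```shell
#!/usr/bin/env bash
# Pull each given image inside the Minikube node.
# DRY_RUN=1 (default) prints the commands; DRY_RUN=0 runs them.
prepull_images() {
  local image
  for image in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "minikube ssh docker pull $image"
    else
      minikube ssh docker pull "$image"
    fi
  done
}

prepull_images ilum/core:6.1.3
```

Run it with DRY_RUN=0 once the printed commands look right.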
Default Passwords / Credentials
Ilum comes with predefined credentials for various modules to simplify initial setup and testing. However, for production deployments, it is critical to change these default credentials to ensure security and prevent unauthorized access.
Default Credentials
Application | Default Username | Default Password |
---|---|---|
Ilum UI | admin | admin |
MinIO Console | minioadmin | minioadmin |
Airflow Web UI | admin | admin |
Superset UI | admin | admin |
Gitea UI | ilum | ilum |
Grafana | admin | admin |
Database Credentials (For Internal Use)
Database | Default Username | Default Password |
---|---|---|
PostgreSQL | postgres | CHANGEMEPLEASE |
Marquez | postgres | CHANGEMEPLEASE |
Hive Metastore | postgres | CHANGEMEPLEASE |
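Replacing the defaults starts with generating strong values. The sketch below only generates random passwords. The Helm values or UI settings that accept them are not listed on this page, so that step is left out. The function name gen_password and the application labels in the loop are hypothetical.

```shell
#!/usr/bin/env bash
# Generate a random alphanumeric password of the given length (default 32)
# from /dev/urandom.
gen_password() {
  local length="${1:-32}" raw=""
  # Keep reading random bytes until enough alphanumeric characters are collected.
  while [ "${#raw}" -lt "$length" ]; do
    raw+=$(head -c 256 /dev/urandom | LC_ALL=C tr -dc 'A-Za-z0-9')
  done
  printf '%s\n' "${raw:0:length}"
}

# One password per application from the tables above (labels are illustrative).
for app in ilum-ui minio airflow superset gitea grafana postgresql; do
  printf '%s\t%s\n' "$app" "$(gen_password 32)"
done
```

The output contains secrets in plain text, so store it in a secret manager or a Kubernetes Secret rather than in shell history or a shared file.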