Ilum Storage

To give Ilum users greater flexibility and a wider range of options, we have extended the way the storage that Ilum runs on can be configured. Until now, Ilum was tightly coupled to S3, but as of version 6.1.0 this is no longer the case. Currently, one of four storage types can be used:

  1. S3 - Amazon Simple Storage Service or any S3-compatible interface such as SeaweedFS or MinIO.
  2. GCS - Google Cloud Storage.
  3. WASBS - Azure Blob Storage.
  4. HDFS - Hadoop Distributed File System.

The Ilum default cluster storage can be configured with Helm values. Here's how to configure each storage type using the helm upgrade command:

S3

helm upgrade \
--set ilum-core.kubernetes.upgradeClusterOnStartup=true \
--set ilum-core.kubernetes.storage.type=s3 \
--set ilum-core.kubernetes.s3.host=ilum-minio \
--set ilum-core.kubernetes.s3.port=9000 \
--set ilum-core.kubernetes.s3.sparkBucket=ilum-spark \
--set ilum-core.kubernetes.s3.dataBucket=ilum-data \
--set ilum-core.kubernetes.s3.accessKey=minioadmin \
--set ilum-core.kubernetes.s3.secretKey=minioadmin \
--reuse-values ilum ilum/ilum

GCS

helm upgrade \
--set ilum-core.kubernetes.upgradeClusterOnStartup=true \
--set ilum-core.kubernetes.storage.type=gcs \
--set ilum-core.kubernetes.gcs.clientEmail=gcsEmail \
--set ilum-core.kubernetes.gcs.privateKey=gcsPrivateKey \
--set ilum-core.kubernetes.gcs.privateKeyId=gcsPrivateKeyId \
--set ilum-core.kubernetes.gcs.sparkBucket=ilum-spark \
--set ilum-core.kubernetes.gcs.dataBucket=ilum-data \
--reuse-values ilum ilum/ilum

WASBS - Azure Blob Storage

helm upgrade \
--set ilum-core.kubernetes.upgradeClusterOnStartup=true \
--set ilum-core.kubernetes.storage.type=wasbs \
--set ilum-core.kubernetes.wasbs.accessKey=wasbsAccessKey \
--set ilum-core.kubernetes.wasbs.accountName=wasbsAccountName \
--set ilum-core.kubernetes.wasbs.sparkContainer=ilum-spark \
--set ilum-core.kubernetes.wasbs.dataContainer=ilum-data \
--reuse-values ilum ilum/ilum

HDFS

helm upgrade \
--set ilum-core.kubernetes.upgradeClusterOnStartup=true \
--set ilum-core.kubernetes.storage.type=hdfs \
--set ilum-core.kubernetes.hdfs.hadoopUsername=hdfs \
--set ilum-core.kubernetes.hdfs.config.'core-site\.xml'=base64EncodedCore-SiteFileContent \
--set ilum-core.kubernetes.hdfs.config.'yarn-site\.xml'=base64EncodedYarn-SiteFileContent \
--set ilum-core.kubernetes.hdfs.config.'hdfs-site\.xml'=base64EncodedHdfs-SiteFileContent \
--set ilum-core.kubernetes.hdfs.config.'mapred-site\.xml'=base64EncodedMapred-SiteFileContent \
--set ilum-core.kubernetes.hdfs.sparkCatalog=ilum-spark \
--set ilum-core.kubernetes.hdfs.dataCatalog=ilum-data \
--set ilum-core.kubernetes.hdfs.logDirectory=hdfs://10.1.2.3/user/hdfs/ilum-spark/ilum/logs \
--reuse-values ilum ilum/ilum

Note: the HDFS configuration files must be Base64-encoded, as shown below.
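For example, assuming the Hadoop config files sit in the current directory, you can encode them inline when running the upgrade (only two of the four files are shown for brevity):

# Encode the config files without line wrapping (GNU base64;
# on macOS use "base64 -i core-site.xml" instead, as the BSD
# variant does not wrap output and lacks the -w flag).
CORE_SITE=$(base64 -w0 core-site.xml)
HDFS_SITE=$(base64 -w0 hdfs-site.xml)

helm upgrade \
--set ilum-core.kubernetes.hdfs.config.'core-site\.xml'="$CORE_SITE" \
--set ilum-core.kubernetes.hdfs.config.'hdfs-site\.xml'="$HDFS_SITE" \
--reuse-values ilum ilum/ilum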

Warning: if you want to change the Spark storage of a cluster, make sure that it has no groups or jobs assigned to it; otherwise you may run into problems removing them after the update.

Depending on which of the above options you choose, the default cluster, and the Spark History Server if enabled, will be created with that specific storage configuration.

Multiple storages can be assigned to each Ilum cluster at creation time. The necessary configuration for each of them is passed down to the cluster's Spark jobs, so Ilum tables located in those storages can be accessed easily, as sketched below.
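For instance, a Spark job running on a cluster with both an S3 and a GCS storage attached could read data from both without handling credentials itself, since Ilum injects the relevant configuration into the job. A minimal PySpark sketch, where the bucket names, table paths, and the customer_id join column are hypothetical:

from pyspark.sql import SparkSession

# Storage endpoints and credentials are injected into the Spark
# configuration by Ilum, so plain storage URIs are enough here.
spark = SparkSession.builder.getOrCreate()

# Hypothetical tables living on two different storages of the same cluster.
events = spark.read.parquet("s3a://ilum-data/tables/events")
customers = spark.read.parquet("gs://ilum-data/tables/customers")

# Join data across the two storages as if it were local.
events.join(customers, "customer_id").show()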

Example

You can find examples of how to take advantage of a multi-storage Ilum cluster in your Spark jobs in this repository