Configure Cloud Object Storage (GCS, S3, Azure) for Data Lake
Ilum allows you to link GCS, S3, WASBS, and HDFS storage to your clusters. Once storage is linked, Ilum automatically configures all your jobs to use your cloud data lake, eliminating the need for manual Spark parameter configuration.
Supported Storage Providers
| Provider | Type | Description |
|---|---|---|
| Google Cloud Storage | GCS | Native integration for GCP projects. |
| Amazon S3 | S3 | Standard S3 and S3-compatible storage support. |
| Azure Blob Storage | WASBS/ABFS | Integration for Azure data lakes. |
| HDFS | HDFS | Connect to existing Hadoop Distributed File Systems. |
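Once linked, jobs address each storage with its standard URI scheme. For orientation, a few illustrative paths (all bucket, container, and host names below are placeholders):

```scala
// Illustrative path formats per provider; every name here is a placeholder.
val gcsPath   = "gs://my-ilum-bucket/data/"                                  // Google Cloud Storage
val s3Path    = "s3a://my-ilum-bucket/data/"                                 // Amazon S3 (s3a connector)
val azurePath = "wasbs://my-container@myaccount.blob.core.windows.net/data/" // Azure Blob Storage
val hdfsPath  = "hdfs://namenode:8020/data/"                                 // HDFS
```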
Google Cloud Storage (GCS)
Step 1: Create a GCS Bucket
1. Create a Google Cloud Project
   - Open the Google Cloud Console and go to Project Selector / Manage Resources.
   - Click New Project / Create Project.
   - Enter a Project name, then choose an Organization and Location.
2. Create a GCS Bucket
   - In the Console, navigate to Cloud Storage → Buckets.
   - Click Create.
   - Enter a globally unique Bucket name (e.g., my-ilum-bucket) and select your Region.

   Note: Remember the bucket name you created - you will need it when adding this storage to Ilum.
3. Create a Service Account and JSON Key
   - Go to IAM & Admin → Service Accounts.
   - Click Create Service Account, fill in the details, and grant the Storage Admin role.
   - Click the created service account's email, open the Keys tab, and use Create new key (JSON).
   - Save the downloaded JSON file securely.

   Important: In new organizations, creating service account keys might be disabled by default by organization policy. Contact your administrator if you cannot create keys.
Step 2: Add GCS to Ilum Cluster
1. Navigate to Workloads → Clusters → Edit → Storage → Add Storage.
2. Configure General Settings:
| Parameter | Value Example | Description |
|---|---|---|
| Name | my-gcs-storage | Unique name for this storage config. |
| Type | GCS | Select GCS provider. |
| Spark Bucket | my-ilum-bucket | Bucket for Spark logs/events. |
| Data Bucket | my-ilum-bucket | Bucket for your data. |
3. Configure GCS Authorization: Open your JSON key file and copy the values:
| Parameter | Source Key | Description |
|---|---|---|
| Client Email | client_email | Service account email address. |
| Private Key | private_key | Full key including -----BEGIN.... |
| Private Key ID | private_key_id | Key ID string. |
4. Click Submit to save.
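For context, this is roughly the per-job Hadoop configuration that linking GCS saves you from writing by hand. A minimal sketch assuming the open-source gcs-connector's keyfile-based properties (property names can vary between connector versions; all values are placeholders):

```scala
import org.apache.spark.sql.SparkSession

// A sketch of the manual alternative Ilum automates: keyfile-based auth
// for the GCS Hadoop connector. Values come from the service account JSON key.
val spark = SparkSession.builder()
  .appName("gcs-manual-config-sketch")
  .config("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
  .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
  .config("spark.hadoop.fs.gs.auth.service.account.email", "<client_email>")
  .config("spark.hadoop.fs.gs.auth.service.account.private.key.id", "<private_key_id>")
  .config("spark.hadoop.fs.gs.auth.service.account.private.key", "<private_key>")
  .getOrCreate()
```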
Amazon S3
The process for adding S3 storage is nearly identical to GCS. You will need to provide your AWS credentials (Access Key and Secret Key) instead of a JSON key file.
- Navigate to Workloads → Clusters → Edit → Storage → Add Storage.
- Select S3 as the Type.
- Fill in the required fields:
| Parameter | Description |
|---|---|
| Name | Unique name for this storage config. |
| Access Key | Your AWS Access Key ID. |
| Secret Key | Your AWS Secret Access Key. |
| Region | AWS Region of your bucket (e.g., us-east-1). |
| Endpoint | (Optional) Custom endpoint for S3-compatible storage (e.g., MinIO). |
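As with GCS, Ilum applies the equivalent s3a settings to each job for you. A minimal sketch of the manual alternative, using standard hadoop-aws property names (values are placeholders; the endpoint settings only matter for S3-compatible stores such as MinIO):

```scala
import org.apache.spark.sql.SparkSession

// A sketch of the manual alternative Ilum automates: s3a credentials and,
// optionally, a custom endpoint for S3-compatible storage. Values are placeholders.
val spark = SparkSession.builder()
  .appName("s3-manual-config-sketch")
  .config("spark.hadoop.fs.s3a.access.key", "<AWS access key ID>")
  .config("spark.hadoop.fs.s3a.secret.key", "<AWS secret access key>")
  .config("spark.hadoop.fs.s3a.endpoint", "<custom endpoint, e.g. a MinIO URL>") // optional
  .config("spark.hadoop.fs.s3a.path.style.access", "true")                       // often needed for MinIO
  .getOrCreate()
```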
Azure Blob Storage
The process for adding Azure storage is nearly identical to GCS and S3. You will need your Azure Storage Account Name and Access Key.
- Navigate to Workloads → Clusters → Edit → Storage → Add Storage.
- Select Azure (or WASBS) as the Type.
- Fill in the required fields:
| Parameter | Description |
|---|---|
| Name | Unique name for this storage config. |
| Account Name | Your Azure Storage Account name. |
| Account Key | Your Azure Storage Account Access Key. |
| Container | Name of the container to use. |
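Here too, linking the storage replaces manual Hadoop configuration. For classic Blob Storage (wasbs://), the account key is normally supplied through a single property scoped to the storage account's host name; a sketch with placeholder names:

```scala
import org.apache.spark.sql.SparkSession

// A sketch of the manual alternative Ilum automates for wasbs://.
// The property name embeds the storage account host; all names are placeholders.
val accountName = "myaccount"
val spark = SparkSession.builder()
  .appName("azure-manual-config-sketch")
  .config(s"spark.hadoop.fs.azure.account.key.$accountName.blob.core.windows.net",
    "<storage account access key>")
  .getOrCreate()

// Paths then look like: wasbs://<container>@myaccount.blob.core.windows.net/<path>
```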
Step 3: Verify Connection
To ensure your storage is correctly configured, run a simple Spark job.
1. Create a Code Service:
   - Go to Workloads → Services → New Service +.
   - Select Type: Code, Language: Scala, and your Cluster.
2. Execute Test Code: Paste and run the following Scala code:

   ```scala
   // Test storage connection: write test data
   val data = Seq(("Alice", 34), ("Bob", 45))
   val df = spark.createDataFrame(data).toDF("name", "age")

   // Replace with your bucket path (e.g., gs://..., s3a://..., wasbs://...)
   val path = "gs://my-ilum-bucket/output/"
   df.write.mode("overwrite").format("csv").save(path)

   // Read the data back
   spark.read.format("csv").load(path).show()
   ```

3. Check Results: If the job completes and displays the data table, your storage connection is active.
Common Issues & FAQ
Why do I get a "Permission Denied" error?
Cause: The service account or user doesn't have permission to access the bucket.
Solution:
- Go to your cloud provider's console (e.g., Google Cloud Console).
- Navigate to the bucket's Permissions tab.
- Grant your service account the Storage Admin or Storage Object Admin role.
Why does it say "Bucket does not exist"?
Cause: The bucket name in your code doesn't match the actual bucket name, or the region is incorrect.
Solution:
- Verify the bucket exists in your cloud console.
- Check that the bucket name in your code matches exactly (names are often case-sensitive).
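To check programmatically, you can probe the bucket with the Hadoop FileSystem API through the same configuration your jobs use (a sketch; replace the URI with your provider's scheme and bucket):

```scala
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Probe the bucket through the Hadoop configuration the Spark session uses.
// Replace the URI with your own bucket (gs://, s3a://, wasbs://, ...).
val bucketUri = "gs://my-ilum-bucket/"
val fs = FileSystem.get(new URI(bucketUri), spark.sparkContext.hadoopConfiguration)
println(fs.exists(new Path(bucketUri))) // false suggests a name typo or a missing bucket
```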
Why do I get "Invalid credentials"?
Cause: The keys (JSON or Access Keys) were not copied correctly.
Solution:
- Re-open your key file.
- Carefully copy the values again. For GCS, ensure you include the -----BEGIN PRIVATE KEY----- and -----END PRIVATE KEY----- lines.
- Re-save the storage configuration in Ilum.
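One GCS-specific gotcha: in the JSON key file, the private_key field stores its line breaks as literal \n escapes. Depending on how you paste the value, you may need to restore real newlines first; a tiny illustrative check (the key material below is a placeholder):

```scala
// In the JSON key file, line breaks inside private_key appear as literal "\n".
// If the value was pasted with the escapes intact, convert them to real newlines.
val rawKeyFromJson = "-----BEGIN PRIVATE KEY-----\\nMIIE...\\n-----END PRIVATE KEY-----\\n"
val normalizedKey  = rawKeyFromJson.replace("\\n", "\n")
assert(normalizedKey.split("\n").head == "-----BEGIN PRIVATE KEY-----")
```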