Configure Cloud Object Storage (GCS, S3, Azure) for Data Lake

Ilum lets you link GCS, S3, WASBS, and HDFS storage to your clusters. Once a storage is linked, Ilum automatically configures every job on that cluster to use it, so you do not have to set Spark storage parameters manually.
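
For context, the sketch below shows the kind of manual Spark configuration that linking a storage is meant to replace. It is an illustration only, not Ilum's implementation: the object and method names are hypothetical, and the property names are assumed from the GCS Hadoop connector 2.x, so they may differ in other connector versions.

```scala
// Hypothetical helper, NOT part of Ilum. Property names are assumed from the
// GCS Hadoop connector 2.x and may differ in other connector versions.
object ManualGcsConfig {
  // Builds the Spark properties one would otherwise set by hand for GCS access.
  def sparkProperties(
      clientEmail: String,
      privateKeyId: String,
      privateKey: String
  ): Map[String, String] = Map(
    "spark.hadoop.fs.gs.impl" ->
      "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    "spark.hadoop.fs.gs.auth.service.account.enable" -> "true",
    "spark.hadoop.fs.gs.auth.service.account.email" -> clientEmail,
    "spark.hadoop.fs.gs.auth.service.account.private.key.id" -> privateKeyId,
    "spark.hadoop.fs.gs.auth.service.account.private.key" -> privateKey
  )
}
```

Without a linked storage, each of these entries would typically have to be supplied per job, for example as `--conf` options.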

Supported Storage Providers

| Provider | Type | Description |
|---|---|---|
| Google Cloud Storage | GCS | Native integration for GCP projects. |
| Amazon S3 | S3 | Standard S3 and S3-compatible storage support. |
| Azure Blob Storage | WASBS/ABFS | Integration for Azure data lakes. |
| HDFS | HDFS | Connect to existing Hadoop Distributed File Systems. |
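
Each storage type is addressed in Spark through a URI scheme. The sketch below maps the types from the table to the schemes commonly used with Hadoop file systems. The `gs`, `s3a`, and `wasbs` schemes appear in the test code later in this guide; `abfs` and `hdfs` are assumed here as the usual Hadoop schemes. `StoragePaths` and its method are hypothetical names, not an Ilum API.

```scala
import java.util.Locale

// Hypothetical helper, not part of Ilum.
object StoragePaths {
  // Storage type -> URI scheme (abfs and hdfs are assumptions, see lead-in).
  private val schemes: Map[String, String] = Map(
    "GCS" -> "gs",
    "S3" -> "s3a",
    "WASBS" -> "wasbs",
    "ABFS" -> "abfs",
    "HDFS" -> "hdfs"
  )

  // `authority` is the bucket name for GCS and S3. For Azure it is normally
  // container@account-host, and for HDFS the NameNode host and port.
  def uri(storageType: String, authority: String, key: String): Option[String] =
    schemes.get(storageType.toUpperCase(Locale.ROOT)).map { scheme =>
      s"$scheme://$authority/${key.stripPrefix("/")}"
    }
}
```

For example, `StoragePaths.uri("GCS", "my-ilum-bucket", "output/")` yields `Some("gs://my-ilum-bucket/output/")`, and an unknown type yields `None`.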

Google Cloud Storage (GCS)

Step 1: Create a GCS Bucket

  1. Create a Google Cloud Project

    • Open Google Cloud Console and go to Project Selector / Manage Resources.
    • Click New Project / Create Project.
    • Enter a Project name, choose Organization and Location.
  2. Create a GCS Bucket

    • In the Console, navigate to Cloud Storage → Buckets.
    • Click Create.
    • Enter a globally unique Bucket name (e.g., my-ilum-bucket) and select your Region.
    Note: Remember the bucket name you created. You will need it when adding this storage to Ilum.

  3. Create a Service Account and JSON Key

    • Go to IAM & Admin → Service Accounts.
    • Click Create Service Account, fill in the details, and grant the Storage Admin role.
    • Click the created email, go to the Keys tab, and Create new key (JSON).
    • Save the downloaded JSON file securely.
    Important: Organization policy update. In new organizations, creating service account keys might be disabled by default. Contact your administrator if you cannot create keys.
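
Because the bucket name is reused later in Ilum and in job code, it can help to sanity-check it before going further. The sketch below applies a simplified subset of the GCS naming rules (3 to 63 characters; lowercase letters, digits, dashes, underscores, and dots; starting and ending with a letter or digit). It does not cover every rule, such as reserved prefixes or the longer limit for dotted names, and `BucketNames` is a hypothetical helper.

```scala
// Hypothetical helper: a simplified check, not the complete GCS naming rules.
object BucketNames {
  // 3-63 characters, lowercase letters/digits/dash/underscore/dot,
  // first and last character must be a letter or digit.
  private val Simplified = "^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$".r

  def looksValid(name: String): Boolean =
    Simplified.pattern.matcher(name).matches()
}
```

With this check, `my-ilum-bucket` passes, while `My-Bucket` (uppercase letters) and `ab` (too short) do not.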

Step 2: Add GCS to Ilum Cluster

  1. Navigate to Workloads → Clusters → Edit → Storage → Add Storage.

  2. Configure General Settings:

| Parameter | Value Example | Description |
|---|---|---|
| Name | my-gcs-storage | Unique name for this storage config. |
| Type | GCS | Select GCS provider. |
| Spark Bucket | my-ilum-bucket | Bucket for Spark logs/events. |
| Data Bucket | my-ilum-bucket | Bucket for your data. |

  3. Configure GCS Authorization: Open your JSON key file and copy the values:

| Parameter | Source Key | Description |
|---|---|---|
| Client Email | client_email | Service account email address. |
| Private Key | private_key | Full key, including the -----BEGIN... line. |
| Private Key ID | private_key_id | Key ID string. |

  4. Click Submit to save.
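
To avoid copy errors when filling in the authorization fields, the three values can be pulled out of the key file programmatically. The sketch below is a dependency-free illustration that uses a regular expression instead of a JSON parser, so it assumes a flat key file with simple string values. `ServiceAccountKey` is a hypothetical helper, and values are returned exactly as they appear in the file, with escapes such as `\n` left intact.

```scala
import java.util.regex.Pattern

// Hypothetical helper: a regex-based sketch, not a general JSON parser.
object ServiceAccountKey {
  // Returns the raw string value of a top-level field, escapes left as-is.
  def field(json: String, name: String): Option[String] = {
    val pattern = (
      "\"" + Pattern.quote(name) + "\"\\s*:\\s*" +
        "\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\""
    ).r
    pattern.findFirstMatchIn(json).map(_.group(1))
  }

  // The three source keys listed in the authorization table above.
  def ilumFields(json: String): Map[String, Option[String]] =
    Seq("client_email", "private_key", "private_key_id")
      .map(key => key -> field(json, key))
      .toMap
}
```

If a JSON library is already available on the classpath, using it is the more robust choice; the regex here only keeps the example self-contained.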

Step 3: Verify Connection

To ensure your storage is correctly configured, run a simple Spark job.

  1. Create a Code Service:

    • Go to Workloads → Services → New Service +.
    • Select Type: Code, Language: Scala, and your Cluster.
  2. Execute Test Code: Paste and run the following Scala code:

    Test Storage Connection
    // Write test data
    val data = Seq(("Alice", 34), ("Bob", 45))
    val df = spark.createDataFrame(data).toDF("name", "age")

    // Replace with your bucket path (e.g., gs://..., s3a://..., wasbs://...)
    val path = "gs://my-ilum-bucket/output/"

    df.write.mode("overwrite").format("csv").save(path)

    // Read back data
    spark.read.format("csv").load(path).show()
  3. Check Results: If the job completes and displays the data table, your storage connection is active.


Common Issues & FAQ

Why do I get a "Permission Denied" error?

Cause: The service account or user does not have permission to access the bucket.

Solution:

  1. Go to your cloud provider's console (e.g., Google Cloud Console).
  2. Navigate to the bucket's Permissions tab.
  3. Grant your service account the Storage Admin or Storage Object Admin role.

Why does it say "Bucket does not exist"?

Cause: The bucket name in your code doesn't match the actual bucket name, or the region is incorrect.

Solution:

  1. Verify the bucket exists in your cloud console.
  2. Check that the bucket name in your code matches exactly. The comparison is case-sensitive, and GCS and S3 bucket names use lowercase letters only.
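
A quick way to catch a mismatch is to compare the bucket embedded in the job's path with the name configured in Ilum. The sketch below assumes `gs://` or `s3a://` style paths, where the URI authority is the bucket name; `BucketCheck` is a hypothetical helper.

```scala
import java.net.URI
import scala.util.Try

// Hypothetical helper for gs:// and s3a:// style paths.
object BucketCheck {
  // Extracts the authority part of the path, which is the bucket name.
  def bucketOf(path: String): Option[String] =
    Try(new URI(path)).toOption.flatMap(uri => Option(uri.getAuthority))

  // Exact, case-sensitive comparison with the configured bucket name.
  def matchesConfigured(path: String, configuredBucket: String): Boolean =
    bucketOf(path).contains(configuredBucket)
}
```

For instance, `BucketCheck.matchesConfigured("gs://My-Ilum-Bucket/output/", "my-ilum-bucket")` is `false` because the comparison is case-sensitive.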

Why do I get "Invalid credentials"?

Cause: The keys (JSON or Access Keys) were not copied correctly.

Solution:

  1. Re-open your key file.
  2. Carefully copy the values again. For GCS, ensure you include the -----BEGIN PRIVATE KEY----- and -----END PRIVATE KEY----- lines.
  3. Re-save the storage configuration in Ilum.
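
Before re-saving, a pasted GCS private key can be checked for the two marker lines mentioned above. The sketch below only verifies that the markers are present at the start and end of the value. It does not validate the key itself, and `PrivateKeyCheck` is a hypothetical helper.

```scala
// Hypothetical helper: checks marker lines only, not the key contents.
object PrivateKeyCheck {
  private val Begin = "-----BEGIN PRIVATE KEY-----"
  private val End = "-----END PRIVATE KEY-----"

  def looksComplete(pastedKey: String): Boolean = {
    // Treat literal "\n" escapes (as stored in the JSON file) as line
    // breaks, then ignore surrounding whitespace.
    val normalized = pastedKey.replace("\\n", "\n").trim
    normalized.startsWith(Begin) && normalized.endsWith(End)
  }
}
```

A value that was truncated during copying, for example one missing the END line, fails this check.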