Troubleshoot Object Storage

Overview

This page catalogs the symptoms an operator most commonly encounters when the object-storage layer misbehaves, along with the underlying cause and the recovery procedure for each. Most recipes end with a few concrete kubectl or helm commands.

502 Bad Gateway from /external/object-storage/ or /external/minio/

Symptom

Loading http://<ingress>/external/object-storage/ or http://<ingress>/external/minio/ returns 502 Bad Gateway from nginx. The Object Storage view in the Ilum UI shows the gateway error inside the iframe.

Likely cause

The ilum-objectstorage Service alias has no endpoints. The selector points at a label that no pod carries.

Diagnosis

Inspect the alias annotation, selector, and endpoints:

kubectl -n ilum get svc ilum-objectstorage \
-o jsonpath='active-provider: {.metadata.annotations.ilum\.cloud/object-storage-active-provider}{"\n"}selector: {.spec.selector}{"\n"}'
kubectl -n ilum get endpoints ilum-objectstorage

If the endpoints column shows <none>, the selector does not match any pod. Common causes:

  • objectStorage.activeProvider was set to a name that does not match any running provider's app.kubernetes.io/name label.
  • The provider's chart was disabled (<provider>.enabled=false) without flipping activeProvider to a still-running provider.
  • A pre-upgrade override left the alias selector in an inconsistent state.
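
For scripted health checks, the endpoints test above can be reduced to a count. The following is a minimal sketch, assuming the ilum namespace used throughout this page. The function name alias_endpoint_count is illustrative and not part of the chart.

```shell
# Sketch: count the ready endpoint addresses behind the alias Service.
# alias_endpoint_count is a hypothetical helper name, not part of Ilum.
alias_endpoint_count() {
  ns="${1:-ilum}"
  # When the selector matches no ready pod, the jsonpath yields an empty
  # string and the word count is 0.
  kubectl -n "$ns" get endpoints ilum-objectstorage \
    -o jsonpath='{.subsets[*].addresses[*].ip}' | wc -w | tr -d ' '
}

# Example:
#   [ "$(alias_endpoint_count ilum)" -gt 0 ] || echo "alias has no endpoints"
```

A result of 0 corresponds to the <none> case described above.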

Recovery

Roll back to the last release revision whose values are known to be correct:

helm history ilum -n ilum
helm rollback ilum <revision> -n ilum
kubectl -n ilum rollout restart deploy/ilum-ui

Alternatively, set activeProvider explicitly to a provider that is still running and re-upgrade:

helm upgrade ilum ilum/helm_aio -n ilum --reuse-values \
--set objectStorage.activeProvider=<provider>
kubectl -n ilum rollout restart deploy/ilum-ui

/external/object-storage/ redirects in a loop

Symptom

The browser keeps bouncing between /external/object-storage/ and the provider-specific console path; the page never renders.

Likely cause

The active provider's consoleMode is nginx-rewrite and its consolePath is /external/object-storage/ itself, so the redirect sends the browser back to where it came from.
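
The loop condition above can be expressed as a small pre-flight check to run against the values you are about to apply. This is a sketch only: the function name would_redirect_loop is hypothetical, and it assumes the two values are passed in by hand rather than read from the chart.

```shell
# Sketch: detect the self-referencing console path described above.
# would_redirect_loop is a hypothetical helper name.
would_redirect_loop() {
  mode="$1"
  # Drop one trailing slash so "/x" and "/x/" compare equal.
  path="${2%/}"
  [ "$mode" = "nginx-rewrite" ] && [ "$path" = "/external/object-storage" ]
}

# Example:
#   would_redirect_loop nginx-rewrite /external/object-storage/ && echo "loop"
```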

Recovery

Set the provider's consolePath to a provider-specific path so the redirect target is distinct:

helm upgrade ilum ilum/helm_aio -n ilum --reuse-values \
--set objectStorage.providers.<provider>.consolePath=/external/<provider>/

Object Storage nav button does not load

Symptom

Clicking the Object Storage entry in the Ilum UI loads a blank iframe or shows a "file not found" message.

Likely cause

ILUM_OBJECT_STORAGE_PATH in the ilum-ui ConfigMap resolves to a path that the nginx proxy does not route. Alternatively, no provider is active, so the path falls back to the chart-wide default /external/object-storage/, which returns 404 because no upstream is configured.

Diagnosis

Inspect the runtime path the UI uses:

kubectl -n ilum get configmap ilum-ui \
-o jsonpath='ILUM_OBJECT_STORAGE_PATH={.data.ILUM_OBJECT_STORAGE_PATH}{"\n"}'

Cross-check against the nginx configuration for the matching location block:

kubectl -n ilum exec deploy/ilum-ui -c ilum-ui -- \
grep -A5 'location /external/' /etc/nginx/conf.d/server.conf
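
The cross-check can also be scripted. The sketch below reads an nginx configuration on stdin and reports whether any location line mentions a given path. The helper name path_is_routed is hypothetical, and the match is a plain substring test, so it does not evaluate regex locations the way nginx would.

```shell
# Sketch: succeed when some "location" line in the config mentions the path.
# path_is_routed is a hypothetical helper name; input is read from stdin.
path_is_routed() {
  grep -E '^[[:space:]]*location' | grep -qF -- "$1"
}

# Example:
#   kubectl -n ilum exec deploy/ilum-ui -c ilum-ui -- \
#     cat /etc/nginx/conf.d/server.conf | path_is_routed /external/minio/
```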

Recovery

Ensure an in-cluster provider is enabled and either rely on the resolved default or override objectStorage.providers.<provider>.consolePath explicitly. Then restart the Ilum UI to pick up the new ConfigMap:

helm upgrade ilum ilum/helm_aio -n ilum --reuse-values \
--set <provider>.enabled=true
kubectl -n ilum rollout restart deploy/ilum-ui

helm template fails with "3 providers enabled"

Symptom

A helm install or helm upgrade fails at render time with a message similar to:

Error: ... objectStorage: 3 providers enabled ([minio rustfs seaweedfs]);
set objectStorage.activeProvider=<name> to pick which one user traffic
routes through

Likely cause

More than two providers are enabled simultaneously, and objectStorage.activeProvider is left at auto. The chart refuses to guess.

Recovery

Set the active provider explicitly:

helm upgrade ilum ilum/helm_aio -n ilum --reuse-values \
--set objectStorage.activeProvider=<provider>

Alternatively, disable the providers that are not relevant to user traffic by setting their enabled flags to false.

Alias has no endpoints despite a running provider

Symptom

A provider pod is running and ready, but kubectl get endpoints ilum-objectstorage shows <none>.

Likely cause

The pod's labels do not match the alias Service selector. The selector requires both app.kubernetes.io/name: <provider> and app.kubernetes.io/instance: <release>.

Diagnosis

kubectl -n ilum get pod -l app.kubernetes.io/name=<provider> \
-o jsonpath='{.items[*].metadata.labels}'
kubectl -n ilum get svc ilum-objectstorage -o jsonpath='{.spec.selector}'
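
Because the selector requires both labels, a single label-selector query shows which pods the alias would actually pick up. A minimal sketch, assuming the ilum namespace; pods_matching_alias is an illustrative name, not something the chart provides.

```shell
# Sketch: list pods that carry BOTH labels the alias selector requires.
# pods_matching_alias is a hypothetical helper name.
pods_matching_alias() {
  provider="$1"
  release="$2"
  ns="${3:-ilum}"
  kubectl -n "$ns" get pods \
    -l "app.kubernetes.io/name=$provider,app.kubernetes.io/instance=$release" \
    -o name
}

# Example: pods_matching_alias <provider> <release>
```

Empty output while the provider pod is running indicates that at least one of the two labels is missing or carries a different value.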

Recovery

For pods deployed by a chart, ensure the chart sets both required labels. For hand-rolled Deployments (such as those created by the Add a New Provider procedure), patch the pod template to include the missing labels and re-roll the Deployment.

Stuck pending-upgrade after a failed helm upgrade --wait

Symptom

helm history ilum shows a revision in pending-upgrade state. Every subsequent helm upgrade fails immediately with a message similar to:

Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress

Likely cause

A previous helm upgrade --wait was interrupted (network drop, laptop crash, Ctrl-C). The release Secret recording the in-flight upgrade was never finalized.

Recovery

Delete the stuck release Secret and retry:

kubectl -n ilum get secret -l owner=helm,name=ilum
kubectl -n ilum delete secret sh.helm.release.v1.ilum.v<revision>
helm upgrade ilum ilum/helm_aio -n ilum --reuse-values

The revision number is the highest one listed by helm history ilum that is in pending-upgrade state.
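
Picking the revision by eye is error-prone on a long history. The sketch below derives it from the helm history table and builds the matching Secret name. Both function names are hypothetical, and the parser assumes the default table output, in which the revision is the first column.

```shell
# Sketch: derive the stuck release Secret name from `helm history` output.
# pending_revision and release_secret_name are hypothetical helper names.
pending_revision() {
  # Reads the history table on stdin; prints the highest revision whose
  # row mentions pending-upgrade, or nothing when no such row exists.
  awk '/pending-upgrade/ && $1 + 0 > max { max = $1 + 0 }
       END { if (max) print max }'
}

release_secret_name() {
  # Follows the sh.helm.release.v1.<release>.v<revision> pattern used above.
  printf 'sh.helm.release.v1.%s.v%s\n' "$1" "$2"
}

# Example:
#   rev=$(helm history ilum -n ilum | pending_revision)
#   [ -n "$rev" ] && release_secret_name ilum "$rev"
```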

Cutover acknowledged but the alias still targets the old provider

Symptom

objectStorage.cutoverAcknowledged=true is set (or its legacy alias rustfs.migrationAcknowledged=true), but the alias annotation still shows the previous provider.

Likely cause

Either the Ilum UI's ConfigMap was not regenerated (the rollme: <random> annotation that forces an ilum-ui rollout did not change), or the operator did not run helm upgrade after flipping the flag.

Recovery

Re-run helm upgrade and force a UI rollout:

helm upgrade ilum ilum/helm_aio -n ilum --reuse-values \
--set objectStorage.cutoverAcknowledged=true
kubectl -n ilum rollout restart deploy/ilum-ui

Verify by inspecting the alias annotation:

kubectl -n ilum get svc ilum-objectstorage \
-o jsonpath='{.metadata.annotations.ilum\.cloud/object-storage-active-provider}{"\n"}'
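
In a cutover script, the same verification can gate the next step. This is a minimal sketch under the assumption that the annotation value is the bare provider name; alias_targets is a hypothetical helper name.

```shell
# Sketch: succeed only when the alias annotation names the expected provider.
# alias_targets is a hypothetical helper name.
alias_targets() {
  expected="$1"
  ns="${2:-ilum}"
  actual=$(kubectl -n "$ns" get svc ilum-objectstorage \
    -o jsonpath='{.metadata.annotations.ilum\.cloud/object-storage-active-provider}')
  [ "$actual" = "$expected" ]
}

# Example: alias_targets <provider> || echo "alias still points elsewhere"
```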

Bucket-init Job stays Pending or fails

Symptom

After helm install or helm upgrade, the init-rustfs-buckets or init-minio-policies Job does not reach Complete. helm install --wait times out, or the bundled consumers report missing buckets at startup.

Likely cause

One of the following:

  • The ilum-objectstorage-credentials Secret is missing or has empty values for access-key / secret-key.
  • The provider's Service resolves through cluster DNS but the provider pod is not yet Ready; the init Job's wait-for-<provider> init container is still looping.
  • The provider rejected the credentials (the bundled image baked in a different default than the live Secret).

Diagnosis

kubectl -n ilum logs job/init-rustfs-buckets -c wait-for-rustfs --tail=50
kubectl -n ilum logs job/init-rustfs-buckets --tail=200
kubectl -n ilum get secret ilum-objectstorage-credentials \
-o jsonpath='{.data.access-key}' | base64 -d; echo

Recovery

Populate the credentials Secret with all six aliased keys (access-key, secret-key, root-user, root-password, RUSTFS_ACCESS_KEY, RUSTFS_SECRET_KEY) and re-run the upgrade. The init Job is idempotent, so it is safe to retry: delete the Job and let helm upgrade re-create it:

kubectl -n ilum delete job init-rustfs-buckets || true
helm upgrade ilum ilum/helm_aio -n ilum --reuse-values

Credentials lookup error on helm upgrade

Symptom

helm upgrade fails at render time with a message similar to:

Error: ... values don't meet the specifications of the schema(s) ...
... ilum-objectstorage-credentials lookup is missing required keys ...

Likely cause

The chart resolves credentials in this order: live Secret values via lookup (when objectStorage.credentials.preserveExisting=true), then the literal defaults in values.yaml. When the live Secret exists but is missing one of the six aliased keys, the lookup returns an incomplete dictionary and the template fails the schema check.
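
To find out which of the six aliased keys is absent from the live Secret, each key can be queried in turn. The sketch below assumes the key names listed in the Bucket-init recipe and the ilum namespace; missing_credential_keys is a hypothetical helper name.

```shell
# Sketch: print every aliased key that is absent or empty in the Secret.
# missing_credential_keys is a hypothetical helper name.
missing_credential_keys() {
  ns="${1:-ilum}"
  for key in access-key secret-key root-user root-password \
             RUSTFS_ACCESS_KEY RUSTFS_SECRET_KEY; do
    value=$(kubectl -n "$ns" get secret ilum-objectstorage-credentials \
      -o jsonpath="{.data.$key}" 2>/dev/null)
    # An absent or empty key yields an empty string.
    [ -n "$value" ] || printf '%s\n' "$key"
  done
}

# Example: missing_credential_keys ilum
```

No output means all six keys are present and non-empty.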

Recovery

Either re-create the Secret with all six aliased keys, or disable the lookup and let the chart re-render the defaults:

# Option A: repopulate the Secret.
kubectl -n ilum delete secret ilum-objectstorage-credentials
helm upgrade ilum ilum/helm_aio -n ilum --reuse-values

# Option B: force deterministic render (loses any rotated credentials).
helm upgrade ilum ilum/helm_aio -n ilum --reuse-values \
--set objectStorage.credentials.preserveExisting=false

PVC bound to wrong StorageClass

Symptom

The provider's pod stays Pending, so its StatefulSet or Deployment never becomes ready. The pod's events log a message similar to:

0/3 nodes are available: pod has unbound immediate PersistentVolumeClaims

Likely cause

The chart-default storageClassName resolves to a class that does not match a CSI driver available on the cluster. This is common when moving the chart between cloud providers without overriding the storage class.
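
Before changing anything, it can help to confirm that the class you intend to set actually exists on the target cluster. A minimal sketch; storage_class_exists is a hypothetical helper name, and it relies only on kubectl exiting non-zero when the named resource is not found.

```shell
# Sketch: succeed when the named StorageClass exists on the cluster.
# storage_class_exists is a hypothetical helper name.
storage_class_exists() {
  # StorageClass is cluster-scoped, so no namespace flag is needed.
  kubectl get storageclass "$1" >/dev/null 2>&1
}

# Example:
#   storage_class_exists <cluster-storage-class> || echo "no such StorageClass"
```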

Recovery

Destructive

Deleting an existing PersistentVolumeClaim also deletes the underlying volume when the volume's reclaim policy is Delete, which is the default for dynamically provisioned volumes. Use this recipe on net-new installs only.

Set the correct storage class and re-create the PVCs:

kubectl get storageclass
kubectl -n ilum delete pvc -l app.kubernetes.io/name=rustfs
helm upgrade ilum ilum/helm_aio -n ilum --reuse-values \
--set rustfs.persistence.storageClass=<cluster-storage-class>

For pre-existing data, snapshot the source PVC and restore against the correct storage class before deletion. See Back Up and Restore Object Storage.

Post-cutover consumer still writes to the previous provider

Symptom

objectStorage.cutoverAcknowledged=true is set and mc diff confirms data parity, but one or more bundled consumers continue writing into the old provider's bucket.

Likely cause

The consumer cached its S3 endpoint at startup and has not refreshed since the cutover. The ilum-objectstorage Service alias re-targets the new provider instantly, but consumers that resolve the alias once on Pod startup do not pick up the change until they restart.

The Ilum UI rolls automatically when the helm upgrade regenerates the ilum-ui ConfigMap. Other consumers do not.

Recovery

Restart every consumer that targets the alias:

kubectl -n ilum rollout restart \
deploy/ilum-core \
deploy/ilum-jupyter \
deploy/ilum-mlflow \
deploy/ilum-kestra \
deploy/ilum-langfuse-web \
statefulset/ilum-hive-metastore
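
A bare rollout restart returns before the new Pods are ready, so a scripted variant might wait on each rollout in turn. The sketch below assumes the workload names listed above and the ilum namespace. The function name and the 300s timeout are illustrative choices, not chart defaults.

```shell
# Sketch: restart each consumer and wait for its rollout to finish.
# restart_alias_consumers is a hypothetical helper name.
restart_alias_consumers() {
  ns="${1:-ilum}"
  failed=0
  for target in deploy/ilum-core deploy/ilum-jupyter deploy/ilum-mlflow \
                deploy/ilum-kestra deploy/ilum-langfuse-web \
                statefulset/ilum-hive-metastore; do
    # Restart, then block until the rollout finishes or the timeout expires.
    if kubectl -n "$ns" rollout restart "$target" &&
       kubectl -n "$ns" rollout status "$target" --timeout=300s; then
      echo "restarted: $target"
    else
      echo "failed: $target" >&2
      failed=1
    fi
  done
  return "$failed"
}

# Example: restart_alias_consumers ilum
```

The loop keeps going after a failure so that one disabled or missing consumer does not block the rest; the non-zero return code signals that at least one target needs attention.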

Long-running Spark driver Pods are unaffected: each Spark job creates its own S3 client and resolves the alias afresh.

Reference