Troubleshoot Object Storage
Overview
This page catalogs the symptoms an operator most commonly encounters
when something is off with the object-storage layer, together with the
underlying cause and the recovery procedure for each. Each recipe ends
in one or two concrete kubectl or helm commands.
502 Bad Gateway from /external/object-storage/ or /external/minio/
Symptom
Loading http://<ingress>/external/object-storage/ or
http://<ingress>/external/minio/ returns 502 Bad Gateway from
nginx. The Object Storage view in the Ilum UI shows the gateway
error inside the iframe.
Likely cause
The ilum-objectstorage Service alias has no endpoints. The selector
points at a label that no pod carries.
Diagnosis
Inspect the alias annotation, selector, and endpoints:
kubectl -n ilum get svc ilum-objectstorage \
-o jsonpath='active-provider: {.metadata.annotations.ilum\.cloud/object-storage-active-provider}{"\n"}selector: {.spec.selector}{"\n"}'
kubectl -n ilum get endpoints ilum-objectstorage
If the endpoints column shows <none>, the selector does not match any
pod. Common causes:
- objectStorage.activeProvider was set to a name that does not match any running provider's app.kubernetes.io/name label.
- The provider's chart was disabled (<provider>.enabled=false) without flipping activeProvider to a still-running provider.
- A pre-upgrade override left the alias selector in an inconsistent state.
Recovery
Roll back to the last release revision whose values are known to be correct:
helm history ilum -n ilum
helm rollback ilum <revision> -n ilum
kubectl -n ilum rollout restart deploy/ilum-ui
Alternatively, point activeProvider at a still-running provider (or
reset it to auto, as shown here, so that the chart resolves it) and
re-upgrade:
helm upgrade ilum ilum/helm_aio -n ilum --reuse-values \
--set objectStorage.activeProvider=auto
kubectl -n ilum rollout restart deploy/ilum-ui
/external/object-storage/ redirects in a loop
Symptom
The browser keeps bouncing between /external/object-storage/ and the
provider-specific console path; the page never renders.
Likely cause
The active provider's consoleMode is nginx-rewrite and its
consolePath is /external/object-storage/ itself, so the redirect
sends the browser back to where it came from.
Recovery
Set the provider's consolePath to a provider-specific path so the
redirect target is distinct:
helm upgrade ilum ilum/helm_aio -n ilum --reuse-values \
--set objectStorage.providers.<provider>.consolePath=/external/<provider>/
Object Storage nav button does not load
Symptom
Clicking the Object Storage entry in the Ilum UI loads a blank iframe or shows a "file not found" message.
Likely cause
ILUM_OBJECT_STORAGE_PATH in the ilum-ui ConfigMap resolves to a
path that the nginx proxy does not route, or no provider is active
and the path falls back to the chart-wide default
/external/object-storage/ which then 404s because no upstream is
configured.
Diagnosis
Inspect the runtime path the UI uses:
kubectl -n ilum get configmap ilum-ui \
-o jsonpath='ILUM_OBJECT_STORAGE_PATH={.data.ILUM_OBJECT_STORAGE_PATH}{"\n"}'
Cross-check against the nginx configuration for the matching location
block:
kubectl -n ilum exec deploy/ilum-ui -c ilum-ui -- \
grep -A5 'location /external/' /etc/nginx/conf.d/server.conf
Recovery
Ensure an in-cluster provider is enabled and either rely on the resolved
default or override objectStorage.providers.<provider>.consolePath
explicitly. Then restart the Ilum UI to pick up the new ConfigMap:
helm upgrade ilum ilum/helm_aio -n ilum --reuse-values \
--set <provider>.enabled=true
kubectl -n ilum rollout restart deploy/ilum-ui
helm template fails with "3 providers enabled"
Symptom
A helm install or helm upgrade fails at render time with a message
similar to:
Error: ... objectStorage: 3 providers enabled ([minio rustfs seaweedfs]);
set objectStorage.activeProvider=<name> to pick which one user traffic
routes through
Likely cause
More than two providers are enabled simultaneously, and
objectStorage.activeProvider is left at auto. The chart refuses to
guess.
Recovery
Set the active provider explicitly:
helm upgrade ilum ilum/helm_aio -n ilum --reuse-values \
--set objectStorage.activeProvider=<provider>
Alternatively, disable the providers that are not relevant to user
traffic by setting their enabled flags to false.
Alias has no endpoints despite a running provider
Symptom
A provider pod is running and ready, but kubectl get endpoints ilum-objectstorage shows <none>.
Likely cause
The pod's labels do not match the alias Service selector. The selector
requires both app.kubernetes.io/name: <provider> and
app.kubernetes.io/instance: <release>.
Diagnosis
kubectl -n ilum get pod -l app.kubernetes.io/name=<provider> \
-o jsonpath='{.items[*].metadata.labels}'
kubectl -n ilum get svc ilum-objectstorage -o jsonpath='{.spec.selector}'
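The two outputs above can also be compared mechanically by asking the API server which pods the alias selector matches. The following is a minimal sketch, not part of the chart: selector_to_labels is a hypothetical helper name, and it assumes a kubectl version that prints the selector map as JSON (older versions print map[...] instead).

```shell
# Convert a Service selector, printed as JSON by
#   kubectl get svc ... -o jsonpath='{.spec.selector}'
# into the comma-separated key=value form that `kubectl get pods -l` accepts.
# Label keys and values cannot contain braces, quotes, commas, or colons,
# so plain text substitution is sufficient for a flat selector map.
selector_to_labels() {
  tr -d '{}" \n' | sed 's/:/=/g'
}

# Example (requires cluster access, so it is left commented out):
# sel=$(kubectl -n ilum get svc ilum-objectstorage -o jsonpath='{.spec.selector}')
# kubectl -n ilum get pods -l "$(printf '%s' "$sel" | selector_to_labels)"
```

An empty pod list from the final command means that no pod carries every label the selector requires.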
Recovery
For pods deployed by a chart, ensure the chart sets both required
labels. For hand-rolled Deployments (such as those created by the
Add a New Provider procedure), patch the pod
template to include the missing labels and re-roll the Deployment.
Stuck pending-upgrade after a failed helm upgrade --wait
Symptom
helm history ilum shows a revision in pending-upgrade state. Every
subsequent helm upgrade fails immediately with a message similar to:
Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress
Likely cause
A previous helm upgrade --wait was interrupted (network drop, laptop
crash, Ctrl-C). The release Secret recording the in-flight upgrade
was never finalized.
Recovery
Delete the stuck release Secret and retry:
kubectl -n ilum get secret -l owner=helm,name=ilum
kubectl -n ilum delete secret sh.helm.release.v1.ilum.v<revision>
helm upgrade ilum ilum/helm_aio -n ilum --reuse-values
The revision number is the highest one listed by helm history ilum
that is in pending-upgrade state.
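Picking the revision by eye is error-prone on a long history. As a sketch only, the helper below extracts the highest pending-upgrade revision from the default table output of helm history; stuck_revision is a hypothetical name, and the parsing assumes the revision number is the first column and the status appears as a standalone word.

```shell
# Read `helm history` table output on stdin and print the highest revision
# whose row contains the status pending-upgrade. Prints nothing when no
# revision is stuck.
stuck_revision() {
  awk '
    {
      for (i = 2; i <= NF; i++) {
        if ($i == "pending-upgrade" && $1 + 0 > max) { max = $1 + 0 }
      }
    }
    END { if (max > 0) print max }
  '
}

# Example (requires cluster access, so it is left commented out):
# rev=$(helm history ilum -n ilum | stuck_revision)
# [ -n "$rev" ] && kubectl -n ilum delete secret "sh.helm.release.v1.ilum.v${rev}"
```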
Cutover acknowledged but the alias still targets the old provider
Symptom
objectStorage.cutoverAcknowledged=true is set (or its legacy alias
rustfs.migrationAcknowledged=true), but the alias annotation still
shows the previous provider.
Likely cause
Either the Ilum UI's ConfigMap was not regenerated (the
rollme: <random> annotation that forces an ilum-ui rollout did not
change), or the operator did not run helm upgrade after flipping the
flag.
Recovery
Re-run helm upgrade and force a UI rollout:
helm upgrade ilum ilum/helm_aio -n ilum --reuse-values \
--set objectStorage.cutoverAcknowledged=true
kubectl -n ilum rollout restart deploy/ilum-ui
Verify by inspecting the alias annotation:
kubectl -n ilum get svc ilum-objectstorage \
-o jsonpath='{.metadata.annotations.ilum\.cloud/object-storage-active-provider}{"\n"}'
Bucket-init Job stays Pending or fails
Symptom
After helm install or helm upgrade, the init-rustfs-buckets or
init-minio-policies Job does not reach Complete. helm install --wait
times out, or the bundled consumers report missing buckets at startup.
Likely cause
One of the following:
- The ilum-objectstorage-credentials Secret is missing or has empty values for access-key / secret-key.
- The provider's Service is reachable on cluster DNS but the provider pod is not yet Ready; the init Job's wait-for-<provider> init container is still looping.
- The provider rejected the credentials (the bundled image baked in a different default than the live Secret).
Diagnosis
kubectl -n ilum logs job/init-rustfs-buckets -c wait-for-rustfs --tail=50
kubectl -n ilum logs job/init-rustfs-buckets --tail=200
kubectl -n ilum get secret ilum-objectstorage-credentials \
-o jsonpath='{.data.access-key}' | base64 -d; echo
Recovery
Populate the credentials Secret with all six aliased keys
(access-key, secret-key, root-user, root-password,
RUSTFS_ACCESS_KEY, RUSTFS_SECRET_KEY) and re-run the upgrade.
The init Job is idempotent; it can be retried by deleting and
re-applying via helm upgrade:
kubectl -n ilum delete job init-rustfs-buckets || true
helm upgrade ilum ilum/helm_aio -n ilum --reuse-values
Credentials lookup error on helm upgrade
Symptom
helm upgrade fails at render time with a message similar to:
Error: ... values don't meet the specifications of the schema(s) ...
... ilum-objectstorage-credentials lookup is missing required keys ...
Likely cause
The chart resolves credentials in this order: live Secret values via
lookup (when objectStorage.credentials.preserveExisting=true), then
the literal defaults in values.yaml. When the live Secret exists
but is missing one of the six aliased keys, the lookup returns an
incomplete dictionary and the template fails the schema check.
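To find out which key is absent before choosing a recovery option, the key names in the live Secret can be compared against the six aliased keys. This is a minimal sketch under stated assumptions: missing_keys and REQUIRED_KEYS are hypothetical names, the key list is copied from this page, and the check covers only the presence of a key, not whether its value is empty.

```shell
# The six aliased keys named on this page.
REQUIRED_KEYS="access-key secret-key root-user root-password RUSTFS_ACCESS_KEY RUSTFS_SECRET_KEY"

# Read the key names present in the Secret (one per line) on stdin and
# print every required key that is absent. Prints nothing when all six
# are present.
missing_keys() {
  present=$(cat)
  for key in $REQUIRED_KEYS; do
    printf '%s\n' "$present" | grep -qxF -- "$key" || printf '%s\n' "$key"
  done
}

# Example (requires cluster access, so it is left commented out):
# kubectl -n ilum get secret ilum-objectstorage-credentials \
#   -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}' | missing_keys
```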
Recovery
Either re-create the Secret with all six aliased keys, or disable the
lookup and let the chart re-render the defaults:
# Option A: repopulate the Secret.
kubectl -n ilum delete secret ilum-objectstorage-credentials
helm upgrade ilum ilum/helm_aio -n ilum --reuse-values
# Option B: force deterministic render (loses any rotated credentials).
helm upgrade ilum ilum/helm_aio -n ilum --reuse-values \
--set objectStorage.credentials.preserveExisting=false
PVC bound to wrong StorageClass
Symptom
The provider's StatefulSet or Deployment stays Pending. The pod's
events log a message similar to:
0/3 nodes are available: pod has unbound immediate PersistentVolumeClaims
Likely cause
The chart-default storageClassName resolves to a class that does not
match a CSI driver available on the cluster. This is common when moving
the chart between cloud providers without overriding the storage class.
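Whether the class requested by a Pending claim exists on the cluster can be confirmed with a short check. The sketch below is illustrative only: class_exists is a hypothetical helper name and <pvc-name> is a placeholder. A claim with an empty storageClassName relies on the cluster default class and is not covered by this check.

```shell
# Succeed when the StorageClass named in $1 appears in the output of
# `kubectl get storageclass -o name` supplied on stdin.
class_exists() {
  grep -qxF -- "storageclass.storage.k8s.io/$1"
}

# Example (requires cluster access, so it is left commented out):
# want=$(kubectl -n ilum get pvc <pvc-name> -o jsonpath='{.spec.storageClassName}')
# kubectl get storageclass -o name | class_exists "$want" \
#   || echo "StorageClass '$want' does not exist on this cluster"
```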
Recovery
Deleting an existing PersistentVolumeClaim also deletes the underlying
volume when the reclaim policy is Delete, which is the default for
dynamically provisioned volumes. Use this recipe on net-new installs only.
Set the correct storage class and re-roll the PVCs:
kubectl get storageclass
kubectl -n ilum delete pvc -l app.kubernetes.io/name=rustfs
helm upgrade ilum ilum/helm_aio -n ilum --reuse-values \
--set rustfs.persistence.storageClass=<cluster-storage-class>
For pre-existing data, snapshot the source PVC and restore against the correct storage class before deletion. See Back Up and Restore Object Storage.
Post-cutover consumer still writes to the previous provider
Symptom
objectStorage.cutoverAcknowledged=true is set and mc diff confirms
data parity, but one or more bundled consumers continue writing into
the old provider's bucket.
Likely cause
The consumer cached its S3 endpoint at startup and has not refreshed
since the cutover. The ilum-objectstorage Service alias re-targets
the new provider instantly, but consumers that resolve the alias once
on Pod startup do not pick up the change until they restart.
The Ilum UI rolls automatically when the helm upgrade regenerates
the ilum-ui ConfigMap. Other consumers do not.
Recovery
Restart every consumer that targets the alias:
kubectl -n ilum rollout restart \
deploy/ilum-core \
deploy/ilum-jupyter \
deploy/ilum-mlflow \
deploy/ilum-kestra \
deploy/ilum-langfuse-web \
statefulset/ilum-hive-metastore
Long-running Spark driver Pods are unaffected: each Spark job creates its own S3 client and resolves the alias afresh.
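To confirm that every restarted consumer has finished rolling, each workload can be checked with kubectl rollout status. The following sketch generates one such command per consumer; status_commands and CONSUMERS are hypothetical names, the workload list is copied from the restart command in this section, and the timeout value is an arbitrary choice.

```shell
# Workloads taken from the restart command in this section.
CONSUMERS="deploy/ilum-core deploy/ilum-jupyter deploy/ilum-mlflow deploy/ilum-kestra deploy/ilum-langfuse-web statefulset/ilum-hive-metastore"

# Print one `rollout status` command per consumer for namespace $1 with
# timeout $2. The commands are printed rather than executed so that the
# helper can be dry-run safely; pipe the output to `sh` to run them.
status_commands() {
  ns=$1
  timeout=$2
  for w in $CONSUMERS; do
    printf 'kubectl -n %s rollout status %s --timeout=%s\n' "$ns" "$w" "$timeout"
  done
}

# Example (requires cluster access, so it is left commented out):
# status_commands ilum 120s | sh
```

Workloads that are not installed in a given deployment will report an error; remove them from the list as needed.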
Reference
- Object Storage Overview for the alias model.
- Migrate Between Providers for the data migration playbook.
- Add a New Provider for plugging in new backends.
- Back Up and Restore Object Storage for data protection recipes.
- Object Storage Helm Values for the value reference.