Back Up and Restore Object Storage
Overview
Ilum does not run automated backups of object storage by default.
The bundled providers persist their data on PersistentVolumeClaims
managed by the underlying CSI driver, and the chart preserves those
PVCs across helm upgrade and helm rollback. Disaster recovery
beyond PVC retention is operator-driven.
This page describes three layers of data protection, ordered from infrastructure-level to application-level:
- PV snapshots via the Kubernetes VolumeSnapshot API. Point-in-time copies of the underlying volume; CSI-driver-dependent.
- Off-cluster mc mirror copies to an external S3 backend. Logical-object-level mirrors that survive cluster loss.
- Application-level table snapshots in Iceberg, Delta, and DuckLake. Time-travel semantics inside the table format; no infrastructure involvement.
The recipes below cover the active provider's bucket data. For recovering from misconfiguration without data loss, refer to Troubleshoot Object Storage.
Backup layers compared
| Layer | RPO | RTO | Coverage | Survives cluster loss? |
|---|---|---|---|---|
| PV snapshot | Snapshot interval (typically hourly) | Minutes (restore + provider restart) | All buckets, including metadata indices | No (snapshot lives on the same storage backend) |
| Off-cluster mc mirror | Mirror interval (typically hourly) | Minutes (re-mirror to new cluster) | All buckets at the S3 layer | Yes |
| Iceberg / Delta snapshot | Per-commit | Seconds (VERSION AS OF) | One table at a time | Only if the table's underlying objects survive |
For production deployments, combine an off-cluster mc mirror job for
disaster recovery with the table-format snapshots that Iceberg, Delta,
and DuckLake already provide.
Layer 1: PV snapshots via the VolumeSnapshot API
The Kubernetes VolumeSnapshot API has been generally available since
Kubernetes 1.20 (December 2020). Snapshot support is provided by the
CSI driver and must be advertised by the driver itself; not every CSI
driver implements snapshotting. For the upstream reference, see
Volume Snapshots.
Verify CSI snapshot support
# CSIDriver objects do not expose snapshot capability; check the snapshot CRDs and classes.
kubectl get csidriver
kubectl api-resources --api-group=snapshot.storage.k8s.io
kubectl get volumesnapshotclass
If volumesnapshotclass returns nothing, install the
external-snapshotter controller and a VolumeSnapshotClass for the
cluster's CSI driver before proceeding.
Create a VolumeSnapshotClass (one-time)
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ilum-objectstorage-snapshots
driver: <your-csi-driver>
deletionPolicy: Retain
deletionPolicy: Retain keeps the VolumeSnapshotContent object and the
underlying storage snapshot when the VolumeSnapshot resource is
deleted. Delete is appropriate when snapshots should be removed
automatically with their parent resource.
Snapshot the active provider's PVC
# RustFS: the chart names the PVC after the StatefulSet.
PVC=$(kubectl -n ilum get pvc -l app.kubernetes.io/name=rustfs \
  -o jsonpath='{.items[0].metadata.name}')
cat <<EOF | kubectl -n ilum apply -f -
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: ilum-objectstorage-$(date +%Y%m%d-%H%M%S)
spec:
  volumeSnapshotClassName: ilum-objectstorage-snapshots
  source:
    persistentVolumeClaimName: $PVC
EOF
kubectl -n ilum get volumesnapshot
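A snapshot can only serve as a restore source once the CSI driver reports it ready. The following is a minimal sketch of waiting for that state, assuming kubectl 1.23 or later (for `--for=jsonpath`) and the `ilum` namespace used above. The function name `wait_for_snapshot` and the 300s default timeout are hypothetical.

```shell
# Sketch: block until a VolumeSnapshot reports status.readyToUse=true.
# Assumption: kubectl >= 1.23; older clients do not support --for=jsonpath.
wait_for_snapshot() {
  snap_name="$1"          # e.g. the name printed by "kubectl get volumesnapshot"
  timeout="${2:-300s}"    # hypothetical default; tune for the CSI driver in use
  kubectl -n ilum wait "volumesnapshot/${snap_name}" \
    --for=jsonpath='{.status.readyToUse}'=true \
    --timeout="${timeout}"
}
```

Usage: `wait_for_snapshot ilum-objectstorage-<timestamp> 600s`. A non-zero exit means the snapshot did not become ready within the timeout; inspect it with `kubectl -n ilum describe volumesnapshot` before relying on it.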
Restore from a snapshot
Create a new PersistentVolumeClaim whose dataSource references the
VolumeSnapshot:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rustfs-restore
  namespace: ilum
spec:
  storageClassName: <your-storage-class>
  dataSource:
    name: ilum-objectstorage-<timestamp>
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: <same-as-source>
To swap the restored PVC into the running provider, scale the
provider's StatefulSet to zero, repoint the active PVC at the
restored volume, and scale back. The exact procedure depends on the
CSI driver's reconciliation behavior; refer to the driver's
documentation.
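The scale-down and scale-up halves of that procedure can be sketched as two helpers. This is an illustration under stated assumptions, not the chart's documented procedure: the StatefulSet name is passed as an argument because it is not stated on this page, the function names and the 300s timeouts are hypothetical, and the driver-specific repointing step between the two calls is deliberately left out.

```shell
# Sketch: stop the provider and wait until no replicas remain.
# Assumption: kubectl >= 1.23 for --for=jsonpath.
stop_provider() {
  sts="$1"   # StatefulSet name (hypothetical argument; check your release)
  kubectl -n ilum scale "statefulset/${sts}" --replicas=0 &&
  kubectl -n ilum wait "statefulset/${sts}" \
    --for=jsonpath='{.status.replicas}'=0 --timeout=300s
}

# Sketch: bring the provider back and wait for the rollout to finish.
start_provider() {
  sts="$1"
  replicas="${2:-1}"   # hypothetical default; use the original replica count
  kubectl -n ilum scale "statefulset/${sts}" --replicas="${replicas}" &&
  kubectl -n ilum rollout status "statefulset/${sts}" --timeout=300s
}
```

A typical sequence would be `stop_provider <statefulset>`, then the repointing step described in the CSI driver's documentation, then `start_provider <statefulset> <replicas>`.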
Limitations
- VolumeSnapshot is CSI-driver-dependent. Not every cloud provider implements snapshot support in their CSI driver, and on-prem drivers vary in maturity. Verify before relying on this layer.
- Snapshots typically live on the same underlying storage backend. A failure of the backend (region outage, hardware loss) takes the snapshots with it. Layer this with off-cluster mirrors for true DR.
- ReadWriteOnce PVCs (the bundled-provider default) can be snapshotted without quiescing the provider, but the resulting snapshot is crash-consistent rather than application-consistent. For application-consistent snapshots, quiesce writes through the alias before triggering the snapshot.
Layer 2: Off-cluster mc mirror to external S3
mc mirror from the active provider to an external S3 backend produces
a logical-object-level copy that survives full cluster loss. The
typical pattern is a CronJob that mirrors every default bucket once
per hour.
Provision an external destination
Provision an external S3 backend (AWS S3, Wasabi, Backblaze B2, or any
S3-compatible service) with a bucket per source bucket. The destination
bucket names should match objectStorage.defaultBuckets. For
provider-specific endpoint shapes, refer to
Provider Reference: External S3.
Store the destination credentials in a separate Secret so they do
not conflict with ilum-objectstorage-credentials:
kubectl -n ilum create secret generic ilum-backup-credentials \
--from-literal=access-key=<external-access-key> \
--from-literal=secret-key=<external-secret-key>
Run the mirror as a CronJob
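apiVersion: batch/v1
kind: CronJob
metadata:
  name: ilum-objectstorage-backup
  namespace: ilum
spec:
  schedule: "0 * * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      backoffLimit: 1
      ttlSecondsAfterFinished: 86400
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: mc
              image: minio/mc:RELEASE.2025-04-16T18-13-26Z
              # Map each Secret key to a distinct, shell-safe variable name.
              # The key names in ilum-objectstorage-credentials are assumed to
              # be access-key and secret-key; adjust if the chart differs.
              env:
                - name: SRC_ACCESS_KEY
                  valueFrom:
                    secretKeyRef: {name: ilum-objectstorage-credentials, key: access-key}
                - name: SRC_SECRET_KEY
                  valueFrom:
                    secretKeyRef: {name: ilum-objectstorage-credentials, key: secret-key}
                - name: DST_ACCESS_KEY
                  valueFrom:
                    secretKeyRef: {name: ilum-backup-credentials, key: access-key}
                - name: DST_SECRET_KEY
                  valueFrom:
                    secretKeyRef: {name: ilum-backup-credentials, key: secret-key}
              command: [sh, -c]
              args:
                - |
                  set -eu
                  mc alias set src http://ilum-objectstorage:9000 "$SRC_ACCESS_KEY" "$SRC_SECRET_KEY"
                  mc alias set dst https://<external-endpoint> "$DST_ACCESS_KEY" "$DST_SECRET_KEY"
                  for bucket in ilum-files ilum-data ilum-tables ilum-mlflow ilum-kestra ilum-ducklake ilum-langfuse; do
                    mc mb --ignore-existing "dst/$bucket"
                    mc mirror --preserve --remove "src/$bucket" "dst/$bucket"
                  done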
The --remove flag mirrors deletions from source to destination. Omit
it when an append-only archive is preferred. The bundled minio/mc
image tag is pinned to the same release used by the in-cluster
migration Job, ensuring behavior parity.
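Before relying on the hourly schedule, it can help to run the backup once by hand and read its output. The sketch below creates a one-off Job from the CronJob above; the function name `run_backup_now`, the `-manual-` name suffix, and the 1800s default timeout are hypothetical.

```shell
# Sketch: trigger the backup CronJob once, wait for it, and print its logs.
run_backup_now() {
  timeout="${1:-1800s}"   # hypothetical default; size to the data volume
  job_name="ilum-objectstorage-backup-manual-$(date +%s)"
  # Create a Job from the CronJob's jobTemplate.
  kubectl -n ilum create job "${job_name}" \
    --from=cronjob/ilum-objectstorage-backup &&
  # Waits for the Complete condition only; a failed Job runs into the timeout.
  kubectl -n ilum wait "job/${job_name}" \
    --for=condition=complete --timeout="${timeout}" &&
  kubectl -n ilum logs "job/${job_name}"
}
```

If the wait times out, `kubectl -n ilum describe job <job-name>` and the pod logs usually show whether credentials or the endpoint were at fault.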
Restore from the external mirror
The restore procedure is mc mirror in reverse:
- Stand up a clean Ilum install with the active provider enabled but the bucket-init Job disabled (to avoid overwriting the restored objects).
- Configure the same external destination as the source alias (src) and the in-cluster provider as the destination alias (dst).
- Mirror back to the in-cluster provider:
  for bucket in ilum-files ilum-data ilum-tables ilum-mlflow ilum-kestra ilum-ducklake ilum-langfuse; do
    mc mirror --preserve "src/$bucket" "dst/$bucket"
  done
- Re-enable the bundled consumers. The shared Secret and the ilum-objectstorage alias point them at the restored data automatically.
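After mirroring back, it is worth confirming that source and destination agree before re-enabling consumers. A minimal sketch using `mc diff`, assuming the `src` and `dst` aliases from the restore steps are already configured and that `mc diff` prints nothing when two locations match; the function name `verify_mirror` is hypothetical.

```shell
# Sketch: report buckets whose contents differ between the src and dst aliases.
# Returns 0 when every bucket matches, 1 otherwise.
verify_mirror() {
  status=0
  for bucket in "$@"; do
    # mc diff lists objects that are missing or differ between the two sides.
    diff_out=$(mc diff "src/${bucket}" "dst/${bucket}")
    if [ -n "${diff_out}" ]; then
      echo "Differences in ${bucket}:"
      echo "${diff_out}"
      status=1
    fi
  done
  return "${status}"
}
```

Usage: `verify_mirror ilum-files ilum-data ilum-tables`. A non-zero exit signals that at least one bucket still differs and the mirror step should be re-run or investigated.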
Layer 3: Application-level table snapshots
For Iceberg and Delta tables managed by Ilum, the table format itself provides point-in-time snapshots through its commit history. These cover one table at a time, not full buckets, but offer fine-grained recovery without infrastructure involvement.
Iceberg
-- List snapshots.
SELECT snapshot_id, committed_at, operation
FROM iceberg.<catalog>.<table>.snapshots
ORDER BY committed_at DESC;
-- Time-travel read.
SELECT *
FROM iceberg.<catalog>.<table>
VERSION AS OF <snapshot_id>;
-- Roll the table back to a snapshot.
CALL iceberg.system.rollback_to_snapshot('<catalog>.<table>', <snapshot_id>);
Delta Lake
DESCRIBE HISTORY <catalog>.<table>;
SELECT * FROM <catalog>.<table> VERSION AS OF <version>;
RESTORE TABLE <catalog>.<table> TO VERSION AS OF <version>;
DuckLake
DuckLake snapshots are recorded in the DuckLake catalog. Refer to the DuckLake documentation for the time-travel and rollback syntax that matches the catalog version in use.
The retention window of these snapshots is governed by the table
format's expiration policy (Iceberg's expire_snapshots procedure,
Delta's VACUUM). Tune the retention to match the operator's recovery
objectives before relying on this layer for DR.
What is not backed up by default
- PVC snapshots are not taken automatically. The chart does not create VolumeSnapshot resources; the operator must schedule them.
- Off-cluster mirrors are not configured by default. The CronJob recipe above is operator-installed.
- Bucket policies, lifecycle rules, and IAM users that the operator configures directly against the provider are not part of any layer above. Capture them separately (typically with a GitOps pipeline).
- Hydra OIDC client registrations and Kubernetes Secrets are not covered by object-storage backups. Use cluster-wide tooling (Velero, Kasten K10, similar) for those.
Reference
- Kubernetes VolumeSnapshot reference: kubernetes.io/docs/concepts/storage/volume-snapshots/
- external-snapshotter controller: github.com/kubernetes-csi/external-snapshotter
- mc client reference: min.io/docs/minio/linux/reference/minio-mc.html
- Iceberg maintenance procedures: iceberg.apache.org/docs/latest/maintenance/
- Delta Lake time travel: docs.delta.io/latest/quick-start.html#read-older-versions-of-data-using-time-travel
- Migration playbook: Migrate Between Providers
- Provider reference: External S3