
Back Up and Restore Object Storage

Overview

Ilum does not run automated backups of object storage by default. The bundled providers persist their data on PersistentVolumeClaims managed by the underlying CSI driver, and the chart preserves those PVCs across helm upgrade and helm rollback. Disaster recovery beyond PVC retention is operator-driven.

This page describes three layers of data protection, ordered from infrastructure-level to application-level:

  • PV snapshots via the Kubernetes VolumeSnapshot API. Point-in-time copies of the underlying volume; CSI-driver-dependent.
  • Off-cluster mc mirror copies to an external S3 backend. Logical-object-level mirrors that survive cluster loss.
  • Application-level table snapshots in Iceberg, Delta, and DuckLake. Time-travel semantics inside the table format; no infrastructure involvement.

The recipes below cover the active provider's bucket data. For recovering from misconfiguration without data loss, refer to Troubleshoot Object Storage.

Backup layers compared

| Layer | RPO | RTO | Coverage | Survives cluster loss? |
|---|---|---|---|---|
| PV snapshot | Snapshot interval (typically hourly) | Minutes (restore + provider restart) | All buckets, including metadata indices | No (snapshot lives on the same storage backend) |
| Off-cluster mc mirror | Mirror interval (typically hourly) | Minutes (re-mirror to new cluster) | All buckets at the S3 layer | Yes |
| Iceberg / Delta snapshot | Per-commit | Seconds (VERSION AS OF) | One table at a time | Only if the table's underlying objects survive |

For production deployments, combine an off-cluster mc mirror job for disaster recovery with the table-format snapshots that Iceberg, Delta, and DuckLake already provide.

Layer 1: PV snapshots via the VolumeSnapshot API

The Kubernetes VolumeSnapshot API has been generally available since Kubernetes 1.20 (December 2020). Snapshot support is provided by the CSI driver and must be advertised by the driver itself; not every CSI driver implements snapshotting. For the upstream reference, see Volume Snapshots.

Verify CSI snapshot support

# List installed CSI drivers. CSIDriver objects do not report snapshot capability themselves.
kubectl get csidriver
kubectl api-resources --api-group=snapshot.storage.k8s.io
kubectl get volumesnapshotclass

If volumesnapshotclass returns nothing, install the external-snapshotter controller and a VolumeSnapshotClass for the cluster's CSI driver before proceeding.

Create a VolumeSnapshotClass (one-time)

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ilum-objectstorage-snapshots
driver: <your-csi-driver>
deletionPolicy: Retain

deletionPolicy: Retain ensures snapshots survive the deletion of the VolumeSnapshot resource. Delete is appropriate when snapshots should be removed automatically with their parent resource.

Snapshot the active provider's PVC

# RustFS: the chart names the PVC after the StatefulSet.
PVC=$(kubectl -n ilum get pvc -l app.kubernetes.io/name=rustfs \
  -o jsonpath='{.items[0].metadata.name}')

cat <<EOF | kubectl -n ilum apply -f -
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: ilum-objectstorage-$(date +%Y%m%d-%H%M%S)
spec:
  volumeSnapshotClassName: ilum-objectstorage-snapshots
  source:
    persistentVolumeClaimName: $PVC
EOF

kubectl -n ilum get volumesnapshot
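
A snapshot is only usable as a restore source once the CSI driver reports it ready. The following is a minimal sketch of a polling helper, assuming the snapshot exposes the standard status.readyToUse field; the function name wait_snapshot_ready and its default poll count and delay are illustrative and not part of the chart.

```shell
#!/bin/sh
# Sketch: poll a VolumeSnapshot until .status.readyToUse reports "true".
# wait_snapshot_ready is a hypothetical helper name; defaults are arbitrary.
wait_snapshot_ready() {
  snap="$1"          # VolumeSnapshot name
  ns="${2:-ilum}"    # namespace
  tries="${3:-30}"   # number of polls before giving up
  delay="${4:-10}"   # seconds between polls
  i=0
  while [ "$i" -lt "$tries" ]; do
    # An empty result (field not yet populated) is treated as "not ready".
    ready=$(kubectl -n "$ns" get volumesnapshot "$snap" \
      -o jsonpath='{.status.readyToUse}' 2>/dev/null || true)
    if [ "$ready" = "true" ]; then
      echo "snapshot $snap is ready"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "snapshot $snap not ready after $tries polls" >&2
  return 1
}

# Example call (commented out so that sourcing this file has no side effects):
# wait_snapshot_ready ilum-objectstorage-<timestamp> ilum
```

Recent kubectl releases can express the same wait with kubectl wait --for=jsonpath; the explicit loop is shown because it also works on older clients.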

Restore from a snapshot

Create a new PersistentVolumeClaim whose dataSource references the VolumeSnapshot:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rustfs-restore
  namespace: ilum
spec:
  storageClassName: <your-storage-class>
  dataSource:
    name: ilum-objectstorage-<timestamp>
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: <same-as-source>

To swap the restored PVC into the running provider, scale the provider's StatefulSet to zero, repoint the active PVC at the restored volume, and scale back. The exact procedure depends on the CSI driver's reconciliation behavior; refer to the driver's documentation.

Limitations

  • VolumeSnapshot is CSI-driver-dependent. Not every cloud provider implements snapshot support in their CSI driver, and on-prem drivers vary in maturity. Verify before relying on this layer.
  • Snapshots typically live on the same underlying storage backend. A failure of the backend (region outage, hardware loss) takes the snapshots with it. Layer this with off-cluster mirrors for true DR.
  • ReadWriteOnce PVCs (the bundled-provider default) can be snapshotted without quiescing the provider, but the resulting snapshot is crash-consistent rather than application-consistent. For application-consistent snapshots, quiesce writes through the alias before triggering the snapshot.

Layer 2: Off-cluster mc mirror to external S3

mc mirror from the active provider to an external S3 backend produces a logical-object-level copy that survives full cluster loss. The typical pattern is a CronJob that mirrors every default bucket once per hour.

Provision an external destination

Provision an external S3 backend (AWS S3, Wasabi, Backblaze B2, or any S3-compatible service) with a bucket per source bucket. The destination bucket names should match objectStorage.defaultBuckets. For provider-specific endpoint shapes, refer to Provider Reference: External S3.

Store the destination credentials in a separate Secret so they do not conflict with ilum-objectstorage-credentials:

kubectl -n ilum create secret generic ilum-backup-credentials \
--from-literal=access-key=<external-access-key> \
--from-literal=secret-key=<external-secret-key>

Run the mirror as a CronJob

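apiVersion: batch/v1
kind: CronJob
metadata:
  name: ilum-objectstorage-backup
  namespace: ilum
spec:
  schedule: "0 * * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      backoffLimit: 1
      ttlSecondsAfterFinished: 86400
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: mc
              image: minio/mc:RELEASE.2025-04-16T18-13-26Z
              # Map each Secret key to a distinct, shell-safe variable name.
              # Hyphenated keys such as access-key cannot be referenced as shell
              # variables, and loading both Secrets with envFrom would collide.
              # The key names of ilum-objectstorage-credentials are assumed to
              # match the backup Secret; adjust them if they differ.
              env:
                - name: SRC_ACCESS_KEY
                  valueFrom: {secretKeyRef: {name: ilum-objectstorage-credentials, key: access-key}}
                - name: SRC_SECRET_KEY
                  valueFrom: {secretKeyRef: {name: ilum-objectstorage-credentials, key: secret-key}}
                - name: DST_ACCESS_KEY
                  valueFrom: {secretKeyRef: {name: ilum-backup-credentials, key: access-key}}
                - name: DST_SECRET_KEY
                  valueFrom: {secretKeyRef: {name: ilum-backup-credentials, key: secret-key}}
              command: [sh, -c]
              args:
                - |
                  set -eu
                  mc alias set src http://ilum-objectstorage:9000 "$SRC_ACCESS_KEY" "$SRC_SECRET_KEY"
                  mc alias set dst https://<external-endpoint> "$DST_ACCESS_KEY" "$DST_SECRET_KEY"
                  for bucket in ilum-files ilum-data ilum-tables ilum-mlflow ilum-kestra ilum-ducklake ilum-langfuse; do
                    mc mb --ignore-existing "dst/$bucket"
                    mc mirror --preserve --remove "src/$bucket" "dst/$bucket"
                  done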

The --remove flag mirrors deletions from source to destination. Omit it when an append-only archive is preferred. The bundled minio/mc image tag is pinned to the same release used by the in-cluster migration Job, ensuring behavior parity.
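
A mirror run that exits cleanly does not by itself show that source and destination match. The sketch below compares each bucket pair with mc diff, assuming the src and dst aliases are already configured and that mc diff prints one line per differing object and nothing when the two sides are identical. The function name verify_mirror is illustrative.

```shell
#!/bin/sh
# Sketch: report per-bucket drift between the src and dst aliases.
# verify_mirror is a hypothetical helper name. It assumes `mc diff`
# prints nothing when both sides hold the same objects.
verify_mirror() {
  status=0
  for bucket in "$@"; do
    diff_out=$(mc diff "src/$bucket" "dst/$bucket") || {
      echo "ERROR $bucket: mc diff failed" >&2
      status=1
      continue
    }
    if [ -n "$diff_out" ]; then
      echo "DRIFT $bucket"
      status=1
    else
      echo "OK $bucket"
    fi
  done
  return "$status"
}

# Example call (commented out so that sourcing this file has no side effects):
# verify_mirror ilum-files ilum-data ilum-tables
```

Objects written between the mirror run and the check show up as drift, so a non-zero result on a busy bucket is a prompt to look closer rather than proof of a failed backup.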

Restore from the external mirror

The restore procedure is mc mirror in reverse:

  1. Stand up a clean Ilum install with the active provider enabled but the bucket-init Job disabled (to avoid overwriting the restored objects).

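  2. Configure the mc aliases in the opposite direction: src now points at the external backend that received the backups, and dst points at the in-cluster provider.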

  3. Mirror back to the in-cluster provider:

    for bucket in ilum-files ilum-data ilum-tables ilum-mlflow ilum-kestra ilum-ducklake ilum-langfuse; do
      mc mirror --preserve "src/$bucket" "dst/$bucket"
    done

  4. Re-enable the bundled consumers. The shared Secret and the ilum-objectstorage alias point them at the restored data automatically.
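
The restore steps above can be sketched as a single helper. This is a minimal sketch rather than a tested runbook: the function name restore_from_mirror and the environment variable names (EXT_ENDPOINT, EXT_ACCESS_KEY, EXT_SECRET_KEY, ILUM_ACCESS_KEY, ILUM_SECRET_KEY) are hypothetical, and the mc mb call is an assumption added because the bucket-init Job is disabled during a restore.

```shell
#!/bin/sh
# Sketch: mirror buckets from the external backend back into the cluster.
# restore_from_mirror and all *_KEY / EXT_ENDPOINT variables are hypothetical
# names; export them before calling the function.
restore_from_mirror() {
  # Restore direction: src is the external backend, dst is the in-cluster provider.
  mc alias set src "$EXT_ENDPOINT" "$EXT_ACCESS_KEY" "$EXT_SECRET_KEY" || return 1
  mc alias set dst http://ilum-objectstorage:9000 "$ILUM_ACCESS_KEY" "$ILUM_SECRET_KEY" || return 1
  for bucket in "$@"; do
    # Create the bucket if the disabled bucket-init Job has not done so.
    mc mb --ignore-existing "dst/$bucket" || return 1
    # No --remove: a restore should only add objects to the target.
    mc mirror --preserve "src/$bucket" "dst/$bucket" || return 1
  done
}

# Example call (commented out so that sourcing this file has no side effects):
# restore_from_mirror ilum-files ilum-data ilum-tables
```

Stopping at the first failed bucket keeps a partial restore visible instead of letting later buckets mask the error.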

Layer 3: Application-level table snapshots

For Iceberg, Delta, and DuckLake tables managed by Ilum, the table format itself provides point-in-time snapshots through its commit history. These cover one table at a time, not full buckets, but offer fine-grained recovery without infrastructure involvement.

Iceberg

-- List snapshots.
SELECT snapshot_id, committed_at, operation
FROM iceberg.<catalog>.<table>.snapshots
ORDER BY committed_at DESC;

-- Time-travel read.
SELECT *
FROM iceberg.<catalog>.<table>
VERSION AS OF <snapshot_id>;

-- Roll the table back to a snapshot.
CALL iceberg.system.rollback_to_snapshot('<catalog>.<table>', <snapshot_id>);

Delta Lake

DESCRIBE HISTORY <catalog>.<table>;
SELECT * FROM <catalog>.<table> VERSION AS OF <version>;
RESTORE TABLE <catalog>.<table> TO VERSION AS OF <version>;

DuckLake

DuckLake snapshots are recorded in the DuckLake catalog. Refer to the DuckLake documentation for the time-travel and rollback syntax that matches the catalog version in use.

The retention window of these snapshots is governed by the table format's expiration policy (Iceberg's expire_snapshots procedure, Delta's VACUUM). Tune the retention to match the operator's recovery objectives before relying on this layer for DR.

What is not backed up by default

  • PVC snapshots are not taken automatically. The chart does not create VolumeSnapshot resources; the operator must schedule them.
  • Off-cluster mirrors are not configured by default. The CronJob recipe above is operator-installed.
  • Bucket policies, lifecycle rules, and IAM users that the operator configures directly against the provider are not part of any layer above. Capture them separately (typically with a GitOps pipeline).
  • Hydra OIDC client registrations and Kubernetes Secrets are not covered by object-storage backups. Use cluster-wide tooling (Velero, Kasten K10, similar) for those.
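
Because the chart neither schedules nor prunes VolumeSnapshot resources, an operator-installed snapshot schedule usually needs a retention step as well. The sketch below selects which snapshots to delete, assuming they follow the ilum-objectstorage-YYYYMMDD-HHMMSS naming used in the snapshot recipe on this page, so that lexical order equals chronological order. The function name snapshots_to_prune and the retention count are illustrative.

```shell
#!/bin/sh
# Sketch: read VolumeSnapshot names on stdin, print all but the N newest.
# snapshots_to_prune is a hypothetical helper name. Names that do not match
# the ilum-objectstorage-YYYYMMDD-HHMMSS pattern are ignored, never pruned.
snapshots_to_prune() {
  keep="$1"   # number of most recent snapshots to retain
  { grep '^ilum-objectstorage-[0-9]\{8\}-[0-9]\{6\}$' || true; } \
    | sort -r \
    | tail -n +"$((keep + 1))"
}

# Example pipeline (commented out so that sourcing this file has no side effects;
# xargs -r is a GNU extension that skips the command on empty input):
# kubectl -n ilum get volumesnapshot \
#   -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' \
#   | snapshots_to_prune 24 \
#   | xargs -r kubectl -n ilum delete volumesnapshot
```

With deletionPolicy: Retain, deleting the VolumeSnapshot resource leaves the snapshot on the storage backend, so backend-side cleanup remains a separate task.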

Reference