Data Lifecycle Management
Overview
As data volumes grow, managing the lifecycle of stored data becomes critical for controlling storage costs, maintaining query performance, and meeting compliance requirements. Without active lifecycle management, data lakes accumulate stale snapshots, orphan files, and small fragmented data files that degrade both performance and cost efficiency.
Ilum provides data lifecycle management through the native maintenance operations of its supported table formats -- Apache Iceberg, Delta Lake, and Apache Hudi -- combined with automation through orchestration tools. Rather than introducing a proprietary abstraction, Ilum lets you execute standard SQL maintenance commands via the SQL Viewer and schedule them using the built-in Scheduler, Airflow, n8n, or Kestra.
This page covers:
- Table compaction to merge small files and improve read performance
- Orphan file cleanup to reclaim storage from failed or interrupted operations
- Snapshot and metadata management to control metadata growth
- Retention policies to automate expiration of old data and snapshots
- Storage tiering to optimize cost by placing data on appropriate storage backends
- Automation patterns to schedule maintenance as recurring workflows
Table Compaction
Frequent writes, updates, and deletes can produce many small files, which increases the overhead of file listing and metadata management, and reduces query throughput. Compaction merges these small files into larger, optimally sized files.
Iceberg
Use the rewrite_data_files procedure to compact an Iceberg table:
CALL spark_catalog.system.rewrite_data_files(
table => 'db.my_table'
)
You can also filter which partitions to compact:
CALL spark_catalog.system.rewrite_data_files(
table => 'db.my_table',
where => 'event_date >= "2025-01-01"'
)
Compaction is a rewrite operation and will produce new snapshots. Plan to run snapshot expiration after compaction to avoid metadata bloat.
Delta Lake
Delta Lake provides the OPTIMIZE command for compaction:
OPTIMIZE db.my_table
For tables with high-cardinality filter columns, use ZORDER BY to co-locate related data within the same files, improving query performance on those columns:
OPTIMIZE db.my_table ZORDER BY (customer_id, order_date)
Hudi
Apache Hudi Merge on Read (MoR) tables use compaction to merge log files with base files, producing new file slices. Compaction can run inline (during writes) or asynchronously.
For inline compaction, set these table properties at creation time:
CREATE TABLE db.my_hudi_table (...) USING HUDI
TBLPROPERTIES (
type = 'mor',
primaryKey = 'id',
'hoodie.compact.inline' = 'true',
'hoodie.compact.inline.max.delta.commits' = '5'
)
For asynchronous compaction, set hoodie.compact.inline to false and schedule a separate compaction job. Refer to the Apache Hudi documentation for details on async compaction strategies.
Trino
When using Trino as your SQL engine, you can optimize both Iceberg and Delta Lake tables with the optimize procedure:
ALTER TABLE db.my_table EXECUTE optimize
For Iceberg tables, you can also specify a file size threshold:
ALTER TABLE db.my_table EXECUTE optimize(file_size_threshold => '128MB')
When to Compact
- After bulk ingestion that produces many small files
- After a series of updates or deletes that fragment existing data files
- On a recurring schedule (e.g., nightly or weekly) for tables with continuous writes
Compaction is resource-intensive. Schedule it during off-peak hours and allocate sufficient executor memory to avoid out-of-memory failures. For large tables, compact partition by partition rather than the entire table at once.
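The partition-by-partition approach can be sketched as a small PySpark loop. The table name, partition column (`event_date`), and date range below are illustrative placeholders, not part of Ilum's API:

```python
# Sketch: compact an Iceberg table one partition at a time so each
# rewrite job stays small. Table and column names are illustrative.

def compaction_call(table: str, predicate: str) -> str:
    """Build the rewrite_data_files call for one partition slice."""
    return (
        "CALL spark_catalog.system.rewrite_data_files("
        f"table => '{table}', where => '{predicate}')"
    )

def compact_by_partition(spark, table: str, partition_values):
    # One rewrite per partition bounds memory use and is restartable.
    for value in partition_values:
        spark.sql(compaction_call(table, f'event_date = "{value}"'))

if __name__ == "__main__":
    from pyspark.sql import SparkSession  # requires a Spark runtime
    spark = SparkSession.builder.getOrCreate()
    # Example: compact only the January 2025 partitions.
    dates = [f"2025-01-{day:02d}" for day in range(1, 32)]
    compact_by_partition(spark, "db.my_table", dates)
```

Each iteration produces its own snapshot, so follow the run with snapshot expiration as noted above.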
Orphan File Cleanup
Orphan files are data files that exist in storage but are no longer referenced by any table metadata. They can result from:
- Failed or interrupted write operations
- Schema evolution operations that create new files before dropping old references
- Concurrent writes where one transaction is aborted
These files consume storage without serving any purpose.
Iceberg
Use the remove_orphan_files procedure to clean up unreferenced files:
CALL spark_catalog.system.remove_orphan_files(
table => 'db.my_table',
older_than => TIMESTAMP '2025-12-01 00:00:00'
)
Always set older_than to a timestamp at least 3 days in the past. Files created by in-progress writes may not yet be referenced in metadata. Removing them prematurely can corrupt active operations.
To preview which files would be removed without actually deleting them, use the dry_run option:
CALL spark_catalog.system.remove_orphan_files(
table => 'db.my_table',
older_than => TIMESTAMP '2025-12-01 00:00:00',
dry_run => true
)
Delta Lake
Delta Lake handles orphan file removal through the VACUUM command (covered in Snapshot & Metadata Management). Files not referenced by the transaction log and older than the retention threshold are removed during vacuum.
Scheduling Cleanup
Orphan file cleanup should run on a regular schedule. Use the Ilum Scheduler to set up a recurring Spark job that executes cleanup procedures, or integrate it into an Airflow DAG alongside other maintenance tasks.
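A recurring cleanup job along these lines can be sketched in PySpark. The 3-day margin follows the safety guidance above; the table names are placeholders:

```python
# Sketch: recurring orphan-file cleanup with the 3-day safety margin
# described above. Table names are illustrative.
from datetime import datetime, timedelta, timezone

def cleanup_call(table: str, cutoff: datetime, dry_run: bool = False) -> str:
    """Build a remove_orphan_files call with an explicit older_than cutoff."""
    ts = cutoff.strftime("%Y-%m-%d %H:%M:%S")
    extra = ", dry_run => true" if dry_run else ""
    return (
        "CALL spark_catalog.system.remove_orphan_files("
        f"table => '{table}', older_than => TIMESTAMP '{ts}'{extra})"
    )

def run_cleanup(spark, tables, margin_days: int = 3):
    # Never clean files younger than the margin; in-flight writes may
    # not yet be referenced in table metadata.
    cutoff = datetime.now(timezone.utc) - timedelta(days=margin_days)
    for table in tables:
        spark.sql(cleanup_call(table, cutoff))

if __name__ == "__main__":
    from pyspark.sql import SparkSession  # requires a Spark runtime
    spark = SparkSession.builder.getOrCreate()
    run_cleanup(spark, ["db.my_table"])
```

Passing dry_run=True builds the preview variant shown earlier, which is useful for a first validation run.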
Snapshot & Metadata Management
Table formats maintain snapshots and metadata for features like time travel and rollback. Over time, accumulating snapshots increases metadata size, slows down query planning, and consumes additional storage for data files that are only retained by old snapshots.
Iceberg Snapshot Expiration
Expire old snapshots to remove metadata entries and allow referenced data files to be cleaned up:
CALL spark_catalog.system.expire_snapshots(
table => 'db.my_table',
older_than => TIMESTAMP '2025-12-01 00:00:00'
)
You can also limit the number of snapshots to retain:
CALL spark_catalog.system.expire_snapshots(
table => 'db.my_table',
older_than => TIMESTAMP '2025-12-01 00:00:00',
retain_last => 10
)
Iceberg also supports automatic metadata cleanup via table properties:
ALTER TABLE db.my_table SET TBLPROPERTIES (
'write.metadata.delete-after-commit.enabled' = 'true',
'write.metadata.previous-versions-max' = '10'
)
Delta Lake VACUUM
The VACUUM command removes data files that are no longer referenced by the Delta transaction log and are older than the retention threshold:
VACUUM db.my_table RETAIN 168 HOURS
Use DRY RUN to preview the files that would be deleted:
VACUUM db.my_table RETAIN 168 HOURS DRY RUN
The default Delta Lake retention period is 7 days (168 hours). Setting it below this threshold requires disabling a safety check, which is not recommended as it can break time travel queries and concurrent reads.
For more details on time travel, snapshots, and tagging in each table format, see the SQL Viewer documentation.
Retention Policies
Retention policies define how long data, snapshots, and log files are kept. Configuring retention at the table level ensures that maintenance operations automatically respect these boundaries.
Iceberg Retention Properties
Set table properties to control automatic snapshot expiration:
ALTER TABLE db.my_table SET TBLPROPERTIES (
'history.expire.max-snapshot-age-ms' = '604800000', -- 7 days
'history.expire.min-snapshots-to-keep' = '5'
)
These properties are respected by the expire_snapshots procedure when called without explicit older_than or retain_last arguments.
Delta Lake Retention Properties
Configure retention for deleted files and transaction log entries:
ALTER TABLE db.my_table SET TBLPROPERTIES (
'delta.deletedFileRetentionDuration' = 'interval 30 days',
'delta.logRetentionDuration' = 'interval 90 days'
)
deletedFileRetentionDuration controls how long deleted data files are kept before VACUUM can remove them. logRetentionDuration controls how long transaction log entries are retained, which determines the time travel window.
Building a Retention Schedule
A practical retention strategy often combines multiple tiers based on data age:
| Tier | Age | Storage | Action |
|---|---|---|---|
| Hot | 0--30 days | Fast SSD (MinIO local) | Full snapshot retention, no cleanup |
| Warm | 30--90 days | S3 / GCS | Expire snapshots older than 30 days, compact weekly |
| Cold | 90 days -- 1 year | S3 Infrequent Access / Archive | Retain only latest snapshot, vacuum aggressively |
| Purge | > 1 year | -- | Drop tables or partitions per compliance policy |
Implement this schedule as a recurring Airflow DAG that runs different maintenance procedures based on the age of each partition. See Automation Patterns for examples.
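One way to drive such a DAG is a small helper that maps each partition's age to a tier from the table above. The thresholds below simply mirror that table and are illustrative:

```python
# Sketch: map a partition's age to the lifecycle tiers in the table
# above. Thresholds mirror that table and are illustrative.

def tier_for_age(age_days: int) -> str:
    """Return the lifecycle tier for a partition of the given age."""
    if age_days <= 30:
        return "hot"    # full snapshot retention, no cleanup
    if age_days <= 90:
        return "warm"   # expire old snapshots, compact weekly
    if age_days <= 365:
        return "cold"   # latest snapshot only, vacuum aggressively
    return "purge"      # drop per compliance policy
```

An orchestrator task can call tier_for_age for each partition and branch to the matching maintenance procedure.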
Storage Tiering
Ilum supports multiple storage backends -- MinIO (default), S3, GCS, WASBS, and HDFS -- which can be configured per cluster. By leveraging multiple storage backends, you can place frequently accessed data on fast local storage and move older or infrequently accessed data to cheaper object storage.
For instructions on adding storage backends to your clusters, see Clusters and Storages.
Iceberg Storage Location
Iceberg tables support configuring the write location per table, allowing you to direct new data to a specific storage tier:
ALTER TABLE db.my_table SET TBLPROPERTIES (
'write.data.path' = 's3a://cold-storage-bucket/warehouse/my_table/data'
)
For partitioned tables, you can implement tier-based placement by writing different partitions to different storage locations. For example, use a Spark job that reads older partitions from warm storage and rewrites them to cold storage:
spark.sql("""
INSERT OVERWRITE db.my_table PARTITION (event_date)
SELECT * FROM db.my_table
WHERE event_date < '2025-01-01'
""")
After rewriting, update the table's data path for new writes to target the appropriate tier.
Partition-Based Tiering Strategy
A common approach is to use date partitions to separate data across tiers:
- Recent partitions (last 30 days) remain on fast local MinIO storage for low-latency queries
- Older partitions (30--365 days) are moved to S3 or GCS for cost-effective storage
- Archived partitions (> 1 year) are moved to infrequent-access storage classes or deleted
Automate this movement with scheduled Spark jobs that:
- Read data from the source tier
- Write data to the destination tier
- Update table metadata to reflect the new file locations
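The steps above can be sketched as a PySpark job for an Iceberg table. The bucket path, cutoff date, and the choice to switch write.data.path before rewriting (so rewritten files land on the cold tier) are assumptions for illustration:

```python
# Sketch: migrate old Iceberg partitions to a cold-storage bucket.
# Bucket path, cutoff date, and table name are illustrative.

COLD_PATH = "s3a://cold-storage-bucket/warehouse/my_table/data"  # assumed bucket

def migration_statements(table: str, cutoff_date: str):
    """SQL statements that rewrite partitions older than cutoff_date
    so their files land on the cold tier."""
    return [
        # 1. Point the table's write path at the cold tier first, so
        #    the rewritten files are created there.
        f"ALTER TABLE {table} SET TBLPROPERTIES ('write.data.path' = '{COLD_PATH}')",
        # 2. Rewrite the old partitions; their new files go to COLD_PATH.
        f"INSERT OVERWRITE {table} PARTITION (event_date) "
        f"SELECT * FROM {table} WHERE event_date < '{cutoff_date}'",
        # 3. Afterwards, reset write.data.path to the hot tier so regular
        #    ingestion keeps landing on fast storage (not shown).
    ]

def migrate(spark, table: str, cutoff_date: str):
    for stmt in migration_statements(table, cutoff_date):
        spark.sql(stmt)
```

This is a sketch of one possible ordering, not a prescribed procedure; validate it on a non-production table first.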
Delta Lake Storage Management
Delta Lake tables use the location specified at table creation. To implement tiering, create separate tables or use CREATE TABLE ... LOCATION to control placement:
CREATE TABLE db.cold_archive
USING DELTA
LOCATION 's3a://cold-storage-bucket/warehouse/cold_archive'
AS SELECT * FROM db.my_table WHERE event_date < '2025-01-01'
Automation Patterns
Manual execution of lifecycle operations is error-prone and unsustainable. Ilum provides several automation options for scheduling recurring maintenance.
Airflow DAG
Build an Airflow DAG that runs weekly maintenance across your tables. Submit maintenance SQL via the LivyOperator to execute on a Spark cluster managed by Ilum.
from airflow import DAG
from airflow.providers.apache.livy.operators.livy import LivyOperator
from datetime import datetime
with DAG(
dag_id='weekly_table_maintenance',
schedule='0 2 * * 0', # Every Sunday at 2 AM
start_date=datetime(2025, 1, 1),
catchup=False,
) as dag:
compact = LivyOperator(
task_id='compact_tables',
file='s3a://ilum-files/maintenance/compact.py',
livy_conn_id='ilum-livy-proxy',
polling_interval=30,
)
expire_snapshots = LivyOperator(
task_id='expire_snapshots',
file='s3a://ilum-files/maintenance/expire_snapshots.py',
livy_conn_id='ilum-livy-proxy',
polling_interval=30,
)
cleanup = LivyOperator(
task_id='orphan_cleanup',
file='s3a://ilum-files/maintenance/cleanup.py',
livy_conn_id='ilum-livy-proxy',
polling_interval=30,
)
compact >> expire_snapshots >> cleanup
The maintenance scripts (e.g., compact.py) contain the SQL calls shown in earlier sections, executed via spark.sql(...). For complete details on configuring Airflow with Ilum, see the Airflow documentation.
Ilum Built-in Scheduler
For simpler recurring tasks, use the Ilum Scheduler to create a scheduled Spark job that runs maintenance SQL. This avoids the overhead of deploying a full orchestration platform.
- Write a PySpark script that executes your maintenance procedures
- Upload the script to your storage (e.g., MinIO)
- Create a schedule in the Ilum UI with an appropriate CRON expression (e.g., 0 3 * * 0 for weekly on Sunday at 3 AM)
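A minimal script for such a schedule might look like the following. It compacts an Iceberg table and then expires old snapshots, matching the recommended order; the table name and 7-day cutoff are illustrative:

```python
# Sketch: a single maintenance script for the Ilum Scheduler that
# compacts an Iceberg table, then expires old snapshots.
# Table name and 7-day cutoff are illustrative.
from datetime import datetime, timedelta, timezone

def maintenance_statements(table: str, cutoff: datetime):
    """Compaction first, then snapshot expiration."""
    ts = cutoff.strftime("%Y-%m-%d %H:%M:%S")
    return [
        f"CALL spark_catalog.system.rewrite_data_files(table => '{table}')",
        "CALL spark_catalog.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{ts}')",
    ]

if __name__ == "__main__":
    from pyspark.sql import SparkSession  # provided by the Spark job runtime
    spark = SparkSession.builder.getOrCreate()
    cutoff = datetime.now(timezone.utc) - timedelta(days=7)
    for stmt in maintenance_statements("db.my_table", cutoff):
        spark.sql(stmt)
```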
n8n Workflow
With n8n, you can build a visual workflow that triggers Spark maintenance jobs through Ilum's API. This is useful when maintenance should be triggered by external events (e.g., after a data pipeline completes) rather than on a fixed schedule.
Kestra Pipeline
Kestra provides a declarative YAML-based approach to defining maintenance pipelines:
id: iceberg-maintenance
namespace: ilum.maintenance
tasks:
- id: compact
type: io.kestra.plugin.spark.SparkCLI
master: "k8s://https://ilum-core:9888"
args:
- "--class"
- "org.apache.spark.sql.IcebergMaintenance"
- "s3a://ilum-files/maintenance/iceberg-maintenance.jar"
triggers:
- id: weekly
type: io.kestra.plugin.core.trigger.Schedule
cron: "0 4 * * 0"
Recommended Maintenance Schedule
| Operation | Frequency | Notes |
|---|---|---|
| Table compaction | Weekly or after large ingestions | Partition-level for large tables |
| Snapshot expiration | Weekly | Run after compaction |
| Orphan file cleanup | Weekly or bi-weekly | Set older_than to at least 3 days ago |
| VACUUM (Delta) | Weekly | Keep retention at 7+ days |
| Storage tier migration | Monthly | Move cold partitions to cheaper storage |
Combine related operations into a single scheduled job or DAG to reduce cluster overhead. Run compaction first, then snapshot expiration, then orphan file cleanup -- this order ensures that files produced by compaction are properly tracked before cleanup runs.