
Data Lifecycle Management

Overview

As data volumes grow, managing the lifecycle of stored data becomes critical for controlling storage costs, maintaining query performance, and meeting compliance requirements. Without active lifecycle management, data lakes accumulate stale snapshots, orphan files, and small fragmented data files that degrade both performance and cost efficiency.

Ilum provides data lifecycle management through the native maintenance operations of its supported table formats -- Apache Iceberg, Delta Lake, and Apache Hudi -- combined with automation through orchestration tools. Rather than introducing a proprietary abstraction, Ilum lets you execute standard SQL maintenance commands via the SQL Viewer and schedule them using the built-in Scheduler, Airflow, n8n, or Kestra.

This page covers:

  • Table compaction to merge small files and improve read performance
  • Orphan file cleanup to reclaim storage from failed or interrupted operations
  • Snapshot and metadata management to control metadata growth
  • Retention policies to automate expiration of old data and snapshots
  • Storage tiering to optimize cost by placing data on appropriate storage backends
  • Automation patterns to schedule maintenance as recurring workflows

Table Compaction

Frequent writes, updates, and deletes can produce many small files, which increases the overhead of file listing and metadata management, and reduces query throughput. Compaction merges these small files into larger, optimally sized files.

Iceberg

Use the rewrite_data_files procedure to compact an Iceberg table:

CALL spark_catalog.system.rewrite_data_files(
    table => 'db.my_table'
)

You can also filter which partitions to compact:

CALL spark_catalog.system.rewrite_data_files(
    table => 'db.my_table',
    where => 'event_date >= "2025-01-01"'
)
tip

Compaction is a rewrite operation and will produce new snapshots. Plan to run snapshot expiration after compaction to avoid metadata bloat.
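Because the two operations pair naturally, a scheduled PySpark job can emit both statements in order. A minimal sketch (the helper name and the 7-day default are illustrative, not part of any Ilum API); each returned statement would be executed with spark.sql(...):

```python
from datetime import datetime, timedelta, timezone

def compact_then_expire(table: str, retain_days: int = 7) -> list[str]:
    """Build the maintenance statements: compact first, then expire
    snapshots older than the retention window, so the extra snapshots
    produced by the rewrite do not accumulate."""
    cutoff = (datetime.now(timezone.utc) - timedelta(days=retain_days)
              ).strftime("%Y-%m-%d %H:%M:%S")
    return [
        f"CALL spark_catalog.system.rewrite_data_files(table => '{table}')",
        (
            f"CALL spark_catalog.system.expire_snapshots(table => '{table}', "
            f"older_than => TIMESTAMP '{cutoff}')"
        ),
    ]
```

In a maintenance job you would loop over the returned list and run each statement with spark.sql(stmt).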

Delta Lake

Delta Lake provides the OPTIMIZE command for compaction:

OPTIMIZE db.my_table

For tables with high-cardinality filter columns, use ZORDER BY to co-locate related data within the same files, improving query performance on those columns:

OPTIMIZE db.my_table ZORDER BY (customer_id, order_date)

Hudi

Apache Hudi Merge on Read (MoR) tables use compaction to merge log files with base files, producing new file slices. Compaction can run inline (during writes) or asynchronously.

For inline compaction, set these table properties at creation time:

CREATE TABLE db.my_hudi_table (...) USING HUDI
TBLPROPERTIES (
    type = 'mor',
    primaryKey = 'id',
    'hoodie.compact.inline' = 'true',
    'hoodie.compact.inline.max.delta.commits' = '5'
)

For asynchronous compaction, set hoodie.compact.inline to false and schedule a separate compaction job. Refer to the Apache Hudi documentation for details on async compaction strategies.
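Recent Hudi releases also expose compaction through Spark SQL call procedures, which fits the SQL-driven maintenance pattern used elsewhere on this page. A hedged sketch (verify the run_compaction procedure and its arguments against your Hudi version before relying on it):

```python
def hudi_compaction_sql(table: str) -> list[str]:
    """Statements for Hudi's run_compaction call procedure:
    'schedule' creates a compaction plan, 'run' executes pending plans."""
    return [
        f"CALL run_compaction(op => 'schedule', table => '{table}')",
        f"CALL run_compaction(op => 'run', table => '{table}')",
    ]
```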

Trino

When using Trino as your SQL engine, you can optimize both Iceberg and Delta Lake tables with the optimize procedure:

ALTER TABLE db.my_table EXECUTE optimize

For Iceberg tables, you can also specify a file size threshold:

ALTER TABLE db.my_table EXECUTE optimize(file_size_threshold => '128MB')

When to Compact

  • After bulk ingestion that produces many small files
  • After a series of updates or deletes that fragment existing data files
  • On a recurring schedule (e.g., nightly or weekly) for tables with continuous writes
warning

Compaction is resource-intensive. Schedule it during off-peak hours and allocate sufficient executor memory to avoid out-of-memory failures. For large tables, compact partition by partition rather than the entire table at once.
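The partition-by-partition approach can be sketched as a helper that emits one rewrite_data_files call per partition value (the helper name is illustrative), so each Spark job only rewrites a slice of the table:

```python
def per_partition_compaction_sql(table: str, partition_col: str,
                                 values: list[str]) -> list[str]:
    """One rewrite_data_files call per partition value, keeping each
    job's working set small enough to avoid executor OOM failures."""
    return [
        f"CALL spark_catalog.system.rewrite_data_files("
        f"table => '{table}', where => '{partition_col} = \"{value}\"')"
        for value in values
    ]
```

Running the resulting statements sequentially trades wall-clock time for predictable memory usage.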


Orphan File Cleanup

Orphan files are data files that exist in storage but are no longer referenced by any table metadata. They can result from:

  • Failed or interrupted write operations
  • Schema evolution operations that create new files before dropping old references
  • Concurrent writes where one transaction is aborted

These files consume storage without serving any purpose.

Iceberg

Use the remove_orphan_files procedure to clean up unreferenced files:

CALL spark_catalog.system.remove_orphan_files(
    table => 'db.my_table',
    older_than => TIMESTAMP '2025-12-01 00:00:00'
)
warning

Always set older_than to a timestamp at least 3 days in the past. Files created by in-progress writes may not yet be referenced in metadata. Removing them prematurely can corrupt active operations.

To preview which files would be removed without actually deleting them, use the dry_run option:

CALL spark_catalog.system.remove_orphan_files(
    table => 'db.my_table',
    older_than => TIMESTAMP '2025-12-01 00:00:00',
    dry_run => true
)

Delta Lake

Delta Lake handles orphan file removal through the VACUUM command (covered in Snapshot & Metadata Management). Files not referenced by the transaction log and older than the retention threshold are removed during vacuum.

Scheduling Cleanup

Orphan file cleanup should run on a regular schedule. Use the ilum Scheduler to set up a recurring Spark job that executes cleanup procedures, or integrate it into an Airflow DAG alongside other maintenance tasks.
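A scheduled cleanup script can compute the cutoff timestamp at run time and enforce the 3-day safety margin in code. A minimal sketch (the helper name and default are illustrative):

```python
from datetime import datetime, timedelta, timezone

def orphan_cleanup_sql(table: str, older_than_days: int = 3,
                       dry_run: bool = False) -> str:
    """Build a remove_orphan_files statement with a cutoff at least
    3 days in the past, so in-progress writes are never removed."""
    if older_than_days < 3:
        raise ValueError("older_than must be at least 3 days in the past")
    cutoff = (datetime.now(timezone.utc) - timedelta(days=older_than_days)
              ).strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"CALL spark_catalog.system.remove_orphan_files(table => '{table}', "
        f"older_than => TIMESTAMP '{cutoff}', dry_run => {str(dry_run).lower()})"
    )
```

Running with dry_run=True first and reviewing the returned file list is a cheap safeguard before the destructive pass.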


Snapshot & Metadata Management

Table formats maintain snapshots and metadata for features like time travel and rollback. Over time, accumulating snapshots increases metadata size, slows down query planning, and consumes additional storage for data files that are only retained by old snapshots.

Iceberg Snapshot Expiration

Expire old snapshots to remove metadata entries and allow referenced data files to be cleaned up:

CALL spark_catalog.system.expire_snapshots(
    table => 'db.my_table',
    older_than => TIMESTAMP '2025-12-01 00:00:00'
)

You can also limit the number of snapshots to retain:

CALL spark_catalog.system.expire_snapshots(
    table => 'db.my_table',
    older_than => TIMESTAMP '2025-12-01 00:00:00',
    retain_last => 10
)

Iceberg also supports automatic metadata cleanup via table properties:

ALTER TABLE db.my_table SET TBLPROPERTIES (
    'write.metadata.delete-after-commit.enabled' = 'true',
    'write.metadata.previous-versions-max' = '10'
)

Delta Lake VACUUM

The VACUUM command removes data files that are no longer referenced by the Delta transaction log and are older than the retention threshold:

VACUUM db.my_table RETAIN 168 HOURS

Use DRY RUN to preview the files that would be deleted:

VACUUM db.my_table RETAIN 168 HOURS DRY RUN
note

The default Delta Lake retention period is 7 days (168 hours). Setting it below this threshold requires disabling a safety check, which is not recommended as it can break time travel queries and concurrent reads.
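A maintenance script can encode that safety check instead of relying on operators to remember it. A minimal sketch (the helper name is illustrative); the statement would be run with spark.sql(...), and in dry-run mode Delta returns the candidate file paths as a result set:

```python
def vacuum_sql(table: str, retain_hours: int = 168, dry_run: bool = False) -> str:
    """Build a VACUUM statement, refusing retention below the 7-day
    default rather than disabling Delta's safety check."""
    if retain_hours < 168:
        raise ValueError(
            "retention below 168 hours can break time travel and concurrent reads")
    return f"VACUUM {table} RETAIN {retain_hours} HOURS" + \
        (" DRY RUN" if dry_run else "")
```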

For more details on time travel, snapshots, and tagging in each table format, see the SQL Viewer documentation.


Retention Policies

Retention policies define how long data, snapshots, and log files are kept. Configuring retention at the table level ensures that maintenance operations automatically respect these boundaries.

Iceberg Retention Properties

Set table properties to control automatic snapshot expiration:

ALTER TABLE db.my_table SET TBLPROPERTIES (
    'history.expire.max-snapshot-age-ms' = '604800000', -- 7 days
    'history.expire.min-snapshots-to-keep' = '5'
)

These properties are respected by the expire_snapshots procedure when called without explicit older_than or retain_last arguments.

Delta Lake Retention Properties

Configure retention for deleted files and transaction log entries:

ALTER TABLE db.my_table SET TBLPROPERTIES (
    'delta.deletedFileRetentionDuration' = 'interval 30 days',
    'delta.logRetentionDuration' = 'interval 90 days'
)

  • deletedFileRetentionDuration controls how long deleted data files are kept before VACUUM can remove them.
  • logRetentionDuration controls how long transaction log entries are retained, which determines the time travel window.

Building a Retention Schedule

A practical retention strategy often combines multiple tiers based on data age:

| Tier  | Age               | Storage                        | Action                                               |
|-------|-------------------|--------------------------------|------------------------------------------------------|
| Hot   | 0--30 days        | Fast SSD (MinIO local)         | Full snapshot retention, no cleanup                  |
| Warm  | 30--90 days       | S3 / GCS                       | Expire snapshots older than 30 days, compact weekly  |
| Cold  | 90 days -- 1 year | S3 Infrequent Access / Archive | Retain only latest snapshot, vacuum aggressively     |
| Purge | > 1 year          | --                             | Drop tables or partitions per compliance policy      |

Implement this schedule as a recurring Airflow DAG that runs different maintenance procedures based on the age of each partition. See Automation Patterns for examples.
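The age-based dispatch at the core of such a DAG can be sketched as follows (the tier thresholds mirror the table above; the helper names and per-tier statements are illustrative simplifications — in particular, expire_snapshots acts on the whole table, not a single partition):

```python
def tier_for_age(age_days: int) -> str:
    """Map partition age to a retention tier per the schedule above."""
    if age_days <= 30:
        return "hot"
    if age_days <= 90:
        return "warm"
    if age_days <= 365:
        return "cold"
    return "purge"

def maintenance_for_partition(table: str, partition_filter: str,
                              age_days: int) -> list[str]:
    """Return the maintenance statements appropriate for a partition's tier."""
    tier = tier_for_age(age_days)
    if tier == "hot":
        return []  # full snapshot retention, no cleanup
    if tier == "warm":
        return [f"CALL spark_catalog.system.rewrite_data_files("
                f"table => '{table}', where => '{partition_filter}')"]
    if tier == "cold":
        # Table-level: keep only the latest snapshot.
        return [f"CALL spark_catalog.system.expire_snapshots("
                f"table => '{table}', retain_last => 1)"]
    # Purge: remove the partition's rows per compliance policy.
    return [f"DELETE FROM {table} WHERE {partition_filter}"]
```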


Storage Tiering

Ilum supports multiple storage backends -- MinIO (default), S3, GCS, WASBS, and HDFS -- which can be configured per cluster. By leveraging multiple storage backends, you can place frequently accessed data on fast local storage and move older or infrequently accessed data to cheaper object storage.

For instructions on adding storage backends to your clusters, see Clusters and Storages.

Iceberg Storage Location

Iceberg tables support configuring the write location per table, allowing you to direct new data to a specific storage tier:

ALTER TABLE db.my_table SET TBLPROPERTIES (
    'write.data.path' = 's3a://cold-storage-bucket/warehouse/my_table/data'
)

For partitioned tables, you can implement tier-based placement by writing different partitions to different storage locations. For example, use a Spark job that reads partitions older than a cutoff date and rewrites them so their files are written to the cold-tier location:

spark.sql("""
    INSERT OVERWRITE db.my_table PARTITION (event_date)
    SELECT * FROM db.my_table
    WHERE event_date < '2025-01-01'
""")

After rewriting, update the table's data path for new writes to target the appropriate tier.

Partition-Based Tiering Strategy

A common approach is to use date partitions to separate data across tiers:

  1. Recent partitions (last 30 days) remain on fast local MinIO storage for low-latency queries
  2. Older partitions (30--365 days) are moved to S3 or GCS for cost-effective storage
  3. Archived partitions (> 1 year) are moved to infrequent-access storage classes or deleted

Automate this movement with scheduled Spark jobs that:

  • Read data from the source tier
  • Write data to the destination tier
  • Update table metadata to reflect the new file locations
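For an Iceberg table, the steps above reduce to an ordered sequence of statements: point the write path at the cold tier, rewrite the old partitions so their files land there, then restore the hot-tier path for ongoing writes. A sketch under illustrative names, assuming dynamic partition overwrite mode so only the selected partitions are replaced:

```python
def migrate_to_cold_sql(table: str, cold_path: str, hot_path: str,
                        cutoff_date: str) -> list[str]:
    """Ordered statements for moving partitions older than cutoff_date
    to cold storage. Statement order matters."""
    return [
        # 1. Direct new data files to the cold tier.
        f"ALTER TABLE {table} SET TBLPROPERTIES "
        f"('write.data.path' = '{cold_path}')",
        # 2. Rewrite old partitions; their files land on the cold tier.
        f"INSERT OVERWRITE {table} PARTITION (event_date) "
        f"SELECT * FROM {table} WHERE event_date < '{cutoff_date}'",
        # 3. Restore the hot-tier path for ongoing writes.
        f"ALTER TABLE {table} SET TBLPROPERTIES "
        f"('write.data.path' = '{hot_path}')",
    ]
```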

Delta Lake Storage Management

Delta Lake tables use the location specified at table creation. To implement tiering, create separate tables or use CREATE TABLE ... LOCATION to control placement:

CREATE TABLE db.cold_archive
USING DELTA
LOCATION 's3a://cold-storage-bucket/warehouse/cold_archive'
AS SELECT * FROM db.my_table WHERE event_date < '2025-01-01'

Automation Patterns

Manual execution of lifecycle operations is error-prone and unsustainable. Ilum provides several automation options for scheduling recurring maintenance.

Airflow DAG

Build an Airflow DAG that runs weekly maintenance across your tables. Submit maintenance SQL via the LivyOperator to execute on a Spark cluster managed by Ilum.

from airflow import DAG
from airflow.providers.apache.livy.operators.livy import LivyOperator
from datetime import datetime

with DAG(
    dag_id='weekly_table_maintenance',
    schedule='0 2 * * 0',  # every Sunday at 2 AM
    start_date=datetime(2025, 1, 1),
    catchup=False,
) as dag:

    compact = LivyOperator(
        task_id='compact_tables',
        file='s3a://ilum-files/maintenance/compact.py',
        livy_conn_id='ilum-livy-proxy',
        polling_interval=30,
    )

    expire_snapshots = LivyOperator(
        task_id='expire_snapshots',
        file='s3a://ilum-files/maintenance/expire_snapshots.py',
        livy_conn_id='ilum-livy-proxy',
        polling_interval=30,
    )

    cleanup = LivyOperator(
        task_id='orphan_cleanup',
        file='s3a://ilum-files/maintenance/cleanup.py',
        livy_conn_id='ilum-livy-proxy',
        polling_interval=30,
    )

    compact >> expire_snapshots >> cleanup

The maintenance scripts (e.g., compact.py) contain the SQL calls shown in earlier sections, executed via spark.sql(...). For complete details on configuring Airflow with Ilum, see the Airflow documentation.

ilum Built-in Scheduler

For simpler recurring tasks, use the ilum Scheduler to create a scheduled Spark job that runs maintenance SQL. This avoids the overhead of deploying a full orchestration platform.

  1. Write a PySpark script that executes your maintenance procedures
  2. Upload the script to your storage (e.g., MinIO)
  3. Create a schedule in the ilum UI with an appropriate CRON expression (e.g., 0 3 * * 0 for weekly Sunday at 3 AM)
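The uploaded script can be as small as the sketch below (table names and helper names are illustrative). It follows the recommended order: compact first, then expire snapshots, then remove orphan files, relying on the procedures' default retention arguments:

```python
def maintenance_statements(table: str) -> list[str]:
    """Ordered maintenance statements for one Iceberg table. Order
    matters: compaction first, then expiration of the snapshots the
    compaction produced, then orphan file cleanup."""
    return [
        f"CALL spark_catalog.system.rewrite_data_files(table => '{table}')",
        f"CALL spark_catalog.system.expire_snapshots(table => '{table}')",
        f"CALL spark_catalog.system.remove_orphan_files(table => '{table}')",
    ]

def run(spark, tables: list[str]) -> None:
    """Execute all maintenance statements against a live SparkSession."""
    for table in tables:
        for stmt in maintenance_statements(table):
            spark.sql(stmt)

# In the uploaded script, pass the cluster's SparkSession and your table
# list, e.g.: run(spark, ["db.events", "db.orders"])
```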

n8n Workflow

With n8n, you can build a visual workflow that triggers Spark maintenance jobs through Ilum's API. This is useful when maintenance should be triggered by external events (e.g., after a data pipeline completes) rather than on a fixed schedule.

Kestra Pipeline

Kestra provides a declarative YAML-based approach to defining maintenance pipelines:

id: iceberg-maintenance
namespace: ilum.maintenance
tasks:
  - id: compact
    type: io.kestra.plugin.spark.SparkCLI
    master: "k8s://https://ilum-core:9888"
    args:
      - "--class"
      - "org.apache.spark.sql.IcebergMaintenance"
      - "s3a://ilum-files/maintenance/iceberg-maintenance.jar"
triggers:
  - id: weekly
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 4 * * 0"

Recommended Cadence

| Operation              | Frequency                        | Notes                                  |
|------------------------|----------------------------------|----------------------------------------|
| Table compaction       | Weekly or after large ingestions | Partition-level for large tables       |
| Snapshot expiration    | Weekly                           | Run after compaction                   |
| Orphan file cleanup    | Weekly or bi-weekly              | Set older_than to at least 3 days ago  |
| VACUUM (Delta)         | Weekly                           | Keep retention at 7+ days              |
| Storage tier migration | Monthly                          | Move cold partitions to cheaper storage |
tip

Combine related operations into a single scheduled job or DAG to reduce cluster overhead. Run compaction first, then snapshot expiration, then orphan file cleanup -- this order ensures that files produced by compaction are properly tracked before cleanup runs.