DuckDB
DuckDB is an embedded analytical database that runs in-process inside ilum-core. It provides single-node SQL execution with no pod startup and no network hop, suited to small-to-medium data and ad-hoc exploration. Combined with the DuckLake catalog, DuckDB is a first-class option for fast local analytics over object storage.
DuckDB is enabled by default in Ilum.
When to use DuckDB
DuckDB is the right engine for:
- Quick queries on small-to-medium datasets.
- Ad-hoc exploration where pod startup latency would be a bottleneck.
- Analytics over DuckLake-managed tables.
- Single-user, single-node workloads.
- Rapid prototyping before scaling out to Spark or Trino.
For distributed workloads on large data, prefer Apache Spark. For interactive analytics on large data with concurrent users, prefer Trino.
Execution model
DuckDB runs in-process with ilum-core:
- No driver pod, no executor pods, no network round-trips for query execution.
- Single-node parallelism via DuckDB's vectorized execution engine.
- Direct reads from object storage (MinIO, S3, GCS, Azure Blob, HDFS) without copying data into a cluster.
This model delivers sub-second response times on small queries that would otherwise be dominated by Spark or Trino startup overhead.
DuckLake catalog
DuckLake is a DuckDB-native catalog enabled by default in Ilum. Tables created through DuckLake are stored on S3-compatible object storage and accessible through DuckDB SQL with no additional configuration.
DuckLake is the default catalog for new DuckDB workloads. Hive Metastore tables remain accessible to DuckDB through standard catalog connectors.
Supported table formats
DuckDB reads and writes:
- Parquet: Native, with predicate pushdown and zone maps.
- CSV, JSON: Direct read with schema inference.
- DuckLake-managed tables: ACID writes through DuckLake.
- Delta Lake and Iceberg: Read access through DuckDB extensions.
Configuration
DuckDB and DuckLake are enabled out of the box. The relevant Helm values:
ilum-core:
  sql:
    duckdb:
      enabled: true
      idleTimeout: 1h
      ducklake:
        enabled: true
DuckLake table data is stored in MinIO (or any configured S3-compatible backend) at a path configurable through ilum-core.sql.duckdb.ducklake.path.
Extension management
The Ilum image ships with DuckDB extensions pre-baked so that runtime sessions never reach the DuckDB extension registry. Two mechanisms coexist:
- Pre-populated extension cache at ~/.duckdb/extensions/ inside the ilum-core container. The standard extensions httpfs, iceberg, postgres_scanner, and ducklake are placed here at image build time. DuckDB's autoload mechanism picks them up transparently the first time a session touches an s3:// path, an Iceberg table, a Postgres ATTACH, or a DuckLake catalog; no INSTALL or LOAD is required, and no outbound call is made.
- Local extension repository at /duckdbExt inside the container, holding hive_metastore and duck_lineage. These are loaded explicitly by Ilum when a Hive metastore or Marquez lineage backend is configured.
| Extension | Source | How it loads |
|---|---|---|
| httpfs | Pre-populated cache (~/.duckdb) | Autoload on first s3:// / https:// access or INSTALL httpfs; LOAD httpfs;. |
| iceberg | Pre-populated cache (~/.duckdb) | Autoload on first iceberg_scan(...) call or INSTALL iceberg; LOAD iceberg;. |
| postgres_scanner | Pre-populated cache (~/.duckdb) | Autoload on first Postgres ATTACH (including DuckLake's catalog) or INSTALL postgres_scanner; LOAD postgres_scanner;. |
| ducklake | Pre-populated cache (~/.duckdb) | Autoload on ATTACH 'ducklake:...' or INSTALL ducklake; LOAD ducklake;. |
| hive_metastore | Local repository (/duckdbExt) | Explicit INSTALL hive_metastore FROM '/duckdbExt'; LOAD hive_metastore; when a Hive metastore is configured. |
| duck_lineage | Local repository (/duckdbExt) | Explicit INSTALL duck_lineage FROM '/duckdbExt'; LOAD duck_lineage; when Marquez is configured. |
The bare DuckDB form INSTALL <extension_name>; LOAD <extension_name>; continues to work for all of these. For the cache-backed set DuckDB resolves locally; for hive_metastore and duck_lineage DuckDB would otherwise reach community-extensions.duckdb.org, so the explicit FROM '/duckdbExt' form is used internally in air-gapped and MITM-restricted deployments.
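As a rough illustration of the two loading paths in the table above, the sketch below prints the statement pair that matches each extension's source. The function name install_sql is hypothetical and is not part of Ilum or DuckDB; it only restates the table as a lookup.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Ilum): prints the INSTALL/LOAD
# statements that match the source of each extension listed in the table.
install_sql() {
  case $1 in
    # Cache-backed extensions resolve locally with the bare form.
    httpfs|iceberg|postgres_scanner|ducklake)
      printf 'INSTALL %s; LOAD %s;\n' "$1" "$1" ;;
    # Repository-backed extensions use the explicit local repository path.
    hive_metastore|duck_lineage)
      printf "INSTALL %s FROM '/duckdbExt'; LOAD %s;\n" "$1" "$1" ;;
    *)
      echo "unknown extension: $1" >&2
      return 1 ;;
  esac
}
```

Calling install_sql duck_lineage prints the FROM '/duckdbExt' form, while install_sql httpfs prints the bare form.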
Adding custom extensions
To bundle extensions beyond the default set — a custom community extension, a private build, or a community extension that Ilum does not pre-stage — use ilum-core.sql.duckdb.extraExtensions. Exactly one source must be configured: a PersistentVolumeClaim (recommended) or a node hostPath. Files must follow the DuckDB layout v<duckdb-version>/<platform>/<name>.duckdb_extension (for example v1.5.1/linux_amd64/myext.duckdb_extension).
The full schema of sql.duckdb.extraExtensions.* is documented on the ilum-core chart parameters page on ArtifactHub.
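The directory layout described above is easy to get wrong by one level, so a small staging helper can be useful. The following is a minimal sketch, assuming a POSIX shell; ext_rel_path and stage_extension are hypothetical names introduced for illustration, not Ilum or DuckDB tooling.

```shell
#!/bin/sh
# Hypothetical helper: builds the relative path expected for an extension
# file inside a local repository: v<duckdb-version>/<platform>/<name>.duckdb_extension
# Arguments: <duckdb-version> <platform> <extension-name>
ext_rel_path() {
  version=$1 platform=$2 name=$3
  # Accept the version with or without the leading "v".
  case $version in
    v*) ;;
    *) version="v$version" ;;
  esac
  printf '%s/%s/%s.duckdb_extension\n' "$version" "$platform" "$name"
}

# Hypothetical helper: copies an extension file into <root> using that layout
# and prints the destination path.
# Arguments: <source-file> <root> <duckdb-version> <platform>
stage_extension() {
  src=$1 root=$2 version=$3 platform=$4
  name=$(basename "$src" .duckdb_extension)
  dest="$root/$(ext_rel_path "$version" "$platform" "$name")"
  mkdir -p "$(dirname "$dest")" && cp "$src" "$dest" && printf '%s\n' "$dest"
}
```

For example, stage_extension ./myext.duckdb_extension /tmp/repo 1.5.1 linux_amd64 would place the file at /tmp/repo/v1.5.1/linux_amd64/myext.duckdb_extension.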
Source comparison
| Source | Holds the full v<ver>/<arch>/ tree? | Notes |
|---|---|---|
| PersistentVolumeClaim | Yes — backed by a filesystem. | Recommended for production. Multi-arch / multi-version friendly. |
| hostPath | Yes — backed by a node filesystem. | Single-node only; binds the deployment to a specific node. |
Example: PVC source
Create a small PVC in the same namespace as ilum-core:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ilum-duckdb-extra-extensions
  namespace: ilum
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 200Mi
Populate it once via a temporary loader Pod that mounts the same claim. The Pod stays running long enough for the operator to copy files in, then is deleted:
apiVersion: v1
kind: Pod
metadata:
  name: duckdb-ext-loader
  namespace: ilum
spec:
  restartPolicy: Never
  containers:
    - name: shell
      image: busybox:1.36
      command: ["sleep", "1800"]
      volumeMounts:
        - name: ext
          mountPath: /data
  volumes:
    - name: ext
      persistentVolumeClaim:
        claimName: ilum-duckdb-extra-extensions
Copy the extension into the PVC and clean up:
kubectl -n ilum exec duckdb-ext-loader -- mkdir -p /data/v1.5.1/linux_amd64
kubectl -n ilum cp ./myext.duckdb_extension duckdb-ext-loader:/data/v1.5.1/linux_amd64/myext.duckdb_extension
kubectl -n ilum delete pod duckdb-ext-loader
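The loader Pod must be deleted before ilum-core rolls out the new revision that mounts the PVC. A ReadWriteOnce claim (the access mode requested above) can be mounted by only one node at a time, so if the loader Pod is still running on a different node, the ilum-core Pod cannot attach the volume until the loader is gone.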
Reference the PVC from helm_aio values:
ilum-core:
  sql:
    duckdb:
      extraExtensions:
        enabled: true
        mountPath: "/duckdbExt-extra"
        existingClaim: "ilum-duckdb-extra-extensions"
After helm upgrade, reference the extension from SQL:
INSTALL myext FROM '/duckdbExt-extra';
LOAD myext;
Example: hostPath source
For single-node clusters (development, edge deployments) where a PVC is overkill, hostPath mounts a directory directly from the Kubernetes node's filesystem into the ilum-core container.
Prepare the directory on the node where ilum-core will run, with the extensions laid out in DuckDB's expected v<duckdb-version>/<platform>/ structure:
# On the Kubernetes node (e.g. via SSH):
sudo mkdir -p /srv/duckdb-extra/v1.5.1/linux_amd64
sudo cp ./myext.duckdb_extension /srv/duckdb-extra/v1.5.1/linux_amd64/
# Make the files readable by the ilum-core container (UID 1001 by default):
sudo chmod -R a+rX /srv/duckdb-extra
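Before pointing the chart at the directory, it can help to confirm that the tree prepared above has the expected shape and permissions. The check below is a sketch only; check_ext_tree is a hypothetical function, not an Ilum or DuckDB tool, and it assumes file names without embedded newlines.

```shell
#!/bin/sh
# Hypothetical pre-flight check: verifies that every *.duckdb_extension file
# under <root> sits at v<version>/<platform>/<name>.duckdb_extension and that
# files and directories are readable by users other than the owner.
check_ext_tree() {
  root=${1%/}
  status=0

  # 1. Layout: every extension file must be exactly three levels below <root>.
  files=$(find "$root" -type f -name '*.duckdb_extension')
  if [ -z "$files" ]; then
    echo "no .duckdb_extension files under $root"
    return 1
  fi
  # The here-document keeps the loop in the current shell, so $status persists.
  while IFS= read -r file; do
    rel=${file#"$root"/}
    case $rel in
      v*/*/*/*) echo "unexpected nesting: $rel"; status=1 ;;
      v*/*/*) ;;
      *) echo "outside v<version>/<platform>/: $rel"; status=1 ;;
    esac
  done <<EOF
$files
EOF

  # 2. Permissions: files need o+r, directories need o+rx (what a+rX grants).
  unreadable=$(find "$root" \( -type f ! -perm -004 \) -o \( -type d ! -perm -005 \))
  if [ -n "$unreadable" ]; then
    echo "not world-readable:"
    echo "$unreadable"
    status=1
  fi
  return "$status"
}
```

Running check_ext_tree /srv/duckdb-extra on the node returns a non-zero status and prints the offending paths if a file is misplaced or unreadable.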
Reference the host path from helm values:
ilum-core:
  sql:
    duckdb:
      extraExtensions:
        enabled: true
        mountPath: "/duckdbExt-extra"
        hostPath: "/srv/duckdb-extra"
After helm upgrade, the SQL form is identical to the PVC case:
INSTALL myext FROM '/duckdbExt-extra';
LOAD myext;
hostPath mounts pin the deployment to the node holding the files. If ilum-core is rescheduled to a different node, the mount will fail and the Pod will not start. Use a node selector or affinity rule to keep ilum-core on the prepared node, or migrate to a PVC for multi-node clusters.
Bundling extensions through extraExtensions, from either source, is the recommended mechanism in MITM-restricted or air-gapped environments where the DuckDB extension registry (extensions.duckdb.org) is unreachable. To make DuckDB's native HTTPS extensions (such as httpfs and aws) also trust an internal Certificate Authority in the same environment, follow the corporate MITM proxy walkthrough. For the broader deployment context, see the Air-gapped Installation Guide.
Selecting DuckDB in the SQL Editor
In the Ilum SQL Editor, the Engine Selector dropdown lets you choose DuckDB for any query. The engine status indicator confirms the in-process engine is ready.
When the automatic engine router is enabled, DuckDB is selected automatically for queries that target small datasets, DuckLake-managed tables, or ad-hoc exploration patterns.
Limitations
- DuckDB is single-node; it does not scale horizontally across executors.
- Query concurrency is bounded by the resources allocated to ilum-core.
- Long-running queries should use Spark or Trino instead, both for resource isolation and for failure recovery.