DuckDB

DuckDB is an embedded analytical database that runs in-process inside ilum-core. It provides single-node SQL execution with no pod startup and no network hop between client and engine, which suits small-to-medium data and ad-hoc exploration. Combined with the DuckLake catalog, DuckDB is a first-class option for fast local analytics over object storage.

DuckDB is enabled by default in Ilum.

When to use DuckDB

DuckDB is the right engine for:

  • Quick queries on small-to-medium datasets.
  • Ad-hoc exploration where pod startup latency would be a bottleneck.
  • Analytics over DuckLake-managed tables.
  • Single-user, single-node workloads.
  • Rapid prototyping before scaling out to Spark or Trino.

For distributed workloads on large data, prefer Apache Spark. For interactive analytics on large data with concurrent users, prefer Trino.

Execution model

DuckDB runs in-process with ilum-core:

  • No driver pod, no executor pods, no network round-trips for query execution.
  • Single-node parallelism via DuckDB's vectorized execution engine.
  • Direct reads from object storage (MinIO, S3, GCS, Azure Blob, HDFS) without copying data into a cluster.

This model delivers sub-second response times on small queries that would otherwise be dominated by Spark or Trino startup overhead.

DuckLake catalog

DuckLake is a DuckDB-native catalog enabled by default in Ilum. Tables created through DuckLake are stored on S3-compatible object storage and accessible through DuckDB SQL with no additional configuration.

DuckLake is the default catalog for new DuckDB workloads. Hive Metastore tables remain accessible to DuckDB through standard catalog connectors.

Supported table formats

DuckDB reads and writes:

  • Parquet: Native, with predicate pushdown and zone maps.
  • CSV, JSON: Direct read with schema inference.
  • DuckLake-managed tables: ACID writes through DuckLake.
  • Delta Lake and Iceberg: Read access through DuckDB extensions.

Configuration

DuckDB and DuckLake are enabled out of the box. The relevant Helm values:

ilum-core:
  sql:
    duckdb:
      enabled: true
      idleTimeout: 1h
      ducklake:
        enabled: true

DuckLake table data is stored in MinIO (or any configured S3-compatible backend) at a path configurable through ilum-core.sql.duckdb.ducklake.path.
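
As a rough illustration of overriding these defaults, the sketch below writes a small values override file and prints it for review. The keys are the ones shown above; the `30m` value is an assumption (it presumes `idleTimeout` accepts the same duration syntax as the default `1h`), and the release and chart names in the final comment are placeholders because they are not specified here.

```shell
# Hypothetical override file; the keys mirror the Helm values shown above.
values_file="$(mktemp)"
cat > "$values_file" <<'EOF'
ilum-core:
  sql:
    duckdb:
      enabled: true
      idleTimeout: 30m   # assumption: same duration syntax as the default 1h
      ducklake:
        enabled: true
EOF

# Review the file before applying it.
cat "$values_file"

# Applying it depends on your release and chart names (placeholders below):
#   helm upgrade <release> <chart> -f "$values_file"
```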

Extension management

The Ilum image ships with DuckDB extensions pre-baked so that runtime sessions never reach the DuckDB extension registry. Two mechanisms coexist:

  • Pre-populated extension cache at ~/.duckdb/extensions/ inside the ilum-core container. The standard extensions httpfs, iceberg, postgres_scanner, and ducklake are placed here at image build time. DuckDB's autoload mechanism picks them up transparently the first time a session touches an s3:// path, an Iceberg table, a Postgres ATTACH, or a DuckLake catalog — no INSTALL or LOAD is required, and no outbound call is made.
  • Local extension repository at /duckdbExt inside the container, holding hive_metastore and duck_lineage. These are loaded explicitly by Ilum when a Hive metastore or Marquez lineage backend is configured.

| Extension | Source | How it loads |
| --- | --- | --- |
| httpfs | Pre-populated cache (~/.duckdb) | Autoload on first s3:// / https:// access or INSTALL httpfs; LOAD httpfs;. |
| iceberg | Pre-populated cache (~/.duckdb) | Autoload on first iceberg_scan(...) call or INSTALL iceberg; LOAD iceberg;. |
| postgres_scanner | Pre-populated cache (~/.duckdb) | Autoload on first Postgres ATTACH (including DuckLake's catalog) or INSTALL postgres_scanner; LOAD postgres_scanner;. |
| ducklake | Pre-populated cache (~/.duckdb) | Autoload on ATTACH 'ducklake:...' or INSTALL ducklake; LOAD ducklake;. |
| hive_metastore | Local repository (/duckdbExt) | Explicit INSTALL hive_metastore FROM '/duckdbExt'; LOAD hive_metastore; when a Hive metastore is configured. |
| duck_lineage | Local repository (/duckdbExt) | Explicit INSTALL duck_lineage FROM '/duckdbExt'; LOAD duck_lineage; when Marquez is configured. |

The bare DuckDB form INSTALL <extension_name>; LOAD <extension_name>; continues to work for all of these. For the cache-backed set, DuckDB resolves the extension locally. For hive_metastore and duck_lineage, the bare form would reach community-extensions.duckdb.org, so Ilum uses the explicit FROM '/duckdbExt' form internally, which keeps these extensions loadable in air-gapped and MITM-restricted deployments.

Adding custom extensions

To bundle extensions beyond the default set — a custom community extension, a private build, or a community extension that Ilum does not pre-stage — use ilum-core.sql.duckdb.extraExtensions. Exactly one source must be configured: a PersistentVolumeClaim (recommended) or a node hostPath. Files must follow the DuckDB layout v<duckdb-version>/<platform>/<name>.duckdb_extension (for example v1.5.1/linux_amd64/myext.duckdb_extension).
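
The directory layout described above can be sketched with a small helper. This is an illustrative shell function, not part of Ilum or DuckDB: `stage_extension` is a hypothetical name, and the version and platform values are taken from the example in the text. It copies an extension file into `<root>/v<duckdb-version>/<platform>/` and rejects files without the `.duckdb_extension` suffix.

```shell
# stage_extension SRC ROOT VERSION PLATFORM   (hypothetical helper)
# Copies SRC into ROOT/v<VERSION>/<PLATFORM>/ keeping its file name.
stage_extension() {
  src="$1"; root="$2"; version="$3"; platform="$4"
  case "$src" in
    *.duckdb_extension) ;;
    *) echo "expected a .duckdb_extension file: $src" >&2; return 1 ;;
  esac
  dest="$root/v$version/$platform"
  mkdir -p "$dest" && cp "$src" "$dest/"
}

# Example run against a scratch directory with an empty placeholder file.
work="$(mktemp -d)"
: > "$work/myext.duckdb_extension"
stage_extension "$work/myext.duckdb_extension" "$work/root" 1.5.1 linux_amd64
find "$work/root" -type f
```

The same tree can then be copied into whichever source (PVC or hostPath) is configured.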

The full schema of sql.duckdb.extraExtensions.* is documented on the ilum-core chart parameters page on ArtifactHub.
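
Because exactly one source must be set, a pre-flight check before running helm upgrade can catch a misconfigured values file early. The function below is a minimal sketch: `check_one_source` is a hypothetical helper, its two arguments stand for the `existingClaim` and `hostPath` values used in the examples on this page, and whether the chart itself rejects invalid combinations is not stated here.

```shell
# check_one_source EXISTING_CLAIM HOST_PATH   (hypothetical helper)
# Succeeds only when exactly one of the two arguments is non-empty.
check_one_source() {
  claim="$1"; host_path="$2"
  if [ -n "$claim" ] && [ -n "$host_path" ]; then
    echo "set either existingClaim or hostPath, not both" >&2
    return 1
  fi
  if [ -z "$claim" ] && [ -z "$host_path" ]; then
    echo "set one of existingClaim or hostPath" >&2
    return 1
  fi
  return 0
}

# A claim name with an empty hostPath passes the check.
check_one_source "ilum-duckdb-extra-extensions" "" && echo "ok"
```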

Source comparison

| Source | Holds the full v<ver>/<arch>/ tree? | Notes |
| --- | --- | --- |
| PersistentVolumeClaim | Yes, backed by a filesystem. | Recommended for production. Multi-arch / multi-version friendly. |
| hostPath | Yes, backed by a node filesystem. | Single-node only; binds the deployment to a specific node. |

Example: PVC source

Create a small PVC in the same namespace as ilum-core:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ilum-duckdb-extra-extensions
  namespace: ilum
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 200Mi

Populate it once via a temporary loader Pod that mounts the same claim. The Pod stays running long enough for the operator to copy files in, then is deleted:

apiVersion: v1
kind: Pod
metadata:
  name: duckdb-ext-loader
  namespace: ilum
spec:
  restartPolicy: Never
  containers:
    - name: shell
      image: busybox:1.36
      command: ["sleep", "1800"]
      volumeMounts:
        - name: ext
          mountPath: /data
  volumes:
    - name: ext
      persistentVolumeClaim:
        claimName: ilum-duckdb-extra-extensions

Copy the extension into the PVC and clean up:

kubectl -n ilum exec duckdb-ext-loader -- mkdir -p /data/v1.5.1/linux_amd64
kubectl -n ilum cp ./myext.duckdb_extension duckdb-ext-loader:/data/v1.5.1/linux_amd64/myext.duckdb_extension
kubectl -n ilum delete pod duckdb-ext-loader
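
Note: The loader Pod must be deleted before ilum-core rolls out the new revision that mounts the PVC. ReadWriteOnce (the access mode requested above) allows the claim to be mounted read-write by only one node at a time, so a loader Pod still running on a different node would keep the new ilum-core Pod from attaching the volume.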

Reference the PVC from helm_aio values:

ilum-core:
  sql:
    duckdb:
      extraExtensions:
        enabled: true
        mountPath: "/duckdbExt-extra"
        existingClaim: "ilum-duckdb-extra-extensions"

After helm upgrade, reference the extension from SQL:

INSTALL myext FROM '/duckdbExt-extra';
LOAD myext;

Example: hostPath source

For single-node clusters (development, edge deployments) where a PVC is overkill, hostPath mounts a directory directly from the Kubernetes node's filesystem into the ilum-core container.

Prepare the directory on the node where ilum-core will run, with the extensions laid out in DuckDB's expected v<duckdb-version>/<platform>/ structure:

# On the Kubernetes node (e.g. via SSH):
sudo mkdir -p /srv/duckdb-extra/v1.5.1/linux_amd64
sudo cp ./myext.duckdb_extension /srv/duckdb-extra/v1.5.1/linux_amd64/
# Make the files readable by the ilum-core container (UID 1001 by default):
sudo chmod -R a+rX /srv/duckdb-extra
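
To confirm the permissions took effect before upgrading, a coarse check along these lines may help. It is a sketch under assumptions: `check_readable` is a hypothetical helper, and it only inspects the "other" permission bits, which is what applies to the container user when that user is neither the owner nor in the owning group. The demonstration runs against a scratch directory; on the node you would pass /srv/duckdb-extra instead.

```shell
# check_readable DIR   (hypothetical helper)
# Prints entries under DIR that "other" users cannot read (or, for
# directories, cannot traverse) and fails if any are found.
check_readable() {
  dir="$1"
  bad="$(find "$dir" \( ! -perm -004 -o \( -type d ! -perm -001 \) \) -print)"
  if [ -n "$bad" ]; then
    printf 'not readable by other users:\n%s\n' "$bad" >&2
    return 1
  fi
}

# Demonstration on a scratch copy of the expected layout.
demo="$(mktemp -d)"
mkdir -p "$demo/v1.5.1/linux_amd64"
: > "$demo/v1.5.1/linux_amd64/myext.duckdb_extension"
chmod -R a+rX "$demo"
check_readable "$demo" && echo "layout is readable"
```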

Reference the host path from helm values:

ilum-core:
  sql:
    duckdb:
      extraExtensions:
        enabled: true
        mountPath: "/duckdbExt-extra"
        hostPath: "/srv/duckdb-extra"

After helm upgrade, the SQL form is identical to the PVC case:

INSTALL myext FROM '/duckdbExt-extra';
LOAD myext;

Warning: hostPath mounts pin the deployment to the node holding the files. If ilum-core is rescheduled to a different node, the mount will fail and the Pod will not start. Use a node selector or affinity rule to keep ilum-core on the prepared node, or migrate to a PVC for multi-node clusters.

Tip: The extraExtensions mechanism is the recommended way to add extensions in MITM-restricted or air-gapped environments where the DuckDB extension registry (extensions.duckdb.org) is unreachable. To make DuckDB's native HTTPS extensions (such as httpfs and aws) also trust an internal Certificate Authority in the same environment, follow the corporate MITM proxy walkthrough. For the broader deployment context, see the Air-gapped Installation Guide.

Selecting DuckDB in the SQL Editor

In the Ilum SQL Editor, the Engine Selector dropdown lets you choose DuckDB for any query. The engine status indicator confirms the in-process engine is ready.

When the automatic engine router is enabled, DuckDB is selected automatically for queries that target small datasets, DuckLake-managed tables, or ad-hoc exploration patterns.

Limitations

  • DuckDB is single-node; it does not scale horizontally across executors.
  • Query concurrency is bounded by the resources allocated to ilum-core.
  • Long-running queries should use Spark or Trino instead, both for resource isolation and for failure recovery.