DuckDB

DuckDB is an embedded analytical database that runs in-process inside ilum-core. It executes SQL on a single node with no pod startup and no network hop between client and engine, which suits small-to-medium data and ad-hoc exploration. Combined with the DuckLake catalog, DuckDB is a first-class option for fast local analytics over object storage.

DuckDB is enabled by default in Ilum.

When to use DuckDB

DuckDB is the right engine for:

Quick queries on small-to-medium datasets.
Ad-hoc exploration where pod startup latency would be a bottleneck.
Analytics over DuckLake-managed tables.
Single-user, single-node workloads.
Rapid prototyping before scaling out to Spark or Trino.

For distributed workloads on large data, prefer Apache Spark. For interactive analytics on large data with concurrent users, prefer Trino.

Execution model

DuckDB runs in-process with ilum-core:

No driver pod, no executor pods, no network round-trips for query execution.
Single-node parallelism via DuckDB's vectorized execution engine.
Direct reads from object storage (MinIO, S3, GCS, Azure Blob, HDFS) without copying data into a cluster.

This model delivers sub-second response times on small queries that would otherwise be dominated by Spark or Trino startup overhead.

DuckLake catalog

DuckLake is a DuckDB-native catalog enabled by default in Ilum. Tables created through DuckLake are stored on S3-compatible object storage and accessible through DuckDB SQL with no additional configuration.

DuckLake is the default catalog for new DuckDB workloads. Hive Metastore tables remain accessible to DuckDB through standard catalog connectors.

Supported table formats

DuckDB works with the following formats (write support depends on the format):

Parquet: Native, with predicate pushdown and zone maps.
CSV, JSON: Direct read with schema inference.
DuckLake-managed tables: ACID writes through DuckLake.
Delta Lake and Iceberg: Read access through DuckDB extensions.

Configuration

DuckDB and DuckLake are enabled out of the box. The relevant Helm values:

ilum-core:
  sql:
    duckdb:
      enabled: true
      idleTimeout: 1h
      ducklake:
        enabled: true

DuckLake table data is stored in MinIO (or any configured S3-compatible backend) at a path configurable through ilum-core.sql.duckdb.ducklake.path.

Extension management

The Ilum image ships with DuckDB extensions pre-baked so that runtime sessions never reach the DuckDB extension registry. Two mechanisms coexist:

Pre-populated extension cache at ~/.duckdb/extensions/ inside the ilum-core container. The standard extensions httpfs, iceberg, postgres_scanner, and ducklake are placed here at image build time. DuckDB's autoload mechanism picks them up transparently the first time a session touches an s3:// path, an Iceberg table, a Postgres ATTACH, or a DuckLake catalog — no INSTALL or LOAD is required, and no outbound call is made.
Local extension repository at /duckdbExt inside the container, holding hive_metastore and duck_lineage. These are loaded explicitly by Ilum when a Hive metastore or Marquez lineage backend is configured.

Extension	Source	How it loads
`httpfs`	Pre-populated cache (`~/.duckdb`)	Autoload on first `s3://` / `https://` access or `INSTALL httpfs; LOAD httpfs;`.
`iceberg`	Pre-populated cache (`~/.duckdb`)	Autoload on first `iceberg_scan(...)` call or `INSTALL iceberg; LOAD iceberg;`.
`postgres_scanner`	Pre-populated cache (`~/.duckdb`)	Autoload on first Postgres `ATTACH` (including DuckLake's catalog) or `INSTALL postgres_scanner; LOAD postgres_scanner;`.
`ducklake`	Pre-populated cache (`~/.duckdb`)	Autoload on `ATTACH 'ducklake:...'` or `INSTALL ducklake; LOAD ducklake;`.
`hive_metastore`	Local repository (`/duckdbExt`)	Explicit `INSTALL hive_metastore FROM '/duckdbExt'; LOAD hive_metastore;` when a Hive metastore is configured.
`duck_lineage`	Local repository (`/duckdbExt`)	Explicit `INSTALL duck_lineage FROM '/duckdbExt'; LOAD duck_lineage;` when Marquez is configured.

The bare DuckDB form INSTALL <extension_name>; LOAD <extension_name>; continues to work for all of these. For the cache-backed set, DuckDB resolves the extension locally. For hive_metastore and duck_lineage, the bare form would make DuckDB contact community-extensions.duckdb.org, so Ilum internally uses the explicit FROM '/duckdbExt' form, which also works in air-gapped and MITM-restricted deployments.

Adding custom extensions

To bundle extensions beyond the default set — a custom community extension, a private build, or a community extension that Ilum does not pre-stage — use ilum-core.sql.duckdb.extraExtensions. Exactly one source must be configured: a PersistentVolumeClaim (recommended) or a node hostPath. Files must follow the DuckDB layout v<duckdb-version>/<platform>/<name>.duckdb_extension (for example v1.5.1/linux_amd64/myext.duckdb_extension).
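The layout rule above can be sketched as a small shell helper that copies one extension binary into the v<duckdb-version>/<platform>/ tree under a staging directory. The name stage_extension and its argument order are hypothetical, chosen only for illustration; the directory layout itself is the one described above.

```shell
# stage_extension ROOT VERSION PLATFORM FILE
# Hypothetical helper (not part of Ilum or DuckDB): copies FILE to
# ROOT/VERSION/PLATFORM/ and prints the resulting path.
stage_extension() {
  root="$1"      # directory that will later be mounted into ilum-core
  version="$2"   # DuckDB version including the leading "v", e.g. v1.5.1
  platform="$3"  # DuckDB platform string, e.g. linux_amd64
  src="$4"       # path to <name>.duckdb_extension

  # Refuse files that do not carry the expected suffix.
  case "$src" in
    *.duckdb_extension) ;;
    *) echo "expected a .duckdb_extension file: $src" >&2; return 1 ;;
  esac

  dest="$root/$version/$platform"
  mkdir -p "$dest" && cp "$src" "$dest/" && echo "$dest/$(basename "$src")"
}
```

For example, stage_extension /srv/duckdb-extra v1.5.1 linux_amd64 ./myext.duckdb_extension prints the destination path on success and returns non-zero if the file name does not end in .duckdb_extension.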

The full schema of sql.duckdb.extraExtensions.* is documented on the ilum-core chart parameters page on ArtifactHub.

Source comparison

Source	Holds the full `v<ver>/<arch>/` tree?	Notes
`PersistentVolumeClaim`	Yes — backed by a filesystem.	Recommended for production. Multi-arch / multi-version friendly.
`hostPath`	Yes — backed by a node filesystem.	Single-node only; binds the deployment to a specific node.

Example: PVC source

Create a small PVC in the same namespace as ilum-core:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ilum-duckdb-extra-extensions
  namespace: ilum
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 200Mi

Populate it once via a temporary loader Pod that mounts the same claim. The Pod sleeps for 30 minutes (sleep 1800), which gives the operator time to copy files in, and is then deleted:

apiVersion: v1
kind: Pod
metadata:
  name: duckdb-ext-loader
  namespace: ilum
spec:
  restartPolicy: Never
  containers:
    - name: shell
      image: busybox:1.36
      command: ["sleep", "1800"]
      volumeMounts:
        - name: ext
          mountPath: /data
  volumes:
    - name: ext
      persistentVolumeClaim:
        claimName: ilum-duckdb-extra-extensions

Copy the extension into the PVC and clean up:

kubectl -n ilum exec duckdb-ext-loader -- mkdir -p /data/v1.5.1/linux_amd64
kubectl -n ilum cp ./myext.duckdb_extension duckdb-ext-loader:/data/v1.5.1/linux_amd64/myext.duckdb_extension
kubectl -n ilum delete pod duckdb-ext-loader

note

Delete the loader Pod before ilum-core rolls out the new revision that mounts the PVC. The claim uses ReadWriteOnce, which lets only one node mount the volume at a time, so a loader Pod that is still running on another node can block the ilum-core Pod from starting.
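One way to guard the rollout is to check that the loader Pod no longer exists before running helm upgrade. The following is a minimal sketch, not an Ilum-provided tool: loader_pod_gone is a hypothetical helper, and it treats any non-zero exit from kubectl get as "Pod absent", which also covers connectivity errors, so it is only meaningful against a reachable cluster.

```shell
# loader_pod_gone NAMESPACE POD
# Hypothetical helper: returns 0 when `kubectl get pod` cannot find the Pod,
# and 1 (with a message on stderr) when the Pod still exists.
# Caveat: kubectl also exits non-zero when the cluster is unreachable.
loader_pod_gone() {
  if kubectl -n "$1" get pod "$2" >/dev/null 2>&1; then
    echo "Pod $2 still exists in namespace $1; delete it before helm upgrade" >&2
    return 1
  fi
}
```

With the names used in this example, the check would be loader_pod_gone ilum duckdb-ext-loader, run just before the upgrade.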

Reference the PVC from helm_aio values:

ilum-core:
  sql:
    duckdb:
      extraExtensions:
        enabled: true
        mountPath: "/duckdbExt-extra"
        existingClaim: "ilum-duckdb-extra-extensions"

After helm upgrade, reference the extension from SQL:

INSTALL myext FROM '/duckdbExt-extra';
LOAD myext;

Example: hostPath source

For single-node clusters (development, edge deployments) where a PVC is overkill, hostPath mounts a directory directly from the Kubernetes node's filesystem into the ilum-core container.

Prepare the directory on the node where ilum-core will run, with the extensions laid out in DuckDB's expected v<duckdb-version>/<platform>/ structure:

# On the Kubernetes node (e.g. via SSH):
sudo mkdir -p /srv/duckdb-extra/v1.5.1/linux_amd64
sudo cp ./myext.duckdb_extension /srv/duckdb-extra/v1.5.1/linux_amd64/
# Make the files readable by the ilum-core container (UID 1001 by default):
sudo chmod -R a+rX /srv/duckdb-extra
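Before pointing the chart at the directory, it can help to confirm that the tree has the expected shape. The sketch below assumes the layout described above; check_extension_tree is a hypothetical helper, and it only inspects the permission bits of the extension files themselves, not of the parent directories.

```shell
# check_extension_tree ROOT
# Hypothetical helper: succeeds only if at least one file sits exactly at
# ROOT/v<version>/<platform>/<name>.duckdb_extension and every such file
# is readable by "other" (the container runs under a different UID).
check_extension_tree() {
  root="$1"
  found=0
  for f in "$root"/v*/*/*.duckdb_extension; do
    [ -f "$f" ] || continue   # an unmatched glob stays literal; skip it
    found=1
    # find prints the path only when the "other read" bit is set.
    if [ -z "$(find "$f" -perm -004)" ]; then
      echo "not world-readable: $f" >&2
      return 1
    fi
    echo "ok: $f"
  done
  [ "$found" -eq 1 ] || { echo "no extensions found under $root" >&2; return 1; }
}
```

On the node prepared above, the call would be check_extension_tree /srv/duckdb-extra.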

Reference the host path from Helm values:

ilum-core:
  sql:
    duckdb:
      extraExtensions:
        enabled: true
        mountPath: "/duckdbExt-extra"
        hostPath: "/srv/duckdb-extra"

After helm upgrade, the SQL form is identical to the PVC case:

INSTALL myext FROM '/duckdbExt-extra';
LOAD myext;

warning

hostPath mounts pin the deployment to the node holding the files. If ilum-core is rescheduled to a different node, the mount will fail and the Pod will not start. Use a node selector or affinity rule to keep ilum-core on the prepared node, or migrate to a PVC for multi-node clusters.

tip

Bundling extensions through extraExtensions is the recommended mechanism in MITM-restricted or air-gapped environments where the DuckDB extension registry (extensions.duckdb.org) is unreachable. To make DuckDB's native HTTPS extensions (such as httpfs and aws) also trust an internal Certificate Authority in the same environment, follow the corporate MITM proxy walkthrough. For the broader deployment context, see the Air-gapped Installation Guide.

Selecting DuckDB in the SQL Editor

In the Ilum SQL Editor, the Engine Selector dropdown lets you choose DuckDB for any query. The engine status indicator confirms the in-process engine is ready.

When the automatic engine router is enabled, DuckDB is selected automatically for queries that target small datasets, DuckLake-managed tables, or ad-hoc exploration patterns.

Limitations

DuckDB is single-node; it does not scale horizontally across executors.
Query concurrency is bounded by the resources allocated to ilum-core.
Long-running queries should use Spark or Trino instead, both for resource isolation and for failure recovery.
