
Run dbt Core on Spark (Kubernetes)

This guide explains how to set up dbt Core with Apache Spark running on a Kubernetes cluster. Using Ilum as the execution engine, you can run scalable data transformation pipelines directly on your data lake.

You have two primary ways to connect dbt to Spark on Ilum:

Thrift Server vs. Spark Connect

| Feature | Method 1: Spark Thrift Server (Legacy) | Method 2: Spark Connect (Modern) |
| --- | --- | --- |
| Protocol | JDBC/ODBC (via HiveDriver) | gRPC (via Spark Connect) |
| Connection Type | method: thrift | method: session |
| Architecture | Requires a dedicated Thrift Server pod | Connects directly to the Spark Driver |
| Performance | Higher latency (row-based serialization) | High performance (Arrow-based) |
| Best For | BI tools (Tableau, Power BI), legacy apps | Data engineering, Python/dbt pipelines |
New to Spark Connect?

For a deep dive into the architecture, check out our Spark Connect on Kubernetes Guide.
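The practical difference between the two methods is the endpoint your client speaks to. As a quick illustration (the helper names here are hypothetical, not part of dbt), a sketch that builds the two connection URLs used later in this guide:

```python
def thrift_jdbc_url(host: str, port: int, schema: str = "default") -> str:
    # HiveServer2 JDBC URL, as used by method: thrift (and by Beeline below).
    return f"jdbc:hive2://{host}:{port}/{schema}"

def spark_connect_url(host: str, port: int = 15002) -> str:
    # Spark Connect gRPC endpoint, expressed in Spark's sc:// URL scheme.
    return f"sc://{host}:{port}"

print(thrift_jdbc_url("localhost", 10009))  # jdbc:hive2://localhost:10009/default
print(spark_connect_url("localhost"))       # sc://localhost:15002
```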


Prerequisites

Before starting, ensure your development environment is ready:

  • Kubernetes Cluster: You need a running K8s cluster (GKE, EKS, AKS, or Minikube).
  • Tools: Helm, kubectl, and Python 3 with pip (Beeline is optional, for testing the connection).
  • Knowledge: Basic understanding of dbt projects and Spark concepts.

How to Configure dbt with Spark on Kubernetes

Choose your preferred connection method:

Step 1: Deploy Spark Thrift Server

Deploy Ilum with the SQL module (acting as a scalable Thrift server) and Hive Metastore enabled:

Helm Install
helm repo add ilum https://charts.ilum.cloud
helm install ilum ilum/ilum \
  --set ilum-hive-metastore.enabled=true \
  --set ilum-core.metastore.enabled=true \
  --set ilum-core.metastore.type=hive \
  --set ilum-sql.enabled=true \
  --set ilum-core.sql.enabled=true

Step 2: Connect to the Thrift Service

1. Identify the service:

Get Service
kubectl get service

Find the service with "sql-thrift-binary" in its name.

2. Port-forward:

Port Forward
kubectl port-forward svc/ilum-sql-thrift-binary 10009:10009

This makes the Thrift server available at localhost:10009.

3. Test with Beeline (optional):

beeline -u "jdbc:hive2://localhost:10009/default"

Run:

SHOW TABLES;

Expect an empty list or existing tables.
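Before running dbt, it can save time to confirm the port-forward tunnel is actually up. A minimal sketch (the `port_open` helper is hypothetical, not part of dbt or Ilum) that checks TCP reachability of the Thrift port:

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    # Return True if a TCP connection to host:port succeeds within the timeout.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# With the port-forward from the previous step active, this should print True:
print(port_open("localhost", 10009))
```

If this returns False, re-run the `kubectl port-forward` command before troubleshooting dbt itself.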

Configuring and Running dbt

1. Clean Environment (if needed):

Uninstall conflicting packages
pip uninstall dbt-spark pyspark -y

2. Install dbt and dependencies:

Install dbt-spark
pip install pyspark==3.5.7
pip install dbt-core
pip install "dbt-spark[PyHive,session]"
pip install --upgrade thrift

3. Verify installation:

Verify dbt
dbt --version

Create dbt Project

1. Initialize a dbt project:

Init Project
dbt init ilum_dbt_project

2. Answer the setup prompts:

Setup Prompts
Which database? 1 (spark)
host: localhost
Desired authentication method: 3 (thrift)
port: 10009
schema: default
threads: 1

This creates the ilum_dbt_project directory and a profiles.yml file in ~/.dbt/.

Configure dbt for Ilum

Edit ~/.dbt/profiles.yml to include both Thrift and Spark Connect targets:

~/.dbt/profiles.yml
ilum_dbt_project:
  target: thrift # Default target
  outputs:
    thrift:
      type: spark
      method: thrift
      host: localhost
      port: 10009
      schema: default
      threads: 1
      connect_retries: 5
      connect_timeout: 60
      connect_args:
        url: "jdbc:hive2://localhost:10009/default;transportMode=binary;hive.server2.transport.mode=binary"
        driver: "org.apache.hive.jdbc.HiveDriver"
        auth: "NONE"

    spark_connect:
      type: spark
      method: session
      host: localhost
      port: 15002
      schema: default
      threads: 1
Switch between targets:

Run dbt
# Use Thrift (default)
dbt run

# Use Spark Connect
dbt run --target spark_connect

# Or set default in dbt_project.yml
# target: spark_connect
Test the connection:

Debug dbt
cd ilum_dbt_project
dbt debug

Ensure no errors appear, indicating a successful connection to the Thrift server.

Create a Model to Write Data

1. Create models/sample_data.sql:

models/sample_data.sql
{{ config(materialized='table') }}

SELECT
  id,
  name
FROM (
  VALUES
    (1, 'Alice'),
    (2, 'Bob')
) AS t(id, name)

2. Run the model:

Run sample_data
dbt run --select sample_data

Create a Model to Read Data

1. Create models/read_data.sql:

models/read_data.sql
{{ config(materialized='table') }}

SELECT
  id,
  name,
  LENGTH(name) AS name_length
FROM {{ ref('sample_data') }}

2. Run the model:

Run read_data
dbt run --select read_data

Verify Results

1. Monitor the job in the Ilum UI:

  • Access the Ilum UI (URL provided in your Ilum setup, e.g. via port-forward)
  • Navigate to the Jobs section
  • Look for the job named ilum-sql-spark-engine
  • Check job status, logs, and execution details to confirm successful processing

2. Query with Beeline:

beeline -u "jdbc:hive2://localhost:10009/default"

3. Run the query:

SELECT * FROM default.read_data;

Expected output:

+----+-------+-------------+
| id | name  | name_length |
+----+-------+-------------+
| 1  | Alice | 5           |
| 2  | Bob   | 3           |
+----+-------+-------------+
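The expected table can also be sanity-checked off-cluster: a sketch that mirrors the two models' logic in plain Python and reproduces the rows above.

```python
# models/sample_data.sql: two literal rows.
sample_data = [(1, "Alice"), (2, "Bob")]

# models/read_data.sql: add LENGTH(name) AS name_length.
read_data = [(id_, name, len(name)) for id_, name in sample_data]

for row in read_data:
    print(row)
# (1, 'Alice', 5)
# (2, 'Bob', 3)
```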

Troubleshooting dbt-spark Connections

Common issues when connecting dbt to Spark on Kubernetes:

Error: "ThriftTransportException: Could not connect to localhost:10009"

Cause: The port-forwarding tunnel is down or the Thrift Server pod is not running.

Solution:

  1. Check if the Thrift pod is running: kubectl get pods -l app.kubernetes.io/name=ilum-sql
  2. Restart port-forwarding: kubectl port-forward svc/ilum-sql-thrift-binary 10009:10009

Error: "grpc._channel._InactiveRpcError: failed to connect to all addresses"

Cause: Your local dbt client cannot reach the Spark Connect gRPC port (15002).

Solution:

  • Ensure you have port-forwarded the Driver Pod, not the Service (unless using NodePort).
  • Verify you are using method: session in profiles.yml.

Error: "AnalysisException: Table or view not found"

Cause: Hive Metastore connectivity issue.

Solution:

  • Ensure ilum-core.metastore.enabled=true was set during Helm install.
  • Check if the schema (database) exists in Spark: spark.sql("SHOW DATABASES").show()

Orchestration

For production orchestration using Apache Airflow, see the dedicated guide: Orchestrate dbt with Airflow