Run dbt Core on Spark (Kubernetes)
This guide explains how to set up dbt Core with Apache Spark running on a Kubernetes cluster. Using Ilum as the execution engine, you can run scalable data transformation pipelines directly on your data lake.
You have two primary ways to connect dbt to Spark on Ilum:
Thrift Server vs. Spark Connect
| Feature | Method 1: Spark Thrift Server (Legacy) | Method 2: Spark Connect (Modern) |
|---|---|---|
| Protocol | JDBC/ODBC (via HiveDriver) | gRPC (via Spark Connect) |
| Connection Type | method: thrift | method: session |
| Architecture | Requires a dedicated Thrift Server pod | Connects directly to Spark Driver |
| Performance | Higher latency (row-based serialization) | High performance (Arrow-based) |
| Best For | BI Tools (Tableau, PowerBI), Legacy apps | Data Engineering, Python/dbt pipelines |
For a deep dive into the architecture, check out our Spark Connect on Kubernetes Guide.
Prerequisites
Before starting, ensure your development environment is ready:
- Kubernetes Cluster: You need a running K8s cluster (GKE, EKS, AKS, or Minikube).
- Tools:
- Helm (for deploying Ilum).
- kubectl (configured to access your cluster).
- Python 3.8+ (for running dbt Core).
- Knowledge: Basic understanding of dbt projects and Spark concepts.
How to Configure dbt with Spark on Kubernetes
Choose your preferred connection method:
- Method 1: Thrift Server
- Method 2: Spark Connect
Method 1: Thrift Server
Step 1: Deploy Spark Thrift Server
Deploy Ilum with the SQL module (acting as a scalable Thrift server) and Hive Metastore enabled:
helm repo add ilum https://charts.ilum.cloud
helm install ilum ilum/ilum \
--set ilum-hive-metastore.enabled=true \
--set ilum-core.metastore.enabled=true \
--set ilum-core.metastore.type=hive \
--set ilum-sql.enabled=true \
--set ilum-core.sql.enabled=true
Step 2: Connect to the Thrift Service
1. Identify the service:
kubectl get service
Find the service with "sql-thrift-binary" in its name.
2. Port-forward:
kubectl port-forward svc/ilum-sql-thrift-binary 10009:10009
This makes the Thrift server available at localhost:10009.
3. Test with Beeline (optional):
beeline -u "jdbc:hive2://localhost:10009/default"
Run:
SHOW TABLES;
Expect an empty list or existing tables.
Configuring and Running dbt
1. Clean Environment (if needed):
pip uninstall dbt-spark pyspark -y
2. Install dbt and dependencies:
pip install pyspark==3.5.7
pip install dbt-core
pip install "dbt-spark[PyHive,session]"
pip install --upgrade thrift
3. Verify installation:
dbt --version
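As a complement to `dbt --version`, you can confirm which of the packages installed above are actually visible to your Python environment. This is an illustrative sketch using only the standard library:

```python
import importlib.metadata as md

def installed_version(pkg: str) -> str:
    """Return the installed version of pkg, or a marker if it is missing."""
    try:
        return md.version(pkg)
    except md.PackageNotFoundError:
        return "NOT installed"

# The packages installed in the steps above.
for pkg in ("dbt-core", "dbt-spark", "pyspark"):
    print(f"{pkg}: {installed_version(pkg)}")
```

If `dbt-spark` shows as missing here but `dbt --version` works, you are likely running dbt from a different virtual environment than the one you installed into.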
Create dbt Project
1. Initialize a dbt project:
dbt init ilum_dbt_project
2. Answer the setup prompts:
Which database? 1 (spark)
host: localhost
Desired authentication method: 3 (thrift)
port: 10009
schema: default
threads: 1
This creates the ilum_dbt_project directory and a profiles.yml file in ~/.dbt/.
Configure dbt for Ilum
Edit ~/.dbt/profiles.yml to include both Thrift and Spark Connect targets:
ilum_dbt_project:
target: thrift # Default target
outputs:
thrift:
type: spark
method: thrift
host: localhost
port: 10009
schema: default
threads: 1
connect_retries: 5
connect_timeout: 60
connect_args:
url: "jdbc:hive2://localhost:10009/default;transportMode=binary;hive.server2.transport.mode=binary"
driver: "org.apache.hive.jdbc.HiveDriver"
auth: "NONE"
spark_connect:
type: spark
method: session
host: localhost
port: 15002
schema: default
threads: 1
Switch between targets:
# Use Thrift (default)
dbt run
# Use Spark Connect
dbt run --target spark_connect
# Or set default in dbt_project.yml
# target: spark_connect
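Before running dbt, you can sanity-check the profile structure with PyYAML (which ships as a dependency of dbt-core). This is a sketch that embeds the profile from above as a string; in practice you would load `~/.dbt/profiles.yml`:

```python
import yaml  # PyYAML, installed as a dbt-core dependency

# The profile shown above, embedded for illustration.
profile_yaml = """
ilum_dbt_project:
  target: thrift
  outputs:
    thrift:
      type: spark
      method: thrift
      host: localhost
      port: 10009
      schema: default
      threads: 1
    spark_connect:
      type: spark
      method: session
      host: localhost
      port: 15002
      schema: default
      threads: 1
"""

profile = yaml.safe_load(profile_yaml)["ilum_dbt_project"]

# The default target must exist under `outputs`.
assert profile["target"] in profile["outputs"]

# Both methods use the same adapter (`type: spark`); only `method` and `port` differ.
for output in profile["outputs"].values():
    assert output["type"] == "spark"

print(sorted(profile["outputs"]))  # → ['spark_connect', 'thrift']
```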
Test the connection:
cd ilum_dbt_project
dbt debug
Ensure no errors appear, indicating a successful connection to the Thrift server.
Create a Model to Write Data
1. Create the model models/sample_data.sql:
{{ config(materialized='table') }}
SELECT
    id,
    name
FROM (
    VALUES
        (1, 'Alice'),
        (2, 'Bob')
) AS t(id, name)
2. Run the model:
dbt run --select sample_data
Create a Model to Read Data
1. Create the model models/read_data.sql:
{{ config(materialized='table') }}
SELECT
    id,
    name,
    LENGTH(name) AS name_length
FROM {{ ref('sample_data') }}
2. Run the model:
dbt run --select read_data
Verify Results
1. Monitor the job in the Ilum UI:
- Access the Ilum UI (URL provided in your Ilum setup, e.g. via port-forward).
- Navigate to the Jobs section.
- Look for the job named ilum-sql-spark-engine.
- Check job status, logs, and execution details to confirm successful processing.
2. Query with Beeline:
beeline -u "jdbc:hive2://localhost:10009/default"
3. Run the query:
SELECT * FROM default.read_data;
Expected output:
+----+-------+-------------+
| id | name  | name_length |
+----+-------+-------------+
| 1  | Alice | 5           |
| 2  | Bob   | 3           |
+----+-------+-------------+
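The expected rows can be cross-checked by replaying the two models' logic in plain Python. This is only a sketch of what the SQL computes, not how dbt executes it:

```python
# models/sample_data.sql materializes two literal rows.
sample_data = [(1, "Alice"), (2, "Bob")]

# models/read_data.sql selects from {{ ref('sample_data') }} and adds LENGTH(name).
read_data = [(id_, name, len(name)) for id_, name in sample_data]

for row in read_data:
    print(row)
# → (1, 'Alice', 5)
# → (2, 'Bob', 3)
```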
Method 2: Spark Connect
Spark Connect is the recommended way for modern data engineering teams to run dbt on Kubernetes. It eliminates the need for a heavy intermediate Thrift Server, reducing costs and complexity.
Step 1: Deploy Spark Connect Job
1. Log into the Ilum UI.
2. Navigate to the Workloads → Jobs section.
3. Click the "New Job" button.
4. Configure the job:
- Name: spark-connect-dbt
- Job Type: select Spark Connect Job
5. Add the Spark Connect dependency (if needed):
Most Spark distributions don't include Spark Connect by default, so add it as a package dependency:
- Click the Configuration tab.
- In the Parameters section, click Add Parameter.
- Add the following parameter:
| Key | Value |
|---|---|
| spark.jars.packages | org.apache.spark:spark-connect_2.12:3.5.7 |
Note: Replace 2.12 with your Scala version and 3.5.7 with your Spark version to match your environment.
6. Click Submit.
The server has started successfully when you see this in the logs:
Spark Connect server started at: 0:0:0:0:0:0:0:0%0:15002
Connecting to the Spark Connect Server
Get the Connection URL
After the job starts, Ilum provides a Spark Connect URL on the job details page.
The URL format is: sc://job-xxxxx-driver-svc:15002
Port-Forward for Local Access
To connect from your local machine, forward the driver pod's port:
1. Find the driver pod name from the Logs tab in the Ilum UI.
Example: if the URL is sc://job-20250807-1557-ablr2a52vxd-driver-svc:15002, the pod name is job-20250807-1557-ablr2a52vxd-driver (remove the -svc suffix).
2. Port-forward:
kubectl port-forward <driver-pod-name> 15002:15002
Keep this terminal window open.
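The service-to-pod name mapping described above can be scripted. The helper below is an illustrative sketch (not part of Ilum or dbt) that derives the pod name to pass to kubectl port-forward:

```python
from urllib.parse import urlparse

def driver_pod_name(connect_url: str) -> str:
    """Derive the driver pod name from an Ilum Spark Connect URL.

    The URL targets the driver *service*; the pod name is the same
    hostname without the trailing '-svc'.
    """
    host = urlparse(connect_url).hostname
    if not host or not host.endswith("-svc"):
        raise ValueError(f"unexpected Spark Connect URL: {connect_url!r}")
    return host[: -len("-svc")]

url = "sc://job-20250807-1557-ablr2a52vxd-driver-svc:15002"
print(driver_pod_name(url))
# → job-20250807-1557-ablr2a52vxd-driver
```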
Create dbt Project
Initialize a dbt project (if needed):
dbt init ilum_dbt_spark_connect_project
Answer the setup prompts:
Which database? 1 (spark)
host: localhost
Desired authentication method: 4 (session) # or 3 if "session" is not listed
port: 15002
schema: default
threads: 1
This creates the ilum_dbt_spark_connect_project directory and updates ~/.dbt/profiles.yml.
Configure dbt for Spark Connect
If you followed the Thrift setup above, your ~/.dbt/profiles.yml already has both targets configured. You can use the same ilum_dbt_project profile.
To use Spark Connect, simply specify the target:
cd ilum_dbt_project # Use the same project as Thrift
dbt debug --target spark_connect
dbt run --target spark_connect
Or create a separate project (if you prefer isolation):
Edit ~/.dbt/profiles.yml:
ilum_dbt_spark_connect_project:
target: dev
outputs:
dev:
type: spark
method: session
host: localhost
port: 15002
schema: default
threads: 1
Test the connection:
cd ..
cd ilum_dbt_spark_connect_project
dbt debug
Recommended approach: Use one dbt project with multiple targets (as shown in the Thrift section). This allows you to switch between Thrift and Spark Connect without maintaining separate projects.
You should see successful connection messages.
Create a Model to Write Data
1. Create the model models/sample_data_connect.sql:
{{ config(materialized='table') }}
SELECT
    id,
    name
FROM (
    VALUES
        (1, 'Peter'),
        (2, 'John')
) AS t(id, name)
2. Run the model:
dbt run --select sample_data_connect --target spark_connect
Create a Model to Read Data
1. Create the model models/read_data_connect.sql:
{{ config(materialized='table') }}
SELECT
    id,
    name,
    LENGTH(name) AS name_length
FROM {{ ref('sample_data_connect') }}
2. Run the model:
dbt run --select read_data_connect --target spark_connect
Note: The --target spark_connect flag ensures dbt uses the Spark Connect configuration instead of the default Thrift target.
Verify Results
1. Monitor the job in the Ilum UI:
- Access the Ilum UI (URL provided in your Ilum setup, e.g. via port-forward).
- Navigate to the Jobs section.
- Look for the job named spark-connect-dbt.
- Check job status, logs, and execution details to confirm successful processing.
2. Print data in a dbt job:
To verify the data landed in the Spark warehouse (e.g. spark-warehouse/read_data_connect relative to your project directory), create a dbt macro and run a custom operation that queries and prints the read_data_connect table's contents.
Create macros/print_table.sql in your dbt project directory:
{% macro print_table(table_name) %}
  {% set query %}
    SELECT * FROM {{ ref(table_name) }}
  {% endset %}
  {% do log('Printing table contents for ' ~ table_name ~ ':', True) %}
  {% set results = run_query(query) %}
  {% if results %}
    {% for row in results %}
      {% do log(row, True) %}
    {% endfor %}
  {% else %}
    {% do log('No data found in ' ~ table_name, True) %}
  {% endif %}
{% endmacro %}
Run the macro to print the read_data_connect table after your dbt models have run:
dbt run-operation print_table --args '{"table_name": "read_data_connect"}'
The dbt run-operation command executes the macro, querying the read_data_connect table and logging its contents. Expected output in the dbt logs or console:
Printing table contents for read_data_connect:
<agate.Row: (2, 'John', 4)>
<agate.Row: (1, 'Peter', 5)>
Note: The output appears in the dbt logs or console by default in dbt 1.9.4. For more detailed logs, you can use:
dbt run-operation print_table --args '{"table_name": "read_data_connect"}' --log-level debug
Troubleshooting dbt-spark Connections
Common issues when connecting dbt to Spark on Kubernetes:
Error: "ThriftTransportException: Could not connect to localhost:10009"
Cause: The port forwarding tunnel is down or the Thrift Server pod is not running. Solution:
- Check if the Thrift pod is running:
kubectl get pods -l app.kubernetes.io/name=ilum-sql
- Restart port-forwarding:
kubectl port-forward svc/ilum-sql-thrift-binary 10009:10009
Error: "grpc._channel._InactiveRpcError: failed to connect to all addresses"
Cause: Your local dbt client cannot reach the Spark Connect gRPC port (15002). Solution:
- Ensure you have port-forwarded the Driver Pod, not the Service (unless using NodePort).
- Verify you are using method: session in profiles.yml.
Error: "AnalysisException: Table or view not found"
Cause: Hive Metastore connectivity issue. Solution:
- Ensure ilum-core.metastore.enabled=true was set during the Helm install.
- Check if the schema (database) exists in Spark:
spark.sql("SHOW DATABASES").show()
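The first two errors above usually come down to a dead port-forward tunnel. A quick standard-library check (a sketch, using the ports from this guide) tells you which forward to restart before digging into dbt configuration:

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# The two locally forwarded ports used in this guide.
for name, port in [("Thrift", 10009), ("Spark Connect", 15002)]:
    state = "reachable" if port_open("localhost", port) else "NOT reachable"
    print(f"{name} port {port}: {state}")
```

A "NOT reachable" result means the corresponding kubectl port-forward process has exited and needs to be restarted.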
Orchestration
For production orchestration using Apache Airflow, see the dedicated guide: Orchestrate dbt with Airflow