Run dbt Core on Spark (Kubernetes)
This guide explains how to set up dbt Core with Apache Spark running on a Kubernetes cluster. Using Ilum as the execution engine, you can run scalable data transformation pipelines directly on your data lake.
You have two primary ways to connect dbt to Spark on Ilum:
Thrift Server vs. Spark Connect
| Feature | Method 1: Spark Thrift Server (Legacy) | Method 2: Spark Connect (Modern) |
|---|---|---|
| Protocol | JDBC/ODBC (via HiveDriver) | gRPC (via Spark Connect) |
| Connection Type | method: thrift | method: session |
| Architecture | Requires a dedicated Thrift Server pod | Connects directly to Spark Driver |
| Performance | Higher latency (row-based serialization) | High performance (Arrow-based) |
| Best For | BI Tools (Tableau, PowerBI), Legacy apps | Data Engineering, Python/dbt pipelines |
For a deep dive into the architecture, check out our Spark Connect on Kubernetes Guide.
Prerequisites
Before starting, ensure your development environment is ready:
- Kubernetes Cluster: You need a running K8s cluster (GKE, EKS, AKS, or Minikube).
- Tools:
- Helm (for deploying Ilum).
- kubectl (configured to access your cluster).
- Python 3.8+ (for running dbt Core).
- Knowledge: Basic understanding of dbt projects and Spark concepts.
How to Configure dbt with Spark on Kubernetes
Choose your preferred connection method:
- Method 1: Thrift Server
- Method 2: Spark Connect
Method 1: Thrift Server
Step 1: Deploy Spark Thrift Server
Deploy Ilum with the SQL module (acting as a scalable Thrift server) and Hive Metastore enabled:
helm repo add ilum https://charts.ilum.cloud
helm install ilum ilum/ilum \
--set ilum-hive-metastore.enabled=true \
--set ilum-core.metastore.enabled=true \
--set ilum-core.metastore.type=hive \
--set ilum-sql.enabled=true \
--set ilum-core.sql.enabled=true
Step 2: Connect to the Thrift Service
1. Identify the service:
kubectl get service
Find the service with "sql-thrift-binary" in its name.
2. Port-forward:
kubectl port-forward svc/ilum-sql-thrift-binary 10009:10009
This makes the Thrift server available at localhost:10009.
3. Test with Beeline (optional):
beeline -u "jdbc:hive2://localhost:10009/default"
Run:
SHOW TABLES;
Expect an empty list or existing tables.
Configuring and Running dbt
1. Clean Environment (if needed):
pip uninstall dbt-spark pyspark -y
2. Install dbt and dependencies:
pip install pyspark==3.5.7
pip install dbt-core
pip install "dbt-spark[PyHive,session]"
pip install --upgrade thrift
3. Verify installation:
dbt --version
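As a complement to `dbt --version`, you can confirm which of the packages installed above are actually visible to your Python environment. This is an illustrative sketch using only the standard library:

```python
import importlib.metadata as md

def installed_version(pkg: str) -> str:
    """Return the installed version of pkg, or a marker if it is missing."""
    try:
        return md.version(pkg)
    except md.PackageNotFoundError:
        return "NOT installed"

# The packages installed in the steps above.
for pkg in ("dbt-core", "dbt-spark", "pyspark"):
    print(f"{pkg}: {installed_version(pkg)}")
```

If `dbt-spark` shows as missing here but `dbt --version` works, you are likely running dbt from a different virtual environment than the one you installed into.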
Create dbt Project
1. Initialize a dbt project:
dbt init ilum_dbt_project
2. Answer the setup prompts:
Which database? 1 (spark)
host: localhost
Desired authentication method: 3 (thrift)
port: 10009
schema: default
threads: 1
This creates the ilum_dbt_project directory and a profiles.yml file in ~/.dbt/.
Configure dbt for Ilum
Edit ~/.dbt/profiles.yml to include both Thrift and Spark Connect targets:
ilum_dbt_project:
target: thrift # Default target
outputs:
thrift:
type: spark
method: thrift
host: localhost
port: 10009
schema: default
threads: 1
connect_retries: 5
connect_timeout: 60
connect_args:
url: "jdbc:hive2://localhost:10009/default;transportMode=binary;hive.server2.transport.mode=binary"
driver: "org.apache.hive.jdbc.HiveDriver"
auth: "NONE"
spark_connect:
type: spark
method: session
host: localhost
port: 15002
schema: default
threads: 1
Switch between targets:
# Use Thrift (default)
dbt run
# Use Spark Connect
dbt run --target spark_connect
# Or set default in dbt_project.yml
# target: spark_connect
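Before running dbt, you can sanity-check the profile structure with PyYAML (which ships as a dependency of dbt-core). This is a sketch that embeds the profile from above as a string; in practice you would load `~/.dbt/profiles.yml`:

```python
import yaml  # PyYAML, installed as a dbt-core dependency

# The profile shown above, embedded for illustration.
profile_yaml = """
ilum_dbt_project:
  target: thrift
  outputs:
    thrift:
      type: spark
      method: thrift
      host: localhost
      port: 10009
      schema: default
      threads: 1
    spark_connect:
      type: spark
      method: session
      host: localhost
      port: 15002
      schema: default
      threads: 1
"""

profile = yaml.safe_load(profile_yaml)["ilum_dbt_project"]

# The default target must exist under `outputs`.
assert profile["target"] in profile["outputs"]

# Both methods use the same adapter (`type: spark`); only `method` and `port` differ.
for output in profile["outputs"].values():
    assert output["type"] == "spark"

print(sorted(profile["outputs"]))  # → ['spark_connect', 'thrift']
```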
Test the connection:
cd ilum_dbt_project
dbt debug
Ensure no errors appear, indicating a successful connection to the Thrift server.
Create a Model to Write Data
1. Create the model models/sample_data.sql:
{{ config(materialized='table') }}
SELECT
    id,
    name
FROM (
    VALUES
        (1, 'Alice'),
        (2, 'Bob')
) AS t(id, name)
2. Run the model:
dbt run --select sample_data
Create a Model to Read Data
1. Create the model models/read_data.sql:
{{ config(materialized='table') }}
SELECT
    id,
    name,
    LENGTH(name) AS name_length
FROM {{ ref('sample_data') }}
2. Run the model:
dbt run --select read_data
Verify Results
1. Monitor the job in the Ilum UI:
- Access the Ilum UI (URL provided in your Ilum setup, e.g. via port-forward).
- Navigate to the Jobs section.
- Look for the job named ilum-sql-spark-engine.
- Check job status, logs, and execution details to confirm successful processing.
2. Query with Beeline:
beeline -u "jdbc:hive2://localhost:10009/default"
3. Run the query:
SELECT * FROM default.read_data;
Expected output:
+----+-------+-------------+
| id | name  | name_length |
+----+-------+-------------+
| 1  | Alice | 5           |
| 2  | Bob   | 3           |
+----+-------+-------------+
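The expected rows can be cross-checked by replaying the two models' logic in plain Python. This is only a sketch of what the SQL computes, not how dbt executes it:

```python
# models/sample_data.sql materializes two literal rows.
sample_data = [(1, "Alice"), (2, "Bob")]

# models/read_data.sql selects from {{ ref('sample_data') }} and adds LENGTH(name).
read_data = [(id_, name, len(name)) for id_, name in sample_data]

for row in read_data:
    print(row)
# → (1, 'Alice', 5)
# → (2, 'Bob', 3)
```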
Method 2: Spark Connect
Spark Connect is the recommended way for modern data engineering teams to run dbt on Kubernetes. It eliminates the need for a heavy intermediate Thrift Server, reducing costs and complexity.
Step 1: Deploy Spark Connect Job
1. Log into the Ilum UI.
2. Navigate to the Workloads → Jobs section.
3. Click the "New Job" button.
4. Configure the job:
- Name: spark-connect-dbt
- Job Type: select Spark Connect Job
5. Add the Spark Connect dependency (if needed):
Most Spark distributions don't include Spark Connect by default, so add it as a package dependency:
- Click the Configuration tab.
- In the Parameters section, click Add Parameter.
- Add the following parameter:
| Key | Value |
|---|---|
| spark.jars.packages | org.apache.spark:spark-connect_2.12:3.5.7 |
Note: Replace 2.12 with your Scala version and 3.5.7 with your Spark version to match your environment.
6. Click Submit.
The server has started successfully when you see this in the logs:
Spark Connect server started at: 0:0:0:0:0:0:0:0%0:15002
Connecting to the Spark Connect Server
Get the Connection URL
After the job starts, Ilum provides a Spark Connect URL on the job details page.
The URL format is: sc://job-xxxxx-driver-svc:15002
Port-Forward for Local Access
To connect from your local machine, forward the driver pod's port:
1. Find the driver pod name from the Logs tab in the Ilum UI.
Example: if the URL is sc://job-20250807-1557-ablr2a52vxd-driver-svc:15002, the pod name is job-20250807-1557-ablr2a52vxd-driver (remove the -svc suffix).
2. Port-forward:
kubectl port-forward <driver-pod-name> 15002:15002
Keep this terminal window open.
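The service-to-pod name mapping described above can be scripted. The helper below is an illustrative sketch (not part of Ilum or dbt) that derives the pod name to pass to kubectl port-forward:

```python
from urllib.parse import urlparse

def driver_pod_name(connect_url: str) -> str:
    """Derive the driver pod name from an Ilum Spark Connect URL.

    The URL targets the driver *service*; the pod name is the same
    hostname without the trailing '-svc'.
    """
    host = urlparse(connect_url).hostname
    if not host or not host.endswith("-svc"):
        raise ValueError(f"unexpected Spark Connect URL: {connect_url!r}")
    return host[: -len("-svc")]

url = "sc://job-20250807-1557-ablr2a52vxd-driver-svc:15002"
print(driver_pod_name(url))
# → job-20250807-1557-ablr2a52vxd-driver
```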
Create dbt Project
Initialize a dbt project (if needed):
dbt init ilum_dbt_spark_connect_project
Answer the setup prompts:
Which database? 1 (spark)
host: localhost
Desired authentication method: 4 (session) # or 3 if "session" is not listed
port: 15002
schema: default
threads: 1
This creates the ilum_dbt_spark_connect_project directory and updates ~/.dbt/profiles.yml.
Configure dbt for Spark Connect
If you followed the Thrift setup above, your ~/.dbt/profiles.yml already has both targets configured. You can use the same ilum_dbt_project profile.
To use Spark Connect, simply specify the target:
cd ilum_dbt_project # Use the same project as Thrift
dbt debug --target spark_connect
dbt run --target spark_connect
Or create a separate project (if you prefer isolation):
Edit ~/.dbt/profiles.yml:
ilum_dbt_spark_connect_project:
target: dev
outputs:
dev:
type: spark
method: session
host: localhost
port: 15002
schema: default
threads: 1
Test the connection:
cd ..
cd ilum_dbt_spark_connect_project
dbt debug
Recommended approach: Use one dbt project with multiple targets (as shown in the Thrift section). This allows you to switch between Thrift and Spark Connect without maintaining separate projects.
You should see successful connection messages.
Create a Model to Write Data
1. Create the model models/sample_data_connect.sql:
{{ config(materialized='table') }}
SELECT
    id,
    name
FROM (
    VALUES
        (1, 'Peter'),
        (2, 'John')
) AS t(id, name)
2. Run the model:
dbt run --select sample_data_connect --target spark_connect
Create a Model to Read Data
1. Create the model models/read_data_connect.sql:
{{ config(materialized='table') }}
SELECT
    id,
    name,
    LENGTH(name) AS name_length
FROM {{ ref('sample_data_connect') }}
2. Run the model:
dbt run --select read_data_connect --target spark_connect
Note: The --target spark_connect flag ensures dbt uses the Spark Connect configuration instead of the default Thrift target.
Verify Results
1. Monitor the job in the Ilum UI:
- Access the Ilum UI (URL provided in your Ilum setup, e.g. via port-forward).
- Navigate to the Jobs section.
- Look for the job named spark-connect-dbt.
- Check job status, logs, and execution details to confirm successful processing.
2. Print data in a dbt job:
To verify the data landed in the Spark warehouse (e.g. spark-warehouse/read_data_connect relative to your project directory), create a dbt macro and run a custom operation that queries and prints the read_data_connect table's contents.
Create macros/print_table.sql in your dbt project directory:
{% macro print_table(table_name) %}
  {% set query %}
    SELECT * FROM {{ ref(table_name) }}
  {% endset %}
  {% do log('Printing table contents for ' ~ table_name ~ ':', True) %}
  {% set results = run_query(query) %}
  {% if results %}
    {% for row in results %}
      {% do log(row, True) %}
    {% endfor %}
  {% else %}
    {% do log('No data found in ' ~ table_name, True) %}
  {% endif %}
{% endmacro %}
Run the macro to print the read_data_connect table after your dbt models have run:
dbt run-operation print_table --args '{"table_name": "read_data_connect"}'
The dbt run-operation command executes the macro, querying the read_data_connect table and logging its contents. Expected output in the dbt logs or console:
Printing table contents for read_data_connect:
<agate.Row: (2, 'John', 4)>
<agate.Row: (1, 'Peter', 5)>
Note: The output appears in the dbt logs or console by default in dbt 1.9.4. For more detailed logs, you can use:
dbt run-operation print_table --args '{"table_name": "read_data_connect"}' --log-level debug
Troubleshooting dbt-spark Connections
Common issues when connecting dbt to Spark on Kubernetes:
Error: "ThriftTransportException: Could not connect to localhost:10009"
Cause: The port forwarding tunnel is down or the Thrift Server pod is not running. Solution:
- Check if the Thrift pod is running:
kubectl get pods -l app.kubernetes.io/name=ilum-sql
- Restart port-forwarding:
kubectl port-forward svc/ilum-sql-thrift-binary 10009:10009
Error: "grpc._channel._InactiveRpcError: failed to connect to all addresses"
Cause: Your local dbt client cannot reach the Spark Connect gRPC port (15002). Solution:
- Ensure you have port-forwarded the Driver Pod, not the Service (unless using NodePort).
- Verify you are using method: session in profiles.yml.
Error: "AnalysisException: Table or view not found"
Cause: Hive Metastore connectivity issue. Solution:
- Ensure ilum-core.metastore.enabled=true was set during the Helm install.
- Check if the schema (database) exists in Spark:
spark.sql("SHOW DATABASES").show()
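The first two errors above usually come down to a dead port-forward tunnel. A quick standard-library check (a sketch, using the ports from this guide) tells you which forward to restart before digging into dbt configuration:

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# The two locally forwarded ports used in this guide.
for name, port in [("Thrift", 10009), ("Spark Connect", 15002)]:
    state = "reachable" if port_open("localhost", port) else "NOT reachable"
    print(f"{name} port {port}: {state}")
```

A "NOT reachable" result means the corresponding kubectl port-forward process has exited and needs to be restarted.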
Orchestration
For production orchestration using Apache Airflow, see the dedicated guide: Orchestrate dbt with Airflow