Power BI Integration with Apache Spark

Architecture Overview

Power BI can connect directly to distributed datasets managed by the ilum platform. This integration uses the Hive Thrift Protocol exposed by Ilum SQL (Apache Kyuubi), the same gateway used by Tableau.

The data flow is as follows:

Power BI Desktop connects using a JDBC or ODBC driver.
The connection is routed to the Ilum SQL service (ilum-sql-thrift-binary), which acts as the SQL gateway.
Ilum SQL manages a Spark Session for the connected user.
SQL queries generated by Power BI are executed as Spark Jobs on the cluster.
Spark Executors retrieve data from the underlying storage (S3, HDFS, GCS) using metadata from the Hive Metastore.

Power BI supports both JDBC and ODBC connection methods. ODBC is the recommended approach for Power BI, as it has native ODBC support and better compatibility than JDBC custom connectors.

Server-Side Configuration

The server-side setup is identical to Tableau. You must enable Ilum SQL and expose the Thrift service.

Enable Ilum SQL and Metastore

Ensure that ilum-sql (Kyuubi) and hive-metastore are enabled in your Helm release:

helm upgrade ilum ilum/ilum \
    --set ilum-hive-metastore.enabled=true \
    --set ilum-core.metastore.enabled=true \
    --set ilum-kyuubi.enabled=true \
    --reuse-values

Network Exposure Strategy

Power BI requires TCP access to the Ilum SQL Thrift interface on port 10009.

Production: LoadBalancer

For production environments, expose the Thrift binary service via a cloud LoadBalancer:

helm upgrade ilum ilum/ilum \
    --set ilum-kyuubi.services.thriftBinary.type="LoadBalancer" \
    --reuse-values

Verify the external IP assignment:

kubectl get service ilum-sql-thrift-binary

Development: Port Forwarding

For local development or testing without an external LoadBalancer, tunnel the connection using kubectl:

kubectl port-forward svc/ilum-sql-thrift-binary 10009:10009

Client-Side Configuration: Power BI Desktop

ODBC Connection (Recommended)

ODBC is the recommended connection method for Power BI, as it provides native integration without custom connector development.

1. Install the Simba Spark ODBC Driver

Download and install the Simba Spark ODBC Driver from the official download page. Choose the Windows 64-bit installer.

2. Configure an ODBC Data Source (DSN)

Open ODBC Data Source Administrator (64-bit) on Windows (search for "ODBC Data Sources" in the Start menu).
Under the System DSN or User DSN tab, click Add.
Select Simba Spark ODBC Driver and click Finish.
Configure the connection parameters:

Parameter	Value
Data Source Name	`Ilum Spark`
Host	LoadBalancer IP or `localhost` (if port-forwarding)
Port	`10009`
Authentication Mechanism	`No Authentication` for dev, or `Username and Password` for Ilum Internal Auth
Thrift Transport	`Binary`
Schema	`default`

Click Test to verify the connection, then OK to save.

3. Connect from Power BI Desktop

Open Power BI Desktop.
Click Get Data > More... > Other > ODBC.
Select the DSN you created (Ilum Spark) from the dropdown.
If authentication is enabled, enter your Ilum username and password when prompted.
The Navigator window will display available schemas and tables.

JDBC Connection (via Custom Connector)

JDBC connections in Power BI require a custom connector or a gateway with JDBC support. This method is more complex than ODBC and is typically used only when ODBC is not an option.

1. Download the JDBC Driver

Download the Kyuubi Hive JDBC Shaded JAR from Maven Central.

2. Configure the JDBC Connection

In Power BI Desktop, click Get Data > More... > Other > JDBC.
Enter the connection string:

jdbc:kyuubi://<HOST>:10009/default

Local Port-Forward: jdbc:kyuubi://localhost:10009/default
LoadBalancer: jdbc:kyuubi://<EXTERNAL_IP>:10009/default

Configure authentication as required by your ilum deployment.

note

Power BI's native JDBC support is limited. For production use, prefer the ODBC connection method or use the On-premises data gateway with JDBC support.

DirectQuery vs Import Mode

Power BI offers two data connectivity modes, each with distinct trade-offs when used with Spark.

Import Mode

Data is loaded from Spark into the Power BI in-memory engine (VertiPaq) at the time of refresh.

Advantages: Fast interactive performance after import; all Power BI features available; no dependency on cluster availability during report viewing.
Disadvantages: Data is only as fresh as the last refresh; large datasets may exceed Power BI memory limits.
Best for: Dashboards viewed frequently, datasets that fit within Power BI's memory constraints, and reports where real-time freshness is not critical.

DirectQuery

Every interaction in the report generates a live SQL query sent to Spark via Ilum SQL.

Advantages: Always shows the latest data; no data size limitations in Power BI.
Disadvantages: Higher latency per interaction (Spark query execution time); requires the ilum cluster to be available; some Power BI features may be restricted.
Best for: Very large datasets, real-time monitoring dashboards, and scenarios where data freshness is critical.

tip

For most analytical workloads, Import Mode with a scheduled refresh provides the best balance of performance and data freshness. Use DirectQuery only when you need real-time access to data that changes frequently.

Power BI Service (Cloud)

To publish reports connected to your ilum cluster to the Power BI Service (cloud), you need an On-premises data gateway to bridge the connection.

Gateway Setup

Download and install the On-premises data gateway on a machine that has network access to the ilum cluster (either directly or through a VPN).
Sign in with your Power BI Service account and register the gateway.
In the gateway configuration, add a new data source:
- Data Source Type: ODBC
- Connection String: DSN=Ilum Spark (the DSN configured on the gateway machine)
- Authentication: Configure credentials matching your ilum deployment.

Scheduled Refresh

Once the gateway is configured:

Publish your Power BI report to the Power BI Service.
Navigate to the dataset settings in the Power BI Service.
Under Gateway connection, select your on-premises gateway and map the data source.
Under Scheduled refresh, configure the refresh frequency (e.g., daily, hourly).

warning

Ensure the gateway machine maintains a stable network connection to the ilum cluster. If the gateway goes offline, scheduled refreshes and DirectQuery reports will fail.

Usage Workflow

Selecting Schema and Tables

After connecting, the Power BI Navigator displays available catalogs and schemas. Select spark_catalog and the target database (e.g., default). Ilum supports catalog federation, so tables from Iceberg, Delta Lake, and Hudi catalogs may all be visible.

Building Visualizations

Select the tables you want to analyze and click Load (Import Mode) or DirectQuery.
In the report canvas, drag fields from the Fields pane to build visualizations.
Columns are mapped from Spark SQL data types to Power BI types automatically.

Custom SQL for Optimized Queries

For complex analytics, use a custom SQL query instead of loading full tables:

In the Get Data dialog, after connecting, click Advanced options.
Enter an optimized SQL query that pre-aggregates or filters data on the Spark side:

SELECT
    date_trunc('day', event_time) AS event_day,
    category,
    COUNT(*) AS event_count,
    SUM(amount) AS total_amount
FROM spark_catalog.default.events
WHERE event_time >= current_date - INTERVAL 30 DAYS
GROUP BY 1, 2

This pushes the computational load to the ilum cluster, reducing the volume of data transferred to Power BI.

Troubleshooting

Driver Not Found

Symptom: Power BI reports "The ODBC driver is not installed" or fails to list the DSN.

Resolution:

Verify the Simba Spark ODBC driver is installed using the 64-bit installer (Power BI Desktop is 64-bit).
Confirm the DSN was created in ODBC Data Source Administrator (64-bit), not the 32-bit version.
Restart Power BI Desktop after driver installation.

Connection Timeout

Symptom: Connection hangs or times out during the initial connection or query execution.

Resolution:

Check network connectivity to the LoadBalancer IP on port 10009 using telnet or Test-NetConnection in PowerShell.
If using port-forward, ensure the terminal session running kubectl port-forward is still active.
Spark sessions may take time to initialize on the first connection. Increase the connection timeout in the ODBC DSN configuration (default is 60 seconds).
For large queries, increase the query timeout in the DSN settings or optimize the query to reduce execution time.

Authentication Errors

Symptom: "Authentication failed" or "Invalid credentials" error.

Resolution:

If ilum is configured with Internal Auth, verify you are using valid ilum credentials.
If No Authentication is configured (development mode), use anonymous as the username and leave the password blank.
Ensure the authentication mechanism in the ODBC DSN matches the ilum cluster configuration.

Architecture Overview​

Server-Side Configuration​

Enable Ilum SQL and Metastore​

Network Exposure Strategy​

Production: LoadBalancer​

Development: Port Forwarding​

Client-Side Configuration: Power BI Desktop​

ODBC Connection (Recommended)​

1. Install the Simba Spark ODBC Driver​

2. Configure an ODBC Data Source (DSN)​

3. Connect from Power BI Desktop​

JDBC Connection (via Custom Connector)​

1. Download the JDBC Driver​

2. Configure the JDBC Connection​

DirectQuery vs Import Mode​

Import Mode​

DirectQuery​

Power BI Service (Cloud)​

Gateway Setup​

Scheduled Refresh​

Usage Workflow​

Selecting Schema and Tables​

Building Visualizations​

Custom SQL for Optimized Queries​

Troubleshooting​

Driver Not Found​

Connection Timeout​

Authentication Errors​

Architecture Overview

Server-Side Configuration

Enable Ilum SQL and Metastore

Network Exposure Strategy

Production: LoadBalancer

Development: Port Forwarding

Client-Side Configuration: Power BI Desktop

ODBC Connection (Recommended)

1. Install the Simba Spark ODBC Driver

2. Configure an ODBC Data Source (DSN)

3. Connect from Power BI Desktop

JDBC Connection (via Custom Connector)

1. Download the JDBC Driver

2. Configure the JDBC Connection

DirectQuery vs Import Mode

Import Mode

DirectQuery

Power BI Service (Cloud)

Gateway Setup

Scheduled Refresh

Usage Workflow

Selecting Schema and Tables

Building Visualizations

Custom SQL for Optimized Queries

Troubleshooting

Driver Not Found

Connection Timeout

Authentication Errors