Power BI Integration with Apache Spark
Architecture Overview
Power BI can connect directly to distributed datasets managed by the ilum platform. This integration uses the Hive Thrift Protocol exposed by Ilum SQL (Apache Kyuubi), the same gateway used by Tableau.
The data flow is as follows:
- Power BI Desktop connects using a JDBC or ODBC driver.
- The connection is routed to the Ilum SQL service (
ilum-sql-thrift-binary), which acts as the SQL gateway. - Ilum SQL manages a Spark Session for the connected user.
- SQL queries generated by Power BI are executed as Spark Jobs on the cluster.
- Spark Executors retrieve data from the underlying storage (S3, HDFS, GCS) using metadata from the Hive Metastore.
Power BI supports both JDBC and ODBC connection methods. ODBC is the recommended approach for Power BI, as it has native ODBC support and better compatibility than JDBC custom connectors.
Server-Side Configuration
The server-side setup is identical to Tableau. You must enable Ilum SQL and expose the Thrift service.
Enable Ilum SQL and Metastore
Ensure that ilum-sql (Kyuubi) and hive-metastore are enabled in your Helm release:
helm upgrade ilum ilum/ilum \
--set ilum-hive-metastore.enabled=true \
--set ilum-core.metastore.enabled=true \
--set ilum-kyuubi.enabled=true \
--reuse-values
Network Exposure Strategy
Power BI requires TCP access to the Ilum SQL Thrift interface on port 10009.
Production: LoadBalancer
For production environments, expose the Thrift binary service via a cloud LoadBalancer:
helm upgrade ilum ilum/ilum \
--set ilum-kyuubi.services.thriftBinary.type="LoadBalancer" \
--reuse-values
Verify the external IP assignment:
kubectl get service ilum-sql-thrift-binary
Development: Port Forwarding
For local development or testing without an external LoadBalancer, tunnel the connection using kubectl:
kubectl port-forward svc/ilum-sql-thrift-binary 10009:10009
Client-Side Configuration: Power BI Desktop
ODBC Connection (Recommended)
ODBC is the recommended connection method for Power BI, as it provides native integration without custom connector development.
1. Install the Simba Spark ODBC Driver
Download and install the Simba Spark ODBC Driver from the official download page. Choose the Windows 64-bit installer.
2. Configure an ODBC Data Source (DSN)
- Open ODBC Data Source Administrator (64-bit) on Windows (search for "ODBC Data Sources" in the Start menu).
- Under the System DSN or User DSN tab, click Add.
- Select Simba Spark ODBC Driver and click Finish.
- Configure the connection parameters:
| Parameter | Value |
|---|---|
| Data Source Name | Ilum Spark |
| Host | LoadBalancer IP or localhost (if port-forwarding) |
| Port | 10009 |
| Authentication Mechanism | No Authentication for dev, or Username and Password for Ilum Internal Auth |
| Thrift Transport | Binary |
| Schema | default |
- Click Test to verify the connection, then OK to save.
3. Connect from Power BI Desktop
- Open Power BI Desktop.
- Click Get Data > More... > Other > ODBC.
- Select the DSN you created (
Ilum Spark) from the dropdown. - If authentication is enabled, enter your Ilum username and password when prompted.
- The Navigator window will display available schemas and tables.
JDBC Connection (via Custom Connector)
JDBC connections in Power BI require a custom connector or a gateway with JDBC support. This method is more complex than ODBC and is typically used only when ODBC is not an option.
1. Download the JDBC Driver
Download the Kyuubi Hive JDBC Shaded JAR from Maven Central.
2. Configure the JDBC Connection
- In Power BI Desktop, click Get Data > More... > Other > JDBC.
- Enter the connection string:
jdbc:kyuubi://<HOST>:10009/default
- Local Port-Forward:
jdbc:kyuubi://localhost:10009/default - LoadBalancer:
jdbc:kyuubi://<EXTERNAL_IP>:10009/default
- Configure authentication as required by your ilum deployment.
Power BI's native JDBC support is limited. For production use, prefer the ODBC connection method or use the On-premises data gateway with JDBC support.
DirectQuery vs Import Mode
Power BI offers two data connectivity modes, each with distinct trade-offs when used with Spark.
Import Mode
Data is loaded from Spark into the Power BI in-memory engine (VertiPaq) at the time of refresh.
- Advantages: Fast interactive performance after import; all Power BI features available; no dependency on cluster availability during report viewing.
- Disadvantages: Data is only as fresh as the last refresh; large datasets may exceed Power BI memory limits.
- Best for: Dashboards viewed frequently, datasets that fit within Power BI's memory constraints, and reports where real-time freshness is not critical.
DirectQuery
Every interaction in the report generates a live SQL query sent to Spark via Ilum SQL.
- Advantages: Always shows the latest data; no data size limitations in Power BI.
- Disadvantages: Higher latency per interaction (Spark query execution time); requires the ilum cluster to be available; some Power BI features may be restricted.
- Best for: Very large datasets, real-time monitoring dashboards, and scenarios where data freshness is critical.
For most analytical workloads, Import Mode with a scheduled refresh provides the best balance of performance and data freshness. Use DirectQuery only when you need real-time access to data that changes frequently.
Power BI Service (Cloud)
To publish reports connected to your ilum cluster to the Power BI Service (cloud), you need an On-premises data gateway to bridge the connection.
Gateway Setup
- Download and install the On-premises data gateway on a machine that has network access to the ilum cluster (either directly or through a VPN).
- Sign in with your Power BI Service account and register the gateway.
- In the gateway configuration, add a new data source:
- Data Source Type: ODBC
- Connection String:
DSN=Ilum Spark(the DSN configured on the gateway machine) - Authentication: Configure credentials matching your ilum deployment.
Scheduled Refresh
Once the gateway is configured:
- Publish your Power BI report to the Power BI Service.
- Navigate to the dataset settings in the Power BI Service.
- Under Gateway connection, select your on-premises gateway and map the data source.
- Under Scheduled refresh, configure the refresh frequency (e.g., daily, hourly).
Ensure the gateway machine maintains a stable network connection to the ilum cluster. If the gateway goes offline, scheduled refreshes and DirectQuery reports will fail.
Usage Workflow
Selecting Schema and Tables
After connecting, the Power BI Navigator displays available catalogs and schemas. Select spark_catalog and the target database (e.g., default). Ilum supports catalog federation, so tables from Iceberg, Delta Lake, and Hudi catalogs may all be visible.
Building Visualizations
- Select the tables you want to analyze and click Load (Import Mode) or DirectQuery.
- In the report canvas, drag fields from the Fields pane to build visualizations.
- Columns are mapped from Spark SQL data types to Power BI types automatically.
Custom SQL for Optimized Queries
For complex analytics, use a custom SQL query instead of loading full tables:
- In the Get Data dialog, after connecting, click Advanced options.
- Enter an optimized SQL query that pre-aggregates or filters data on the Spark side:
SELECT
date_trunc('day', event_time) AS event_day,
category,
COUNT(*) AS event_count,
SUM(amount) AS total_amount
FROM spark_catalog.default.events
WHERE event_time >= current_date - INTERVAL 30 DAYS
GROUP BY 1, 2
This pushes the computational load to the ilum cluster, reducing the volume of data transferred to Power BI.
Troubleshooting
Driver Not Found
Symptom: Power BI reports "The ODBC driver is not installed" or fails to list the DSN.
Resolution:
- Verify the Simba Spark ODBC driver is installed using the 64-bit installer (Power BI Desktop is 64-bit).
- Confirm the DSN was created in ODBC Data Source Administrator (64-bit), not the 32-bit version.
- Restart Power BI Desktop after driver installation.
Connection Timeout
Symptom: Connection hangs or times out during the initial connection or query execution.
Resolution:
- Check network connectivity to the LoadBalancer IP on port 10009 using
telnetorTest-NetConnectionin PowerShell. - If using
port-forward, ensure the terminal session runningkubectl port-forwardis still active. - Spark sessions may take time to initialize on the first connection. Increase the connection timeout in the ODBC DSN configuration (default is 60 seconds).
- For large queries, increase the query timeout in the DSN settings or optimize the query to reduce execution time.
Authentication Errors
Symptom: "Authentication failed" or "Invalid credentials" error.
Resolution:
- If ilum is configured with Internal Auth, verify you are using valid ilum credentials.
- If No Authentication is configured (development mode), use
anonymousas the username and leave the password blank. - Ensure the authentication mechanism in the ODBC DSN matches the ilum cluster configuration.