SQL Viewer
Ilum's SQL Viewer lets you run SQL queries directly in the UI, without writing any Scala or Python code. Designed for simplicity, it offers an intuitive interface for running SQL queries, exploring data, and gaining insights quickly and efficiently.
Whether you're prototyping, analyzing data, or debugging, the SQL Viewer allows you to focus solely on the query logic without managing the underlying code.
It is highly configurable through the UI or Helm deployment values, allowing you, for example, to choose a different table format such as Delta Lake, Apache Hudi, or Apache Iceberg.
How will it help you?
The SQL Viewer is a powerful tool for reporting and debugging during Spark application development. Instead of building an entire Spark SQL program to query your tables, you can submit SQL statements directly within Ilum's interface.
For debugging, the SQL Viewer is invaluable: it eliminates the need to repeatedly write, compile, and submit code like:
val dataset = spark.sql("select ...")
Instead, you can interactively test SQL statements without re-creating sessions each time.
Beyond query results, SQL Viewer offers data exploration and visualization tools, along with logs and execution statistics, giving you deeper insights into the query process.
With integration into the Apache Hive Metastore, SQL Viewer provides an accessible, shared environment for running SQL queries—perfectly suited for collaboration across Ilum components.
Deployment
To use the SQL Viewer, you need to deploy Ilum with both SQL Viewer and Hive Metastore enabled. For setup instructions, refer to the Production page.
Once configured, SQL Viewer should be accessible from the sidebar and connected to the Hive Metastore.
Get started with the SQL Viewer
- Create a sample table
CREATE OR REPLACE TABLE my_health_data (
    timestamp TIMESTAMP,
    heartrate INT,
    handmotion INT,
    sleepphase VARCHAR(20)
);
- Insert values
INSERT INTO my_health_data (timestamp, heartrate, handmotion, sleepphase)
VALUES (CAST('2024-10-01 00:00:00' AS TIMESTAMP), 70, 1, 'awake'),
(CAST('2024-10-01 01:00:00' AS TIMESTAMP), 68, 0, 'light'),
(CAST('2024-10-01 02:00:00' AS TIMESTAMP), 65, 0, 'deep'),
(CAST('2024-10-01 03:00:00' AS TIMESTAMP), 64, 1, 'deep'),
(CAST('2024-10-01 04:00:00' AS TIMESTAMP), 66, 0, 'light'),
(CAST('2024-10-01 05:00:00' AS TIMESTAMP), 72, 1, 'awake'),
...
- Run a SELECT query
SELECT sleepphase, avg(heartrate)
FROM my_health_data
GROUP BY sleepphase
- Observe tables and views in the "resources" tab
- See the results of a query
- Use the data exploration tool
- Examine logs and statistics
Statistics provide information about time spent on query execution and may be useful for query optimization.
Logs can be useful for query debugging and general problem detection.
Tips and Good practices
Using external SQL clients with JDBC
While the SQL Viewer works purely in the browser, you can also connect external environments that support Hive/Spark JDBC drivers to Ilum SQL. This way, you can run queries on your Ilum data from your favorite SQL client or BI tool.
To do this, set up a JDBC connection with the following connection string:
jdbc:hive2://<ilum-sql-host>:<thrift-binary-port>/<database-name>;?spark.ilum.sql.cluster.id=<cluster-id>
Where:
- <ilum-sql-host> is the hostname of your Ilum SQL service component (e.g., ilum-sql-headless or ilum-sql-thrift-binary)
- <thrift-binary-port> is the port on which the Thrift service is exposed (default is 10009)
- <database-name> is the name of the database you want to connect to (put default there)
- <cluster-id> is the ID of the Spark cluster you want to connect to (you can find it in the Ilum UI)
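For example, with the default service name and Thrift port, a connection string could look like the one below (the values are only illustrative; adjust them to your deployment and keep your own cluster ID):
jdbc:hive2://ilum-sql-headless:10009/default;?spark.ilum.sql.cluster.id=<your-cluster-id>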
Using the SQL Viewer on datasets from other Ilum components
The SQL Viewer creates tables in the Hive Metastore by default, which means that any Spark job that uses the same metastore will be able to see these tables.
A job that uses the metastore normally has the following configuration:
spark.sql.catalogImplementation=hive
spark.hadoop.hive.metastore.uris=thrift://<hive-metastore-host>:<hive-metastore-port>
# storage location for the metastore (e.g., s3)
spark.hadoop.fs.s3a.access.key=<access-key>
spark.hadoop.fs.s3a.secret.key=<secret-key>
spark.hadoop.fs.s3a.endpoint=<s3-endpoint>
spark.hadoop.fs.s3a.path.style.access=true
spark.hadoop.fs.s3a.fast.upload=true
Note that the URI of the metastore is provided and a shared storage location is used.

An example of a component that also uses the metastore is the Ilum Table Explorer.
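For example, any job configured this way can immediately query a table created in the SQL Viewer. A minimal sketch, assuming the my_health_data table from the walkthrough above was created in the default database:
-- The table is visible to every Spark job sharing the same metastore and storage
SELECT * FROM default.my_health_data;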
Make use of Spark's optimization features
For example, partitioning and bucketing can be declared when creating a table:
CREATE TABLE page_view
(
viewTime INT,
userid BIGINT,
page_url STRING,
referrer_url STRING,
friends ARRAY<BIGINT>,
properties MAP<STRING, STRING>
) COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
STORED AS SEQUENCEFILE;
When tables are partitioned, the Hive Metastore lets Spark filter out irrelevant partitions during query planning, which speeds up query processing. Similarly, different storage formats can be set (see below).
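For instance, a query that filters on the partition columns of the table above only needs to read the matching partitions; a minimal sketch (the filter values are hypothetical):
-- Only the dt='2024-10-01' / country='US' partitions are scanned
SELECT page_url, count(*) AS views
FROM page_view
WHERE dt = '2024-10-01'
  AND country = 'US'
GROUP BY page_url;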
Save, import, and export SQL queries
The SQL Viewer allows you to save your queries for later use, as well as import and export them.
Make use of more advanced data formats: Delta Tables, Apache Hudi, and Apache Iceberg
Delta Tables
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads. It uses Delta Tables as its core format, combining Parquet files with transaction logs. These logs keep track of table versions, capture DML operations, and handle concurrency with locks.
In simple terms, each operation on a partition creates a new version while preserving the previous one, so you can easily roll back if needed. The transaction log tracks these versions, manages locking, and ensures data consistency.
Delta Lake is straightforward to use and an excellent choice for adding these capabilities to your Ilum environment.
For more details on Delta Lake's concurrency management with OCC (Optimistic Concurrency Control), and how it integrates streaming and batch writes, we recommend visiting the Delta Lake Documentation.
Features:
- ACID transaction guarantees
- support for update, delete, and merge operations
- schema evolution: altering tables
- versioning: the ability to time-travel to previous versions of the dataset
- better optimization than traditional formats
- all of the above for both streaming and batch workloads
How to use Delta Tables?
By default, Delta Tables are enabled inside Ilum's SQL Viewer. The following configurations are set for you:
spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.databricks.delta.catalog.update.enabled=true
Therefore, you can make use of Delta Tables inside your SQL Viewer without any extra steps.
- Create a Delta Table
CREATE TABLE my_health_data_delta
(
timestamp TIMESTAMP,
heartrate INT,
handmotion INT,
sleepphase VARCHAR(20)
) USING DELTA;
- Run some DML operations
INSERT INTO my_health_data_delta (timestamp, heartrate, handmotion, sleepphase) VALUES
(CAST('2024-10-01 00:00:00' AS TIMESTAMP), 70, 1, 'awake'),
(CAST('2024-10-01 01:00:00' AS TIMESTAMP), 68, 0, 'light'),
(CAST('2024-10-01 02:00:00' AS TIMESTAMP), 65, 0, 'deep'),
(CAST('2024-10-01 03:00:00' AS TIMESTAMP), 64, 1, 'deep'),
(CAST('2024-10-01 04:00:00' AS TIMESTAMP), 66, 0, 'light');

DELETE FROM my_health_data_delta
WHERE timestamp = '2024-10-01 02:00:00';

UPDATE my_health_data_delta
SET heartrate = 50
WHERE timestamp = '2024-10-01 03:00:00';
- Look at the version history and details
DESCRIBE HISTORY my_health_data_delta
- Use time and version travel functionality
SELECT *
FROM my_health_data_delta VERSION AS OF 1
Or to travel to a specific date (e.g., one day back):
SELECT *
FROM my_health_data_delta TIMESTAMP AS OF date_sub(current_date(), 1)
- Cleaning up
While Delta Tables are optimized for performance, they can accumulate a large number of files over time. The VACUUM operation helps to clean up these files and optimize the table. You can read about it here.
VACUUM my_health_data_delta
Or to see which files will be deleted (up to 1000):
VACUUM my_health_data_delta DRY RUN
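The features list above also mentions merge support. A minimal sketch of an upsert using standard Delta MERGE syntax, assuming the my_health_data table from the walkthrough still exists and has the same columns:
-- Upsert rows from my_health_data into the Delta table, matching on timestamp
MERGE INTO my_health_data_delta AS target
USING my_health_data AS updates
ON target.timestamp = updates.timestamp
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;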
Apache Hudi
While Apache Hudi is similar to Delta Lake, it has its unique advantages. In Apache Hudi, each partition is organized into file groups. Each file group consists of slices, which contain data files and accompanying log files. The log files record actions and affected data, allowing Hudi to optimize read operations by applying these actions to the base file to produce the latest view of the data.
To keep reads fast, configuring compaction operations is essential. Compaction creates a new slice with an updated base file, improving read performance.
This structure, along with Hudi's extensive optimization, enhances write efficiency, making it faster than alternative formats in certain scenarios. Furthermore, Hudi supports NBCC (Non-blocking Concurrency Control) instead of OCC (Optimistic Concurrency Control), which is more effective in environments with high concurrent writes.
In summary, Apache Hudi is well-suited for environments with heavy concurrent write operations and offers more control over custom optimizations, though it may be less user-friendly than Delta Lake.
To learn more about it, visit the Apache Hudi Documentation Page.
How to use Apache Hudi?
Unlike Delta Lake, Apache Hudi is not preconfigured; therefore, you will need to change the cluster configuration and add the following properties:
{
"spark.jars.packages": "org.apache.hudi:hudi-spark3.5-bundle_2.12:0.15.0",
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.sql.catalog.spark_catalog": "org.apache.spark.sql.hudi.catalog.HoodieCatalog",
"spark.sql.extensions": "org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
"spark.kryo.registrator": "org.apache.spark.HoodieSparkKryoRegistrar"
}
Make sure that the version of the Hudi Spark bundle jar matches the version of Spark you're using.
Additionally, since the jar package is not preinstalled, you may face the issue of JVM Heap Out Of Memory during engine initialization. To read more about this issue and how to fix it, please refer to the Troubleshooting section.
- Create a Hudi Table
CREATE TABLE my_sales_data_hudi (
sale_id STRING,
sale_date TIMESTAMP,
product_id STRING,
quantity INT,
price DECIMAL(10, 2)
) USING HUDI
TBLPROPERTIES (
type = 'mor',
primaryKey = 'sale_id'
);
Notice the type = 'mor' property, which stands for Merge on Read. This table type balances optimization between read and write operations.
- Perform DML operations
INSERT INTO my_sales_data_hudi (sale_id, sale_date, product_id, quantity, price) VALUES
('S001', CAST('2024-10-01 10:00:00' AS TIMESTAMP), 'P001', 10, 99.99),
('S002', CAST('2024-10-01 11:00:00' AS TIMESTAMP), 'P002', 5, 49.99),
('S003', CAST('2024-10-01 12:00:00' AS TIMESTAMP), 'P003', 20, 19.99),
('S004', CAST('2024-10-01 13:00:00' AS TIMESTAMP), 'P004', 15, 29.99),
('S005', CAST('2024-10-01 14:00:00' AS TIMESTAMP), 'P005', 8, 39.99);

UPDATE my_sales_data_hudi
SET price = 89.99
WHERE sale_id = 'S001';

DELETE FROM my_sales_data_hudi
WHERE sale_id = 'S003';
- List the commits
CALL show_commits (table => 'my_sales_data_hudi', limit => 5)
Notice the commit_time column, which shows the time of the commit. We can use this value to perform time travel.
- Use the time travel functionality
CALL rollback_to_instant(table => 'my_sales_data_hudi', instant_time => '<commit-time-from-last-step>');
This command will revert the table to the state it was in at the specified commit time. Keep in mind that you can only roll back the latest commit, so if you want to go back further, you need to perform multiple rollbacks.
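As mentioned earlier, Merge on Read tables benefit from periodic compaction. A minimal sketch using Hudi's Spark SQL procedures, assuming the run_compaction procedure (with op and table parameters) is available in the Hudi bundle configured above:
-- Schedule pending compactions, then execute them
CALL run_compaction(op => 'schedule', table => 'my_sales_data_hudi');
CALL run_compaction(op => 'run', table => 'my_sales_data_hudi');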
Apache Iceberg
Apache Iceberg is a high-performance table format specifically designed for data lakes, offering unique capabilities compared to other formats. Built around a distinct architecture, Iceberg uses snapshots to manage table states, avoiding reliance on traditional transaction logs. This snapshot-based design captures the entire state of a table at specific points in time.
Each snapshot holds a manifest list, which in turn references multiple manifests. These manifests organize pointers to specific data files and hold relevant metadata, enabling Iceberg to efficiently track changes without duplicating data files across snapshots. This approach optimizes storage, as snapshots can reuse existing data files, reducing redundancy.
Iceberg also provides powerful features like branching and tagging, allowing users to create branches of tables and assign human-readable tags to snapshots. These features are essential for version control, enabling teams to manage concurrent updates and test changes before committing them to production.
With flexible data organization options and robust versioning capabilities, Apache Iceberg enables scalable, performant data management for modern data lakes.
More on Apache Iceberg Documentation
How to use Apache Iceberg?
Unlike Delta Lake, Apache Iceberg is not preconfigured; therefore, you will need to change the cluster configuration and add the following properties:
{
"spark.jars.packages": "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1",
"spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
"spark.sql.catalog.spark_catalog": "org.apache.iceberg.spark.SparkSessionCatalog",
"spark.sql.catalog.spark_catalog.type": "hive",
"spark.sql.catalog.spark_catalog.uri": "thrift://ilum-hive-metastore:9083"
}
Make sure that the version of the Iceberg runtime jar matches the version of Spark you're using and that the Hive Metastore is available at the specified URI.
Additionally, since the jar package is not preinstalled, you may face the issue of JVM Heap Out Of Memory during engine initialization. To read more about this issue and how to fix it, please refer to the Troubleshooting section.
- Create an Iceberg Table
CREATE TABLE weather_stations
(
station_id STRING,
reading_time TIMESTAMP,
temperature DECIMAL(4,1),
humidity INT,
wind_speed DECIMAL(4,1)
) USING iceberg
- Insert data into the table
INSERT INTO weather_stations VALUES
('WS001', CAST('2024-03-01 08:00:00' AS TIMESTAMP), 15.5, 65, 12.3),
('WS001', CAST('2024-03-01 09:00:00' AS TIMESTAMP), 17.2, 62, 10.5),
('WS001', CAST('2024-03-01 10:00:00' AS TIMESTAMP), 19.8, 58, 11.7),
('WS002', CAST('2024-03-01 08:00:00' AS TIMESTAMP), 14.2, 70, 8.9),
('WS002', CAST('2024-03-01 09:00:00' AS TIMESTAMP), 16.0, 68, 9.2),
('WS002', CAST('2024-03-01 10:00:00' AS TIMESTAMP), 18.5, 63, 10.1),
('WS003', CAST('2024-03-01 08:00:00' AS TIMESTAMP), 13.7, 72, 15.4),
('WS003', CAST('2024-03-01 09:00:00' AS TIMESTAMP), 15.9, 69, 14.8),
('WS003', CAST('2024-03-01 10:00:00' AS TIMESTAMP), 18.1, 65, 13.2);
- Create a tag from the current snapshot
ALTER TABLE weather_stations CREATE TAG `initial_state`
- Make some modifications to the data
UPDATE weather_stations
SET temperature = 16.5
WHERE station_id = 'WS001'
  AND reading_time = CAST('2024-03-01 08:00:00' AS TIMESTAMP);

DELETE FROM weather_stations WHERE station_id = 'WS002';
- List all the snapshots
SELECT * FROM spark_catalog.default.weather_stations.snapshots
Save the timestamp (committed_at) of the snapshot you want to time travel to for later.
- Get the history
SELECT * FROM spark_catalog.default.weather_stations.history
- Rollback to a specific tag
CALL spark_catalog.system.set_current_snapshot (
table => 'spark_catalog.default.weather_stations',
ref => 'initial_state'
)
- Time travel to a specific snapshot
SELECT * FROM weather_stations TIMESTAMP AS OF <committed_at-saved-earlier>
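Branching, mentioned above, works much like tagging. A minimal sketch, assuming the Iceberg runtime configured earlier (e.g., the 1.6.x bundle) supports branch DDL and reading branches via VERSION AS OF:
-- Create a branch, query it, and drop it again
ALTER TABLE weather_stations CREATE BRANCH `experiments`;
SELECT * FROM weather_stations VERSION AS OF 'experiments';
ALTER TABLE weather_stations DROP BRANCH `experiments`;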
How to use UDFs in the SQL viewer?
UDFs (User Defined Functions) let you define custom functions to use in your queries. They are also supported in the SQL Viewer, allowing you to extend the functionality of your queries.
- Create a class for your UDF
package example
import org.apache.hadoop.hive.ql.exec.UDF
class ScalaUDF extends UDF {
  def evaluate(str: String): Int = {
    str.length()
  }
}
Make sure to include the necessary dependency of hive-exec:3.1.3 in your project:
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>3.1.3</version>
</dependency>
- Make a jar package and put it into your distributed storage
Make sure to remember the path to your jar file.
- Add it to the SQL Viewer Spark session
ADD JAR '<the-path-to-your-jar-file>'
- Create a function linked to the UDF you defined
CREATE OR REPLACE FUNCTION ScalaUDF AS 'example.ScalaUDF'
- Use it in a query
SELECT name, ScalaUDF(name)
FROM newtablename
Troubleshooting
It is useful to know that when executing SQL queries from the SQL Viewer page, the SQL execution engines are visible as normal Ilum Jobs. This means that you can monitor their state, check logs and statistics, just like any other job in the "Jobs" tab.
JVM Heap Out Of Memory during spark-submit
If you find an error like this in the engine's launch logs:
Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Java heap space
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult (OperationSupport.java:520)
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:535)
...
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1129)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.OutOfMemoryError: Java heap space
...
It is likely caused by the default JVM heap size of Ilum's internal spark-submit being too small for your task.
To fix this, you can switch to the External Spark Submit. This way, the spark-submit process will not be constrained by the Ilum pod's resources.