
SQL Viewer

What is it?

SQL Viewer is a tool that allows you to run Spark SQL queries right in the Ilum UI, without writing any Scala or Python code.

How will it help you?

SQL Viewer is a good choice for reporting. It lets you avoid writing a whole Spark SQL program just to query data from your tables. Instead, it is enough to send the query from a user-friendly interface.

It is also useful for debugging queries during Spark application development. Imagine you are developing a Spark app where you need to write 10 SQL queries like this:

    val dataset = spark.sql("select ...")

To debug each query, you would need to:

  • compile the program
  • submit it to Spark
  • wait until the session is created

This cycle takes a considerable amount of time, so it is often more effective to iterate on the SQL statements themselves.

For this purpose, we create a single Spark session and expose it for your Spark SQL queries. When you submit a query through the SQL Viewer, it gets passed into this session for processing.

Once the query is executed, the results — including the time spent, logs, and the output — are returned to you.

What if we already have some datasets created by our Spark jobs, and we want to run SQL queries on them?

To facilitate this, we leverage the Hive Metastore and configure Ilum Jobs to automatically integrate with it. While it's possible to save a dataset as an SQL table using regular Spark, this approach confines the dataset to the specific Spark session in which it was created.

In contrast, by using the Hive Metastore, the dataset is saved outside of the Spark session, making it accessible in the Kyuubi engine for future queries.
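For example, tables that Ilum jobs have registered in the Hive Metastore can be discovered directly from the SQL Viewer (a minimal sketch):

    -- list the databases and tables registered in the Hive Metastore
    SHOW DATABASES;
    SHOW TABLES;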

In the YAML file and in the sidebar with Spark session configurations, you can configure the Spark session to use other Spark catalogs instead of Hive, such as Iceberg and Hudi.

Get started with SQL Viewer

  1. Create table

    CREATE OR REPLACE TABLE my_health_data (
      timestamp TIMESTAMP,
      heartrate INT,
      handmotion INT,
      sleepphase VARCHAR(20)
    );

  2. Insert values

    INSERT INTO my_health_data (timestamp, heartrate, handmotion, sleepphase) VALUES
      ( CAST('2024-10-01 00:00:00' AS TIMESTAMP), 70, 1, 'awake'),
      ( CAST('2024-10-01 01:00:00' AS TIMESTAMP), 68, 0, 'light'),
      ( CAST('2024-10-01 02:00:00' AS TIMESTAMP), 65, 0, 'deep'),
      ( CAST('2024-10-01 03:00:00' AS TIMESTAMP), 64, 1, 'deep'),
      ( CAST('2024-10-01 04:00:00' AS TIMESTAMP), 66, 0, 'light'),
      ( CAST('2024-10-01 05:00:00' AS TIMESTAMP), 72, 1, 'awake'),
      ...

  3. Run a SELECT query

    SELECT sleepphase, avg(heartrate) FROM my_health_data GROUP BY sleepphase

  4. Truncate table

    TRUNCATE TABLE my_health_data

  5. Drop table

    DROP TABLE my_health_data

  6. Observe tables and views in the 'resources' folder


  7. Observe data in the table explorer


  8. Use the data exploration tool


  9. Observe logs and statistics

Statistics show the time spent on query execution and may be useful for query optimization.


Logs will be useful for query debugging


All operations supported by Spark SQL can be used here.
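For instance, you can inspect a query's execution plan before tuning it (a quick sketch using the example table from the steps above):

    EXPLAIN SELECT sleepphase, avg(heartrate) FROM my_health_data GROUP BY sleepphase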

Tips and Good Practices

1. Use SQL Viewer on datasets from Ilum groups

To do this, you will need to save your datasets using:

    dataset.write
      .mode("overwrite")
      .saveAsTable("newtablename")

After that, you will be able to use it in the SQL Viewer:
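For example, a quick check that the saved dataset is visible (a sketch using the table name newtablename from the snippet above):

    SELECT * FROM newtablename LIMIT 10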


2. Make use of Spark optimization features

For example:

    CREATE TABLE page_view(
      viewTime INT,
      userid BIGINT,
      page_url STRING,
      referrer_url STRING,
      friends ARRAY<BIGINT>,
      properties MAP<STRING, STRING>
    )
    COMMENT 'This is the page view table'
    PARTITIONED BY(dt STRING, country STRING)
    CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
    STORED AS SEQUENCEFILE;

By partitioning tables, the Hive Metastore can filter out non-relevant partitions during query handling, which speeds up query processing. Similarly, different storage formats can be set; see the Spark SQL and Hive documentation for the available options.
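For instance, a query that filters on the partition columns only needs to scan the matching partitions (a sketch against the page_view table above):

    SELECT page_url, count(*) AS views
    FROM page_view
    WHERE dt = '2024-10-01' AND country = 'US'
    GROUP BY page_url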

3. Save, import, and export SQL queries


4. Make use of more advanced data formats: Apache Iceberg, Apache Hudi, and Delta tables

Features:

  • ACID transaction guarantees
  • support for update, delete, and merge operations
  • schema evolution: altering tables
  • versioning: the ability to time travel to previous versions of the dataset
  • better optimization compared to traditional formats
  • these features work for both streaming and batch workloads

What is Delta Lake?

Delta Lake uses Delta tables as its main format. Delta tables are built by combining Parquet data files with a transaction log. The transaction log contains information about the versions of the table, which operation led to each version (update/delete/insert/alter table), and locks for concurrency handling.

To put it simply, when you perform an operation on a partition, a new updated version of it is created while the previous version is kept, so you can roll back at any time. In addition, a transaction log entry is created for version management, locking, and other bookkeeping.

This format is simple to use and a good choice if you are planning to integrate with Databricks.

For more detailed information about how Delta Lake handles concurrency with OCC and how it integrates streaming and batch writes, we strongly recommend visiting the Delta Lake documentation.

How to use Delta Lake?

By default, Delta tables are enabled inside Kyuubi Spark sessions:

    spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
    spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
    spark.databricks.delta.catalog.update.enabled=true

Therefore you can make use of Delta tables inside the SQL Viewer without any extra steps.

1. Create Delta Table:

    CREATE TABLE my_health_data_delta (
      timestamp TIMESTAMP,
      heartrate INT,
      handmotion INT,
      sleepphase VARCHAR(20)
    ) USING DELTA;

2. Apply SQL operations such as UPDATE, DELETE, and INSERT

    INSERT INTO my_health_data_delta (timestamp, heartrate, handmotion, sleepphase) VALUES
      ( CAST('2024-10-01 00:00:00' AS TIMESTAMP), 70, 1, 'awake'),
      ( CAST('2024-10-01 01:00:00' AS TIMESTAMP), 68, 0, 'light'),
      ( CAST('2024-10-01 02:00:00' AS TIMESTAMP), 65, 0, 'deep'),
      ( CAST('2024-10-01 03:00:00' AS TIMESTAMP), 64, 1, 'deep'),
      ( CAST('2024-10-01 04:00:00' AS TIMESTAMP), 66, 0, 'light');

    DELETE FROM my_health_data_delta WHERE timestamp = '2024-10-01 02:00:00';

    UPDATE my_health_data_delta SET heartrate = 50
    WHERE timestamp = '2024-10-01 03:00:00';

3. Look at the version history and details

    DESCRIBE HISTORY my_health_data_delta


4. Time/Version travel

    SELECT * FROM my_health_data_delta VERSION AS OF 1;

    -- 1 day ago
    SELECT * FROM my_health_data_delta TIMESTAMP AS OF date_sub(current_date(), 1)
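If you want to actually revert the table rather than just query an old version, Delta Lake also provides a restore command (a sketch; check the Delta Lake documentation for the versions that support it):

    RESTORE TABLE my_health_data_delta TO VERSION AS OF 1;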

What is Apache Hudi?

Apache Hudi is a tool similar to Delta Lake. However, it has some advantages.

In Apache Hudi, each partition is represented by file groups. Each file group consists of file slices. Each slice contains a base data file and log files. A log file describes an action and also contains the data that was affected by the action.

When reading, Apache Hudi takes the base file and applies the actions from the log files to it.

To optimize this process, you should configure compaction operations, which create a new slice with a new base file.
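In the SQL Viewer, compaction can be scheduled and executed with Hudi's call procedures (a sketch, assuming a merge-on-read table named hudi_table like the one created in the how-to section below; see the Hudi documentation for the full list of procedures):

    -- schedule a compaction plan, then execute it
    CALL run_compaction(op => 'schedule', table => 'hudi_table');
    CALL run_compaction(op => 'run', table => 'hudi_table');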

Such a structure, along with a number of different optimization techniques, makes write operations faster.

Moreover, thanks to this architecture, Hudi can use NBCC (Non-Blocking Concurrency Control) instead of OCC (Optimistic Concurrency Control), which performs better in environments with a large number of concurrent writes.


To summarize, Apache Hudi may be a better option for environments with many concurrent write operations and for more custom optimization, while being less user-friendly than Delta Lake.

To learn more, visit the Apache Hudi documentation.

How to use Apache Hudi?

1. Configure: Unlike Delta Lake, Apache Hudi is not preconfigured, so you will need to configure it in the sidebar of the SQL Viewer.

2. Create Hudi Table:


    CREATE TABLE hudi_table (
      id INT
    ) USING HUDI TBLPROPERTIES (type='mor')
    -- 'mor' means merge-on-read: the architecture with commits and file slices

3. Apply updates, deletes, and inserts, similar to Delta Lake
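For example (a sketch against the hudi_table created above; depending on your Hudi version, you may also want to define a primaryKey in TBLPROPERTIES for updates and deletes):

    INSERT INTO hudi_table VALUES (1), (2), (3);

    UPDATE hudi_table SET id = 10 WHERE id = 1;

    DELETE FROM hudi_table WHERE id = 2;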

4. Look at the commits

    CALL show_commits (table => 'hudi_table', limit => 5)

What is Apache Iceberg?

Apache Iceberg is similar to other data lake table formats but has a distinct architecture. Instead of using commits and transaction log files, Iceberg organizes data using snapshots.

  • Each snapshot captures the state of the table at a specific point in time.
  • A snapshot contains a manifest list, which in turn refers to manifests.
  • Each manifest points to specific data files and contains the corresponding metadata.

This architecture allows different snapshots to reuse data files from previous snapshots, optimizing storage by avoiding data duplication. Additionally, Iceberg offers flexibility in how you organize data files and metadata, enabling custom data layout strategies.

One of Iceberg's key features is its ability to manage branches and tags. Users can create branches of a table and assign them human-readable names for easy reference. This feature is valuable for versioning and managing concurrent table modifications.


More in the Apache Iceberg documentation.

How to use Apache Iceberg?

Apache Iceberg is not preconfigured. To configure it, add the following parameters:

    spark.jars.packages=org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1
    spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
    spark.sql.catalog.spark_catalog.type=hive
    spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
    spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog
    spark.sql.catalog.local.warehouse=s3a://ilum-files/local/warehouse
    spark.sql.catalog.local.type=hive
    spark.sql.catalog.local.uri=thrift://ilum-hive-metastore:9083

Add them in the configuration sidebar:


Click Apply and Restart Engine

1. Create Iceberg Table:


    CREATE TABLE local.db.mytable (
      id INT
    ) USING iceberg

2. Use UPDATE, DELETE, and INSERT operations
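For example (a sketch against the table created above):

    INSERT INTO local.db.mytable VALUES (1), (2), (3);

    UPDATE local.db.mytable SET id = 10 WHERE id = 1;

    DELETE FROM local.db.mytable WHERE id = 2;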

3. Create Branch from current snapshot

    ALTER TABLE local.db.mytable CREATE BRANCH `mybranch`

4. Tag current Snapshot

    ALTER TABLE local.db.mytable CREATE TAG `mytag`

5. List all the snapshots

    SELECT * FROM local.db.mytable.snapshots;
    SELECT * FROM local.db.mytable.history;

6. Rollback

    CALL local.system.set_current_snapshot(table => 'local.db.mytable', ref => 'mytag')
    CALL local.system.set_current_snapshot(table => 'local.db.mytable', ref => 'mybranch')

    CALL local.system.rollback_to_timestamp('local.db.mytable', TIMESTAMP '2024-10-10 12:20:36.48')

5. How to use UDFs in the SQL Viewer?

To use UDFs, you will need to package them into a JAR file, just like your Ilum jobs, and add it to the SQL Viewer Spark session:

  1. Create a class for the UDF

    package example

    import org.apache.hadoop.hive.ql.exec.UDF

    // Hive-style UDF that returns the length of the input string
    class ScalaUDF extends UDF {
      def evaluate(str: String): Int = {
        str.length()
      }
    }

with the dependency:

    <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-exec</artifactId>
      <version>4.0.0</version>
    </dependency>

According to the Spark SQL documentation, you can also create a Spark UDF class and use it in the same way.

  2. Make a JAR package and put it into your distributed storage


  3. Add it to the SQL Viewer Spark session with a simple SQL query:

    ADD JAR 's3a://ilum-files/hms-warehouse/udf.jar'

  4. Create a function linked to the UDF you defined:

    CREATE OR REPLACE FUNCTION ScalaUDF AS 'example.ScalaUDF'

  5. Use it in a query:

    SELECT name, ScalaUDF(name) FROM newtablename