Table Explorer
Overview
Table Explorer is a tool for monitoring the datasets in your applications and their contents. It gives you:
- A list of all databases and tables created via Ilum Jobs, Ilum Groups, or queries in the SQL Viewer
- The schema of each table
- The lineage of each table: the traceable history of transformations and operations applied to it, displayed in the UI
- A Data Exploration tool that lets you create many kinds of charts, apply mathematical functions to the data, apply filters, and more
How to make your datasets visible in Table Explorer?
To see your data in Table Explorer, you need to save your dataset as a table. There are several ways to do that:
- In Spark SQL:
CREATE TABLE target_table AS SELECT col1, col2, col3 FROM source_table;
Or create the table first and insert into it:
CREATE TABLE target_table (col1 TYPE1, col2 TYPE2, col3 TYPE3);
INSERT INTO target_table (col1, col2, col3) VALUES
(value1_1, value1_2, value1_3),
...
- Programmatically in Scala:
df.write
.mode("overwrite")
.format("hive")
.saveAsTable("table_name")
Data Exploration Tool
The Data Exploration Tool allows you to interactively explore and visualize a sample of your dataset (default: 1,000 rows, or a custom value of your choice) through an intuitive user interface. This tool enables users to analyze data efficiently, offering a wide range of customization options for data representation and chart generation.
Customizable Axes for Charts
Select columns for the x-axis (horizontal) and y-axis (vertical) to visualize relationships between variables and quickly configure your chart to represent data with precision and flexibility.
Data Aggregation and Grouping
Aggregate and group data using common statistical functions such as:
- Sum
- Mean
- Median
- Standard Deviation
- Variance
Apply these functions to your data for more insightful analysis.
Filtering Capabilities
Filter your data based on various data types and conditions.
Diverse Data Representation Formats
Choose from 12 different formats to represent your data visually, including bar charts, line charts, scatter plots, and more. In addition, each chart offers many customization options.
Exporting Charts
Export your charts in multiple formats:
- CSV (data)
- SVG (vector graphics)
- PNG (image format)
Insights and Deployment
Spark catalogs
In Spark, SQL operations rely on Spark Catalogs, which manage database and table schemas in memory at runtime. The limitation of the default in-memory catalog is that tables created within it persist only for the duration of the Spark session. Once the session ends, the table definitions are lost.
Hive Catalog and Hive Metastore
The Hive Catalog addresses this limitation by storing table schemas and metadata in a persistent database called the Hive Metastore. This ensures that table definitions are retained across multiple Spark sessions.
To configure Spark to use the Hive Catalog, you typically need to adjust the Spark session settings as follows:
# Make the Spark catalog use the Hive Metastore
spark.sql.catalogImplementation=hive
# URI to Hive Metastore with Thrift protocol
spark.hadoop.hive.metastore.uris=thrift://ilum-hive-metastore:9083
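Outside of Ilum, the same settings can also be applied programmatically when building a Spark session. A minimal sketch in Scala, reusing the metastore URI above as an example value:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: enable the Hive catalog when building a session manually.
// Ilum sets these options for you automatically; this is only needed
// when configuring a Spark application outside of Ilum.
val spark = SparkSession.builder()
  .appName("hive-catalog-example")
  .config("spark.sql.catalogImplementation", "hive")
  .config("spark.hadoop.hive.metastore.uris", "thrift://ilum-hive-metastore:9083")
  .enableHiveSupport()
  .getOrCreate()
```

With `enableHiveSupport()`, tables created via `saveAsTable` or `CREATE TABLE` are registered in the Hive Metastore and survive across sessions.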
However, Ilum simplifies this process by automatically configuring all Ilum Jobs, Groups, and SQL queries in the SQL Viewer to use the Hive Catalog, eliminating the need for manual setup.
Setting up Hive Metastore: Metadata Database
Typically, to use the Hive Catalog, you must set up the Hive Metastore by completing the following steps:
- Set up the Database: Configure a database to store Hive metadata.
- Set up the Hive Metastore Server: Install and configure the Hive Metastore service.
- Configure the Server to Use the Database: Modify the appropriate XML configuration files (e.g., hive-site.xml) to connect the Hive Metastore to the database.
- Configure the Server to Use the Storage: Set up the storage backend (e.g., HDFS, S3, GCS) by updating the relevant XML files.
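As an illustration of steps 3 and 4, a minimal hive-site.xml connecting the Metastore to a PostgreSQL database and an S3 warehouse might look like the sketch below (hostnames, credentials, and paths are placeholders, not Ilum defaults):

```xml
<configuration>
  <!-- Step 3: JDBC connection to the metadata database (placeholders) -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:postgresql://your-db-host:5432/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.postgresql.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>youruser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>yourpassword</value>
  </property>
  <!-- Step 4: default warehouse location on the storage backend -->
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>s3a://yourbucket/yourfolder</value>
  </property>
</configuration>
```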
These steps can be time-consuming and repetitive.
Ilum simplifies this process by automatically handling the entire Hive Metastore setup, including database and storage configuration.
helm upgrade \
--set ilum-hive-metastore.enabled=true \
--set ilum-core.hiveMetastore.enabled=true \
--reuse-values ilum ilum/ilum
Note: if you use custom credentials for the PostgreSQL database, like this:
helm upgrade \
--set postgresql.auth.username=customuser \
--set postgresql.auth.password="CHOOSE PASSWORD" \
--reuse-values ilum ilum/ilum
You must configure Hive Metastore to use these credentials:
helm upgrade \
--set ilum-hive-metastore.postgresql.auth.password="CHOOSE PASSWORD" \
--set ilum-hive-metastore.postgresql.auth.username=customuser \
--reuse-values ilum ilum/ilum
Setting up Hive Metastore: Storage
Storage, also referred to as a Warehouse, is the location where the actual data is stored. Hive supports various storage backends, including:
- HDFS (Hadoop Distributed File System)
- Amazon S3 Buckets and MinIO
- Google Cloud Storage (GCS)
- Windows Azure Storage Blob (WASBS)
Typically, you would need to set up one of these storage options and configure Hive's metastore connection accordingly within an XML file.
However, with Ilum, the S3 MinIO storage is pre-configured for you, and the Hive Metastore is already set up to use it by default.
Configuring Other Storage Backends
If you prefer to use an alternative storage backend, you can configure Hive to work with it by reconfiguring your helm values:
For S3 storage or MinIO:
helm upgrade \
--set ilum-hive-metastore.storage.type="s3" \
--set ilum-hive-metastore.storage.metastore.warehouse="s3a://yourbucket/yourfolder" \
--set ilum-hive-metastore.storage.s3.accessKey="your_access_key" \
--set ilum-hive-metastore.storage.s3.secretKey="your_secret_key" \
--set ilum-hive-metastore.storage.s3.host="yourhost" \
--set ilum-hive-metastore.storage.s3.port=yourport \
--reuse-values ilum ilum/ilum
For GCS:
helm upgrade \
--set ilum-hive-metastore.storage.type="gcs" \
--set ilum-hive-metastore.storage.metastore.warehouse="gs://my-gcs-bucket/path/to/folder/" \
--set ilum-hive-metastore.storage.gcs.clientEmail="your@email" \
--set ilum-hive-metastore.storage.gcs.privateKey="yourprivatekey" \
--set ilum-hive-metastore.storage.gcs.privateKeyId="privatekeyid" \
--reuse-values ilum ilum/ilum
For WASBS:
helm upgrade \
--set ilum-hive-metastore.storage.type="wasbs" \
--set ilum-hive-metastore.storage.metastore.warehouse="wasbs://yourcontainer@youraccountname.blob.core.windows.net/path/to/folder/" \
--set ilum-hive-metastore.storage.wasbs.accountName="youraccountname" \
--set ilum-hive-metastore.storage.wasbs.accessKey="youraccesskey" \
--reuse-values ilum ilum/ilum
For HDFS:
Here you will need to specify your HDFS configuration in
ilum-hive-metastore.storage.hdfs.config
You can provide it in hdfs-config.yaml:
helm upgrade \
--set ilum-hive-metastore.storage.type="hdfs" \
--set ilum-hive-metastore.storage.metastore.warehouse="hdfs://node:port/path/to/folder" \
--set ilum-hive-metastore.storage.hdfs.hadoopUsername="yourusername" \
--reuse-values ilum ilum/ilum \
-f hdfs-config.yaml
Table Metadata Gathering in Ilum
Ilum uses the Hive client to gather metadata about tables and their columns. This is how everything becomes visible in Table Explorer.
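You can inspect the same catalog metadata yourself from the SQL Viewer; `target_table` below is the example table created earlier and stands in for any table of yours:

```sql
SHOW DATABASES;
SHOW TABLES;
DESCRIBE TABLE target_table;
```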
Features
Right now, Ilum supports only one Hive Metastore, which is created automatically. We are developing infrastructure for adding your own Hive Metastores and metastores of other types.