<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Ilum Blog - Free Data Lakehouse]]></title><description><![CDATA[Ilum's blog includes technical tutorials about Spark on K8s. With Ilum, you can take advantage of the scalability and flexibility of K8s to easily manage and monitor your data lakehouse.. ]]></description><link>https://blog.ilum.cloud/</link><image><url>https://blog.ilum.cloud/favicon.png</url><title>Ilum Blog - Free Data Lakehouse</title><link>https://blog.ilum.cloud/</link></image><generator>Ghost 5.85</generator><lastBuildDate>Mon, 08 Dec 2025 06:03:12 GMT</lastBuildDate><atom:link href="https://blog.ilum.cloud/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[How to run Apache Spark on Kubernetes in less than 5min]]></title><description><![CDATA[Installing Apache Spark on Kubernetes can be streamlined using tools like Ilum. This guide provides step-by-step instructions to efficiently run Spark on your Kubernetes cluster.]]></description><link>https://blog.ilum.cloud/spark-on-kubernetes/</link><guid isPermaLink="false">62ddb59d0799a50001dec6e4</guid><category><![CDATA[News]]></category><category><![CDATA[Updated]]></category><dc:creator><![CDATA[Ilum]]></dc:creator><pubDate>Thu, 20 Nov 2025 15:11:00 GMT</pubDate><media:content url="https://blog.ilum.cloud/content/images/2022/07/Spark-and-Kubernetes-Ilum.png" medium="image"/><content:encoded><![CDATA[<img src="https://blog.ilum.cloud/content/images/2022/07/Spark-and-Kubernetes-Ilum.png" alt="How to run Apache Spark on Kubernetes in less than 5min"><p><br>Tools like Ilum will go a long way in simplifying the process of installing Apache Spark on Kubernetes. This guide will take you through, step by step, how to run Spark well on your Kubernetes cluster. 
With Ilum, you can deploy, manage, and scale Apache Spark clusters with minimal effort.</p>
<!--kg-card-begin: html-->
<div id="post-table-of-contents" max-depth="h2"></div>
<!--kg-card-end: html-->
<h2 id="introduction">Introduction</h2><p>Today, we will showcase how to get up and running with Apache Spark on K8s. There are many ways to do that, but most are complex and require several configurations. We will use <a href="https://ilum.cloud/?ref=blog.ilum.cloud"><strong>Ilum</strong></a> since it will do all the cluster setup for us. In the next blog post, we will compare this approach with the Spark Operator.</p><figure class="kg-card kg-image-card"><a href="https://ilum.cloud/?ref=blog.ilum.cloud"><img src="/blog/content/images/2022/11/Groups-dark-mode-1.png" class="kg-image" alt="How to run Apache Spark on Kubernetes in less than 5min" loading="lazy" width="2000" height="523" srcset="/blog/content/images/size/w600/2022/11/Groups-dark-mode-1.png 600w,/blog/content/images/size/w1000/2022/11/Groups-dark-mode-1.png 1000w,/blog/content/images/size/w1600/2022/11/Groups-dark-mode-1.png 1600w,/blog/content/images/size/w2400/2022/11/Groups-dark-mode-1.png 2400w" sizes="(min-width: 720px) 720px"></a></figure><p>Ilum is a free, modular data lakehouse that makes it easy to deploy and manage Apache Spark clusters. It offers a simple API for defining and managing Spark, handles all dependencies for you, and helps you build your own managed Spark service.</p><p>With Ilum, you can deploy Spark clusters in minutes and start running Spark applications immediately. Ilum lets you easily scale your Spark clusters out and in, and manage multiple clusters from a single UI.</p><p>Even if you are relatively new to Apache Spark on Kubernetes, Ilum makes getting started easy.</p><h2 id="step-by-step-guide-to-install-apache-spark-on-kubernetes">Step-by-Step Guide to Install Apache Spark on Kubernetes</h2><h3 id="quick-start">Quick start</h3><p>We assume that you have a Kubernetes cluster up and running; if you don&apos;t, check out these instructions to set up a Kubernetes cluster with minikube. 
<a href="https://minikube.sigs.k8s.io/docs/start/?ref=blog.ilum.cloud">Check how to install minikube</a>.</p><h3 id="setup-a-local-kubernetes-cluster">Set up a local Kubernetes cluster</h3><ul><li><strong>Install Minikube:</strong> Execute the following command to start Minikube with the recommended resources: 6 vCPUs, 12288 MB of memory, and the metrics-server add-on, which is necessary for monitoring.</li></ul><pre><code class="language-bash">minikube start --cpus 6 --memory 12288 --addons metrics-server</code></pre><p>Once you have a running Kubernetes cluster, installing Ilum is just a few commands away:</p><h2 id="install-spark-on-kubernetes-with-ilum">Install Spark on Kubernetes with Ilum</h2><div class="kg-card kg-callout-card kg-callout-card-green"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">You can also use the module selection tool from <a href="https://ilum.cloud/resources/getting-started?ref=blog.ilum.cloud" rel="noreferrer">here</a> to include features like SQL or n8n.</div></div><ol><li><strong>Add </strong><a href="https://artifacthub.io/packages/helm/ilum/ilum?ref=blog.ilum.cloud" rel="noreferrer"><strong>Ilum Helm Repository</strong></a></li></ol><pre><code class="language-bash">helm repo add ilum https://charts.ilum.cloud</code></pre><ol start="2"><li><strong>Install Ilum in Your Cluster</strong></li></ol><p>Here we have a few options. </p><p>a) The recommended option is to start with a few additional modules turned on (Data Lineage, SQL, Data Catalog).</p><pre><code class="language-bash">helm install ilum ilum/ilum \
--set ilum-hive-metastore.enabled=true \
--set ilum-core.metastore.enabled=true \
--set ilum-sql.enabled=true \
--set ilum-core.sql.enabled=true \
--set global.lineage.enabled=true</code></pre><p>b) You can also start with the most basic option, which includes only Spark and Jupyter notebooks.</p><pre><code class="language-bash">helm install ilum ilum/ilum</code></pre><p>c) You can also use Ilum&apos;s module selection tool <a href="https://ilum.cloud/resources/getting-started?ref=blog.ilum.cloud" rel="noreferrer">here</a>.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">A slow internet connection combined with a large Docker image can cause the Kubernetes pod to fail due to the 2-minute image download timeout. That&apos;s why we recommend pulling the image manually beforehand:<br><br><b><strong style="white-space: pre-wrap;">minikube ssh docker pull ilum/core:6.6.0</strong></b></div></div><p>This setup should take around two minutes. Ilum will deploy into your Kubernetes cluster, preparing it to handle Spark jobs.</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/11/ilum_spark_pods.png" class="kg-image" alt="How to run Apache Spark on Kubernetes in less than 5min" loading="lazy" width="564" height="143"></figure><p>Once Ilum is installed, you can access the UI via port-forwarding at localhost:9777.</p><ol start="3"><li><strong>Port Forward to Access UI:</strong>&#xA0;Use Kubernetes port-forwarding to access the Ilum UI.</li></ol><pre><code class="language-bash">kubectl port-forward svc/ilum-ui 9777:9777
</code></pre><p>Use <strong>admin/admin</strong> as the default credentials. You can change them during the <a href="https://ilum.cloud/docs/security/authentication/?ref=blog.ilum.cloud#internal-authentication" rel="noreferrer">deployment process</a>.</p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="/blog/content/images/2025/11/spark_ui_6_6_0-1.png" class="kg-image" alt="How to run Apache Spark on Kubernetes in less than 5min" loading="lazy" width="1907" height="581" srcset="/blog/content/images/size/w600/2025/11/spark_ui_6_6_0-1.png 600w,/blog/content/images/size/w1000/2025/11/spark_ui_6_6_0-1.png 1000w,/blog/content/images/size/w1600/2025/11/spark_ui_6_6_0-1.png 1600w,/blog/content/images/2025/11/spark_ui_6_6_0-1.png 1907w" sizes="(min-width: 1200px) 1200px"><figcaption><span style="white-space: pre-wrap;">spark ui</span></figcaption></figure><p>That&#x2019;s it: your Kubernetes cluster is now configured to handle Spark jobs. Ilum provides a simple API and UI that make it easy to submit Spark applications. You can also use the good old <a href="https://spark.apache.org/docs/latest/submitting-applications.html?ref=blog.ilum.cloud">spark-submit</a>.</p><h3 id="deploy-spark-application-on-kubernetes">Deploy a Spark application on Kubernetes</h3><p>Let&#x2019;s now start a simple Spark job. We&apos;ll use the &quot;SparkPi&quot; example from the Spark <a href="https://spark.apache.org/docs/latest/submitting-applications.html?ref=blog.ilum.cloud#launching-applications-with-spark-submit">documentation</a>. 
You can use the jar file from this <a href="https://ilum.cloud/release/latest/spark-examples_2.12-3.1.2.jar?ref=blog.ilum.cloud">link</a>.</p><figure class="kg-card kg-embed-card kg-card-hascaption"><iframe width="200" height="113" src="https://www.youtube.com/embed/e5-KQgE7Yhc?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen title="Ilum - How to create simple spark job on kubernetes (New)"></iframe><figcaption><p><span style="white-space: pre-wrap;">ilum add spark job</span></p></figcaption></figure><p>Ilum will create a Spark driver pod on Kubernetes using a Spark 3.x Docker image. You can control the number of Spark executor pods and scale them across multiple nodes. This is the simplest way to submit Spark applications to K8s.</p><figure class="kg-card kg-image-card kg-width-wide"><img src="/blog/content/images/2022/07/spark_pod.png" class="kg-image" alt="How to run Apache Spark on Kubernetes in less than 5min" loading="lazy" width="848" height="75" srcset="/blog/content/images/size/w600/2022/07/spark_pod.png 600w,/blog/content/images/2022/07/spark_pod.png 848w"></figure><p>Running Spark on Kubernetes is easy and frictionless with Ilum. It configures your whole cluster and presents you with an interface where you can manage and monitor your Spark cluster. We believe Spark applications on Kubernetes are the future of big data. With Kubernetes, Spark applications can handle huge volumes of data much more reliably, yielding accurate insights and enabling data-driven decisions.</p><h3 id="submitting-a-spark-application-to-kubernetes-old-style">Submitting a Spark Application to Kubernetes (old style)</h3>
<p>Submitting a Spark job to a Kubernetes cluster involves using the <code>spark-submit</code> script with configurations specific to Kubernetes. Here&apos;s a step-by-step guide:</p>
<p><strong>Steps</strong>:</p>
<ol>
<li>
<p><strong>Prepare the Spark Application</strong>: Package your Spark application into a JAR file (for Scala/Java) or a Python script.</p>
</li>
<li>
<p><strong>Use <code>spark-submit</code> to Deploy</strong>: Execute the <code>spark-submit</code> command with Kubernetes-specific options:</p>
<pre><code class="language-bash">./bin/spark-submit \
  --master k8s://https://&lt;k8s-apiserver-host&gt;:&lt;k8s-apiserver-port&gt; \
  --deploy-mode cluster \
  --name spark-app \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.container.image=&lt;your-spark-image&gt; \
  local:///path/to/your-app.jar
</code></pre>
<p>Replace:</p>
<ul>
<li><code>&lt;k8s-apiserver-host&gt;</code>: Your Kubernetes API server host.</li>
<li><code>&lt;k8s-apiserver-port&gt;</code>: Your Kubernetes API server port.</li>
<li><code>&lt;your-spark-image&gt;</code>: The Docker image containing Spark.</li>
<li><code>local:///path/to/your-app.jar</code>: Path to your application JAR within the Docker image.</li>
</ul>
</li>
</ol>
<p><strong>Key Configurations</strong>:</p>
<ul>
<li><code>--master</code>: Specifies the Kubernetes API URL.</li>
<li><code>--deploy-mode</code>: Set to <code>cluster</code> to run the driver inside the Kubernetes cluster.</li>
<li><code>--name</code>: Names your Spark application.</li>
<li><code>--class</code>: Main class of your application.</li>
<li><code>--conf spark.executor.instances</code>: Number of executor pods.</li>
<li><code>--conf spark.kubernetes.container.image</code>: Docker image for Spark pods.</li>
</ul>
<p>For more details, refer to the <a href="https://spark.apache.org/docs/latest/running-on-kubernetes.html?ref=blog.ilum.cloud">Apache Spark Documentation on Running on Kubernetes</a>.</p>
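In practice, only a handful of these values change between environments, so the command is often wrapped in a small script for repeatable use from a laptop or a CI pipeline. A minimal sketch, assuming illustrative defaults (the master host, image name, and jar path are placeholders you would override):

```shell
#!/usr/bin/env bash
# Sketch of a reusable spark-submit wrapper for Kubernetes.
# All defaults below are illustrative; override them via environment variables.
set -euo pipefail

K8S_MASTER="${K8S_MASTER:-https://127.0.0.1:6443}"
SPARK_IMAGE="${SPARK_IMAGE:-spark:3.5.3}"
APP_JAR="${APP_JAR:-local:///opt/spark/examples/jars/spark-examples.jar}"
MAIN_CLASS="${MAIN_CLASS:-org.apache.spark.examples.SparkPi}"
EXECUTORS="${EXECUTORS:-3}"

# Assemble the same command shown above from the overridable values.
SUBMIT_CMD="./bin/spark-submit \
  --master k8s://${K8S_MASTER} \
  --deploy-mode cluster \
  --name spark-app \
  --class ${MAIN_CLASS} \
  --conf spark.executor.instances=${EXECUTORS} \
  --conf spark.kubernetes.container.image=${SPARK_IMAGE} \
  ${APP_JAR}"

echo "$SUBMIT_CMD"   # print the command; run it with: eval "$SUBMIT_CMD"
```

Keeping the cluster-specific values in environment variables means the same script works unchanged across development and CI environments.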
<h3 id="2-creating-a-custom-docker-image-for-spark">Creating a Custom Docker Image for Spark</h3>
<p>Building a custom Docker image allows you to package your Spark application and its dependencies, ensuring consistency across environments.</p>
<p><strong>Steps</strong>:</p>
<ol>
<li>
<p><strong>Create a Dockerfile</strong>: Define the environment and dependencies.</p>
<pre><code class="language-dockerfile"># Use the official Spark base image
FROM spark:3.5.3

# Set environment variables
ENV SPARK_HOME=/opt/spark
ENV PATH=$PATH:$SPARK_HOME/bin

# Copy your application JAR into the image
COPY your-app.jar $SPARK_HOME/examples/jars/

# Set the entry point to run your application
ENTRYPOINT [&quot;spark-submit&quot;, &quot;--class&quot;, &quot;org.apache.spark.examples.SparkPi&quot;, &quot;--master&quot;, &quot;local[4]&quot;, &quot;/opt/spark/examples/jars/your-app.jar&quot;]
</code></pre>
<p>In this Dockerfile:</p>
<ul>
<li><code>FROM spark:3.5.3</code>: Uses the official Spark image as the base.</li>
<li><code>ENV</code>: Sets environment variables for Spark.</li>
<li><code>COPY</code>: Adds your application JAR to the image.</li>
<li><code>ENTRYPOINT</code>: Defines the default command to run your Spark application.</li>
</ul>
</li>
<li>
<p><strong>Build the Docker Image</strong>: Use Docker to build your image.</p>
<pre><code class="language-bash">docker build -t your-repo/your-spark-app:latest .
</code></pre>
<p>Replace <code>your-repo/your-spark-app</code> with your Docker repository and image name.</p>
</li>
<li>
<p><strong>Push the Image to a Registry</strong>: Upload your image to a Docker registry accessible by your Kubernetes cluster.</p>
<pre><code class="language-bash">docker push your-repo/your-spark-app:latest
</code></pre>
</li>
</ol>
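If you are experimenting on the local minikube cluster from the quick start, you can skip the registry push and load the freshly built image directly into the cluster's container runtime (the image name matches the build step above):

```shell
# Make the locally built image visible to minikube without a remote registry,
# avoiding a push/pull round-trip while iterating.
minikube image load your-repo/your-spark-app:latest
```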
<p>While using <code>spark-submit</code> is a common method for deploying Spark applications, it may not be the most efficient approach for production environments. Manual submissions can lead to inconsistencies and are challenging to integrate into automated workflows. To enhance efficiency and maintainability, leveraging Ilum&apos;s REST API is recommended.</p>
<p><strong>Automating Spark Deployments with Ilum&apos;s REST API</strong></p>
<p>Ilum offers a robust RESTful API that enables seamless interaction with Spark clusters. This API facilitates the automation of job submissions, monitoring, and management, making it an ideal choice for Continuous Integration/Continuous Deployment (CI/CD) pipelines.</p>
<p><strong>Benefits of Using Ilum&apos;s REST API:</strong></p>
<ul>
<li><strong>Automation</strong>: Integrate Spark job submissions into CI/CD pipelines, reducing manual intervention and potential errors.</li>
<li><strong>Consistency</strong>: Ensure uniform deployment processes across different environments.</li>
<li><strong>Scalability</strong>: Easily manage multiple Spark clusters and jobs programmatically.</li>
</ul>
<p><strong>Example: Submitting a Spark Job via Ilum&apos;s REST API</strong></p>
<p>To submit a Spark job using Ilum&apos;s REST API, you can make an HTTP POST request with the necessary parameters. Here&apos;s a simplified example using <code>curl</code>:</p>
<pre><code class="language-bash">curl -X POST https://&lt;ilum-server&gt;/api/v1/job/submit \
  -H &quot;Content-Type: multipart/form-data&quot; \
  -F &quot;name=example-job&quot; \
  -F &quot;clusterName=default&quot; \
  -F &quot;jobClass=org.apache.spark.examples.SparkPi&quot; \
  -F &quot;jars=@/path/to/your-app.jar&quot; \
  -F &quot;jobConfig=spark.executor.instances=3;spark.executor.memory=4g&quot;
</code></pre>
<p>In this command:</p>
<ul>
<li><code>name</code>: Specifies the job name.</li>
<li><code>clusterName</code>: Indicates the target cluster.</li>
<li><code>jobClass</code>: Defines the main class of your Spark application.</li>
<li><code>jars</code>: Uploads your application JAR file.</li>
<li><code>jobConfig</code>: Sets Spark configurations, such as the number of executors and memory allocation.</li>
</ul>
<p>For detailed information on the API endpoints and parameters, refer to the <a href="https://ilum.cloud/docs/api/?ref=blog.ilum.cloud">Ilum API Documentation</a>.</p>
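When scripting this call in a pipeline, you typically capture the HTTP response and extract an identifier for follow-up status checks. A sketch using a hard-coded sample response; the "jobId" field name here is an assumption for illustration only, so consult the API documentation for the actual response shape:

```shell
# Extract a job id from a JSON response with POSIX tools (no jq required).
# The sample payload below is a stand-in for: response=$(curl -s -X POST ... )
# and its "jobId" field is a hypothetical example, not a documented schema.
response='{"jobId":"example-123","state":"SUBMITTED"}'
job_id=$(printf '%s' "$response" | sed -n 's/.*"jobId":"\([^"]*\)".*/\1/p')
echo "Submitted job: $job_id"
```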
<p><strong>Enhancing Efficiency with Interactive Spark Jobs</strong></p>
<p>Beyond automating job submissions, transforming Spark jobs into interactive microservices can significantly optimize resource utilization and response times. Ilum supports the creation of long-running interactive Spark sessions that can process real-time data without the overhead of initializing a new Spark context for each request.</p>
<p><strong>Advantages of Interactive Spark Jobs:</strong></p>
<ul>
<li><strong>Reduced Latency</strong>: Eliminates the need to start a new Spark context for every job, leading to faster execution.</li>
<li><strong>Resource Optimization</strong>: Maintains a persistent Spark context, allowing for efficient resource management.</li>
<li><strong>Scalability</strong>: Handles multiple requests concurrently within the same Spark session.</li>
</ul>
<p>To implement an interactive Spark job with Ilum, you can define a Spark application that listens for incoming data and processes it in real-time. This approach is particularly beneficial for applications requiring immediate data processing and response.</p>
<p>For a comprehensive guide on setting up interactive Spark jobs and optimizing your Spark cluster, refer to Ilum&apos;s blog post: <a href="https://ilum.cloud/blog/how-to-optimize-your-spark-cluster-with-interactive-spark-jobs/?ref=blog.ilum.cloud">How to Optimize Your Spark Cluster with Interactive Spark Jobs</a>.</p>
<p>By integrating Ilum&apos;s REST API and adopting interactive Spark jobs, you can streamline your Spark workflows, enhance automation, and achieve a more efficient and scalable data processing environment.</p>
<h2 id="advantages-of-using-ilum-to-run-spark-on-kubernetes">Advantages of Using Ilum to Run Spark on Kubernetes</h2><p>Ilum is equipped with an intuitive UI and a resilient API for scaling and handling Spark clusters, letting you configure multiple Spark applications from a single interface. Here are a few key features:</p><ol><li><strong>Ease of Use</strong>: Ilum simplifies Spark configuration and management on Kubernetes with an intuitive Spark UI, eliminating complex setup processes.</li><li><strong>Quick Deployment:</strong>&#xA0;Set up, deploy, and scale Spark clusters in minutes, so you can start executing and testing applications right away.</li><li><strong>Scalability:</strong> Using the Kubernetes API, easily scale Spark clusters up or down to meet your data processing needs, ensuring optimal resource utilization.</li><li><strong>Modularity</strong>: Ilum comes with a modular framework that allows users to choose and combine different components such as Spark History Server, Apache Jupyter, Minio, and much more.</li></ol><h2 id="migrating-from-apache-hadoop-yarn">Migrating from Apache Hadoop YARN</h2><p>With Apache Hadoop YARN stagnating, more and more organizations are looking to migrate from YARN to Kubernetes. There are several reasons for this, but the most common is that Kubernetes provides a more resilient and flexible platform for managing big data workloads.<br><br>Migrating a data processing platform away from Apache Hadoop YARN is generally difficult. There are many factors to consider when making such a switch: data compatibility, processing speed, and cost. 
However, with careful planning and execution, the migration can go smoothly and successfully.</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/08/Selection_353.png" class="kg-image" alt="How to run Apache Spark on Kubernetes in less than 5min" loading="lazy" width="1913" height="468" srcset="/blog/content/images/size/w600/2022/08/Selection_353.png 600w,/blog/content/images/size/w1000/2022/08/Selection_353.png 1000w,/blog/content/images/size/w1600/2022/08/Selection_353.png 1600w,/blog/content/images/2022/08/Selection_353.png 1913w" sizes="(min-width: 720px) 720px"></figure><p>Kubernetes is a natural fit for big data workloads because of its inherent ability to scale horizontally. With Hadoop YARN, you are limited to the number of nodes in your cluster, whereas you can increase or reduce the number of nodes in a Kubernetes cluster on demand. <br><br>It also offers features that are not available in YARN, such as self-healing and horizontal autoscaling.</p><h2 id="time-to-make-the-switch-to-kubernetes">Time to make the Switch to Kubernetes?</h2><p>As the world of big data continues to evolve, so do the tools and technologies used to manage it. For years, Apache Hadoop YARN has been the de facto standard for resource management in big data environments. But with the rise of containerization and orchestration technologies like Kubernetes, is it time to make the switch?</p><p>Kubernetes has been gaining popularity as a container orchestration platform, and for good reason. It&apos;s flexible, scalable, and relatively easy to use. If you&apos;re still using traditional VM-based infrastructure, now might be the time to make the switch to Kubernetes.</p><p>If you&apos;re working with containers, then you should definitely care about Kubernetes. 
It can help you manage and deploy your containers more effectively, and it&apos;s especially useful if you&apos;re working with a lot of containers or if you&apos;re deploying your containers to a cloud platform.</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/08/ui-dashboard.png" class="kg-image" alt="How to run Apache Spark on Kubernetes in less than 5min" loading="lazy" width="2000" height="1196" srcset="/blog/content/images/size/w600/2022/08/ui-dashboard.png 600w,/blog/content/images/size/w1000/2022/08/ui-dashboard.png 1000w,/blog/content/images/size/w1600/2022/08/ui-dashboard.png 1600w,/blog/content/images/size/w2400/2022/08/ui-dashboard.png 2400w" sizes="(min-width: 720px) 720px"></figure><p>Kubernetes is also a great choice if you&apos;re looking for an orchestration tool that&apos;s backed by a major tech company. Google has been using Kubernetes for years to manage its own containerized applications, and they&apos;ve invested a lot of time and resources into making it a great tool.</p><p>There is no clear winner in the YARN vs. Kubernetes debate. The best solution for your organization will depend on your specific needs and use cases. If you are looking for a more flexible and scalable resource management solution, Kubernetes is worth considering. If you need better support for legacy applications, YARN may be a better option.<br><br>Whichever platform you choose, Ilum can help you get the most out of it. Our platform is designed to work with both YARN and Kubernetes, and our team of experts can help you choose and implement the right solution for your organization.</p><h2 id="managed-spark-cluster">Managed Spark cluster</h2><p>A managed Spark cluster is a cloud-based solution that makes it easy to provision and manage Spark clusters. It provides a web-based interface for creating and managing Spark clusters, as well as a set of APIs for automating cluster management tasks. 
Managed Spark clusters are often used by data scientists and developers who want to quickly provision and manage Spark clusters without having to worry about the underlying infrastructure.</p><p>Ilum provides the ability to create and manage your own spark cluster, which can be run in any environment, including cloud, on-premises, or a mixture of both.</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/09/ilum-ferret-0.png" class="kg-image" alt="How to run Apache Spark on Kubernetes in less than 5min" loading="lazy" width="1024" height="1002" srcset="/blog/content/images/size/w600/2022/09/ilum-ferret-0.png 600w,/blog/content/images/size/w1000/2022/09/ilum-ferret-0.png 1000w,/blog/content/images/2022/09/ilum-ferret-0.png 1024w" sizes="(min-width: 720px) 720px"></figure><h2 id="the-pros-of-apache-spark-on-kubernetes">The Pros of Apache Spark on Kubernetes</h2><p>There has been some debate about whether Apache Spark should run on Kubernetes. </p><p>Some people argue that Kubernetes is too complex and that Spark should continue to run on its own dedicated cluster manager or stay in the cloud. Others argue that Kubernetes is the future of big data processing and that Spark should embrace it.</p><p>We are in the latter camp. We believe that Kubernetes is the future of big data processing and that Apache Spark should run on Kubernetes.</p><p>The biggest benefit of using Spark on Kubernetes is that it allows for much easier scaling of Spark applications. This is because Kubernetes is designed to handle deployments of large numbers of concurrent containers. So, if you have a Spark application that needs to process a lot of data, you can simply deploy more containers to the Kubernetes cluster to process the data in parallel. This is much easier than setting up a new Spark cluster on EMR each time you need to scale up your processing. You can run it on any cloud platform (AWS, Google Cloud, Azure, etc.) or on-premises. 
This means that you can easily move your Spark applications from one environment to another without having to worry about changing your cluster manager.</p><p>Another enormous benefit is that it allows for more flexible workflows. For example, if you need to process data from multiple sources, you can easily deploy different containers for each source and have them all processed in parallel. This is much easier than trying to manage a complex workflow on a single Spark cluster.</p><p>Kubernetes has several security features that make it a more attractive option for running Spark applications. For example, Kubernetes supports role-based access control, which allows you to fine-tune who has access to your Spark cluster.</p><p>So there you have it. These are just some of the reasons why we believe that Apache Spark should run on Kubernetes. If you&apos;re not convinced, we encourage you to try it out for yourself. We think you&apos;ll be surprised at how well it works. </p><h2 id="additional-resources">Additional Resources</h2><ul><li><a href="https://minikube.sigs.k8s.io/docs/start/?ref=blog.ilum.cloud" rel="noreferrer">Check how to install Minikube</a></li><li><a href="https://kubernetes.io/docs/home/?ref=blog.ilum.cloud" rel="noreferrer">Kubernetes Documentation</a></li><li><a href="https://ilum.cloud/?ref=blog.ilum.cloud" rel="noreferrer">Ilum Official Website</a></li><li><a href="https://ilum.cloud/docs/?ref=blog.ilum.cloud" rel="noreferrer">Ilum Official Documentation</a></li><li><a href="https://artifacthub.io/packages/helm/ilum/ilum?ref=blog.ilum.cloud" rel="noreferrer">Ilum Helm Chart</a></li></ul><h2 id="conclusion">Conclusion</h2><p>Ilum simplifies the process of installing and managing Apache Spark on Kubernetes, making it an ideal choice for both beginners and experienced users. 
By following this guide, you&#x2019;ll have a functional Spark cluster running on Kubernetes in no time.</p><div class="kg-card kg-button-card kg-align-center"><a href="https://ilum.cloud/?ref=blog.ilum.cloud" class="kg-btn kg-btn-accent">Try it, it&apos;s free</a></div>]]></content:encoded></item><item><title><![CDATA[How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes]]></title><description><![CDATA[Step-by-step guide to deploy the Ilum data platform on Google Cloud in under 30 minutes. Build a modern data lakehouse with Apache Spark, Kubernetes, and SQL.]]></description><link>https://blog.ilum.cloud/install-ilum-data-platform-google-cloud/</link><guid isPermaLink="false">68e7e868ebd34a00016649f6</guid><dc:creator><![CDATA[Florian Roscheck]]></dc:creator><pubDate>Fri, 10 Oct 2025 14:58:11 GMT</pubDate><media:content url="https://blog.ilum.cloud/content/images/2025/10/thumb_final_edited2-10.png" medium="image"/><content:encoded><![CDATA[<img src="https://blog.ilum.cloud/content/images/2025/10/thumb_final_edited2-10.png" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes"><p>Learn how to deploy the <strong>Ilum Data Platform</strong> on Google Cloud for development and experimentation and see how to go from a fast <strong>data lakehouse</strong> setup to production-ready data pipelines. Prefer a guided path? Level up with the Ilum Course at <a href="https://ilumcourse.com/?utm_source=ilum_blog&amp;utm_medium=article&amp;utm_campaign=gcp_setup" rel="noreferrer"><u>IlumCourse.com</u></a>.</p><h2 id="why-ilum-as-your-data-lakehouse-platform">Why Ilum as Your Data Lakehouse Platform</h2>
<p><strong>Ilum</strong> is a Kubernetes-native <strong>data platform</strong> that lets teams stand up a modern <strong>data lakehouse</strong> in minutes, not months. You get:</p>
<ul>
<li><strong>Apache Spark</strong> as primary compute (with optional <strong>Trino</strong> for SQL),</li>
<li>Built-in SQL editor &amp; notebooks, full <strong>Jupyter</strong> integration,</li>
<li>Orchestration &amp; operations for Spark jobs and &#x201C;virtual clusters&#x201D;,</li>
<li>Lineage &amp; versioning (table diffs, ERD &amp; column-level lineage),</li>
<li>Integrated OSS: <strong>Airflow</strong>, <strong>Superset</strong> (BI), <strong>MLflow</strong>, <strong>Gitea</strong>, and more.</li>
</ul>
<p>This guide shows a lean, single-VM setup on Google Cloud, perfect for learning, <strong>POCs</strong>, and <strong>sandboxing</strong> your lakehouse workflows.</p>
<h2 id="who-this-guide-is-for">Who This Guide Is For</h2>
<ul>
<li><strong>Developers &amp; Data Engineers</strong> evaluating a modern <strong>data platform</strong> on Google Cloud</li>
<li><strong>Analysts</strong> wanting quick SQL + dashboarding on a <strong>data lakehouse</strong></li>
<li>Teams needing a repeatable dev environment before hardening to production</li>
</ul>
<h2 id="what-you%E2%80%99ll-build">What You&#x2019;ll Build</h2>
<p>A single-VM <strong>Ilum data platform</strong> on Google Cloud that runs a development-grade <strong>data lakehouse</strong>:</p>
<ul>
<li>Launch <strong>Kubernetes</strong>, <strong>Helm</strong>, and Ilum via a startup script</li>
<li>Run Spark jobs, SQL queries, and create <strong>Superset</strong> dashboards</li>
<li>Explore built-in <strong>data lineage</strong> and table <strong>versioning</strong></li>
</ul>
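The console walkthrough below can also be scripted with the gcloud CLI once you know which settings work for you. A hedged sketch; the project, zone, machine type, and disk size are illustrative choices for a single-node sandbox, not official sizing guidance:

```shell
# Illustrative only: provision a VM roughly sized for a single-node Ilum sandbox.
# Adjust the project, zone, machine type, and disk size to your own setup.
gcloud compute instances create ilum-vm \
  --project=ilum-vm \
  --zone=us-central1-a \
  --machine-type=e2-standard-8 \
  --boot-disk-size=100GB \
  --image-family=ubuntu-2204-lts \
  --image-project=ubuntu-os-cloud
```

The step-by-step console flow in this guide remains the reference; the CLI form is simply a repeatable alternative.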
<h2 id="your-fast-track-to-a-google-cloud-data-lakehouse-with-ilum">Your Fast Track to a Google Cloud Data Lakehouse with Ilum</h2><p>Setting up a new data platform can feel overwhelming: Too many tools, too many options. Ilum changes that. In this quick-start guide, you&#x2019;ll get Ilum running on Google Cloud in under 30 minutes &#x2013; a perfect setup for starting to explore Ilum.</p><p>This guide is perfect for developers, data engineers, and analysts who want to understand Ilum&#x2019;s workflow in a hands-on way, but might not want or be able to install Ilum on their local machine. By the end, you will be ready to experiment with real data microservices, Spark jobs, and dashboards &#x2013; just like in the official Ilum Course.</p><h2 id="step-1-log-into-google-cloud">Step 1: Log into Google Cloud</h2><p>First, sign in to the Google Cloud Console. If you don&#x2019;t yet have an account, create one. You&#x2019;ll get a free trial and can start experimenting right away.</p><p>Here is what you should see once you have logged in:</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-ae6a8942-99ee-47eb-9d17-f25ba92f73bb.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="764" height="452" srcset="/blog/content/images/size/w600/2025/10/data-src-image-ae6a8942-99ee-47eb-9d17-f25ba92f73bb.png 600w,/blog/content/images/2025/10/data-src-image-ae6a8942-99ee-47eb-9d17-f25ba92f73bb.png 764w" sizes="(min-width: 720px) 720px"></figure><h2 id="step-2-create-a-new-project">Step 2: Create a new Project</h2><p>For billing and project structuring purposes, we will create a new project to run the VM (Virtual Machine) in. Here is how to create a project:</p><p>a. 
Click on &quot;Select a project&quot;:&#xA0;</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-fe6abbf8-68ca-4509-ae50-11586c76b956.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="286" height="138"></figure><p>&#xA0;<br>b. Then, click on &quot;<strong>New project</strong>&quot; on the top right: </p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-fa67e4b6-5580-4e36-a561-7a66de3e1e88.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="298" height="147"></figure><p>&#xA0;</p><p>c. Enter &quot;<strong>ilum-vm</strong>&quot; as a project name. You can leave the organization unassigned, then click &quot;<strong>Create</strong>&quot;:&#xA0;</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-852232bd-7792-462a-865a-d31f57ab9663.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="478" height="207"></figure><h2 id="step-3-set-up-billing-don%E2%80%99t-worry-it%E2%80%99s-minimal">Step 3: Set up Billing (Don&#x2019;t Worry, it&#x2019;s minimal!)</h2><p>For experimenting with Ilum, we&#x2019;ll run its setup on a single virtual machine (VM). Expect costs around $0.10-$0.40 per hour, depending on your configuration. You will have to pay for the virtual machine and the associated disk space you use for your Ilum installation &#x2013;&#xA0;unless your free trial credits cover it. For this, we need to set up billing.</p><ol><li>Click on the search bar at the top and search for &quot;<strong>billing</strong>&quot;. 
Once the&#xA0;<strong>&quot;Billing&quot;&#xA0;</strong>product appears, click it.</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-755eceac-2381-4a50-9520-adcf7990a082.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="723" height="301" srcset="/blog/content/images/size/w600/2025/10/data-src-image-755eceac-2381-4a50-9520-adcf7990a082.png 600w,/blog/content/images/2025/10/data-src-image-755eceac-2381-4a50-9520-adcf7990a082.png 723w" sizes="(min-width: 720px) 720px"></figure><ol start="2"><li>Click on &quot;Create account&quot;:</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-ed14cb5c-8099-4049-9f12-25834d00e167.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="356" height="166"></figure><ol start="3"><li>Now, follow all instructions and add a payment method until you see the newly created billing account Overview:</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-9a8ef28b-ab5d-497f-ab02-2dfa783026fe.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="570" height="284"></figure><ol start="4"><li>Next, we need to assign our &quot;<strong>ilum-vm&quot;</strong>&#xA0;project to this billing account. 
Search for the project in the search bar and click it:</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-159dfe7c-1333-4059-a56c-90e2acb1ea91.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="613" height="164" srcset="/blog/content/images/size/w600/2025/10/data-src-image-159dfe7c-1333-4059-a56c-90e2acb1ea91.png 600w,/blog/content/images/2025/10/data-src-image-159dfe7c-1333-4059-a56c-90e2acb1ea91.png 613w"></figure><ol start="5"><li>Click on &quot;<strong>Billing&quot;</strong>&#xA0;in the &quot;Quick access&quot; area:</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-a7e6e826-67e1-46cd-9696-c5526096e24a.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="781" height="464" srcset="/blog/content/images/size/w600/2025/10/data-src-image-a7e6e826-67e1-46cd-9696-c5526096e24a.png 600w,/blog/content/images/2025/10/data-src-image-a7e6e826-67e1-46cd-9696-c5526096e24a.png 781w" sizes="(min-width: 720px) 720px"></figure><ol start="6"><li>Click on &quot;Link a billing account&quot;:</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-b2591d6d-5bb3-4e67-9aae-f19c5ee67784.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="699" height="282" srcset="/blog/content/images/size/w600/2025/10/data-src-image-b2591d6d-5bb3-4e67-9aae-f19c5ee67784.png 600w,/blog/content/images/2025/10/data-src-image-b2591d6d-5bb3-4e67-9aae-f19c5ee67784.png 699w"></figure><ol start="7"><li>Select the billing account you created in step 3 above and then click on &quot;<strong>Set account</strong>&quot;:</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-56cae8b2-25d7-498c-813b-8cf430c663e0.png" class="kg-image" 
alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="622" height="296" srcset="/blog/content/images/size/w600/2025/10/data-src-image-56cae8b2-25d7-498c-813b-8cf430c663e0.png 600w,/blog/content/images/2025/10/data-src-image-56cae8b2-25d7-498c-813b-8cf430c663e0.png 622w"></figure><p>Perfect, we are done setting up the billing for our new project!</p><h2 id="step-4-enable-the-compute-api">Step 4: Enable the Compute API</h2><p>To create a virtual machine (a &quot;Compute Engine&quot; instance in Google Cloud), we first need to enable the Compute Engine API. Here is how that works:</p><ol><li>Search for&#xA0;<strong>&quot;Compute Engine API&quot;</strong>&#xA0;in the search bar and click on it:</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-83c74758-6553-4fd6-942e-a99e5feb619a.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="754" height="154" srcset="/blog/content/images/size/w600/2025/10/data-src-image-83c74758-6553-4fd6-942e-a99e5feb619a.png 600w,/blog/content/images/2025/10/data-src-image-83c74758-6553-4fd6-942e-a99e5feb619a.png 754w" sizes="(min-width: 720px) 720px"></figure><ol start="2"><li>Click on&#xA0;<strong>&quot;Enable&quot;</strong>:</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-c6323c57-97d7-40f7-8ddc-2c2f476fc82c.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="428" height="308"></figure><p>Enabling may take a minute or two. Here is how things should look after the API has been enabled. 
Note the &quot;Status: Enabled&quot; statement:</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-7b63c6bc-d0a0-4d0f-b548-9fcb00863944.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="726" height="366" srcset="/blog/content/images/size/w600/2025/10/data-src-image-7b63c6bc-d0a0-4d0f-b548-9fcb00863944.png 600w,/blog/content/images/2025/10/data-src-image-7b63c6bc-d0a0-4d0f-b548-9fcb00863944.png 726w" sizes="(min-width: 720px) 720px"></figure><h2 id="step-5-install-the-google-cloud-cli">Step 5: Install the Google Cloud CLI</h2><p>Setting up the virtual machine to run Ilum is much easier using the Google Cloud command line interface (CLI). Install it as described in the&#xA0;<a href="https://cloud.google.com/sdk/docs/install?ref=blog.ilum.cloud"><u>Google documentation</u></a>. Make sure to run&#xA0;<strong>gcloud init</strong>&#xA0;to authenticate with your Google account.</p><div class="kg-card kg-callout-card kg-callout-card-accent"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">Can&apos;t Install the Google Cloud CLI?</strong></b><br><br>On work computers, a lack of admin rights might mean you are unable to install the Google Cloud CLI. You can still install Ilum on Google Cloud and use it in the course. 
Scroll further down in this article to find instructions for an alternative installation route that is slightly more cumbersome than using the CLI, but works without local installation.</div></div><h2 id="step-6-launch-the-virtual-machine">Step 6: Launch the Virtual Machine</h2><p>With all preparations in place, let&apos;s start the virtual machine and install Ilum!</p><h3 id="download-the-startup-script">Download the startup script:</h3><p><a href="https://www.headindata.com/download-gcp-vm-startup-script?ref=blog.ilum.cloud"><u>Click here to download the script</u></a></p><p>The startup script is what we will use to automatically initialize the virtual machine. It will install Kubernetes, Helm, and Ilum &#x2013; steps you could also take manually (but that would take more time).</p><h3 id="instruct-the-google-cloud-cli-to-create-the-machine">Instruct the Google Cloud CLI to create the machine</h3><p>Paste the VM creation command below into a terminal/command line on your computer.</p><p>Make sure to run the terminal from the same directory where you have downloaded the startup script to, or point the command below to the absolute path of the&#xA0;<strong>start_vm.sh</strong>&#xA0;script (the&#xA0;<strong>--metadata-from-file</strong>&#xA0;line).</p><p>You can also adjust the&#xA0;<strong>ZONE</strong>&#xA0;to fit a zone close to you. Here,&#xA0;<strong>europe-north2-a</strong>&#xA0;was selected for its low price and sustainable footprint.</p><pre><code class="language-bash">PROJECT_ID=ilum-vm
ZONE=europe-north2-a 

gcloud config set project $PROJECT_ID 
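# Note: the three Spot-related flags below (--provisioning-model,
# --instance-termination-action, --maintenance-policy) request a cheaper
# preemptible VM; remove them if you prefer an on-demand machine.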
gcloud compute instances create ilum-dev-node \
  --zone $ZONE \
  --machine-type e2-custom-12-18432 \
  --provisioning-model=SPOT \
  --instance-termination-action=STOP \
  --maintenance-policy=TERMINATE \
  --network-interface=subnet=default,network-tier=STANDARD \
  --image-family ubuntu-2204-lts --image-project ubuntu-os-cloud \
  --boot-disk-type=pd-balanced --boot-disk-size=100GB \
  --metadata-from-file startup-script=start_vm.sh</code></pre><p>Once you have made all necessary adjustments, execute the code. Now, the machine should start up. This step installs Kubernetes, Helm, and Ilum so you can explore a development-grade data lakehouse on a single VM.</p><p><strong>Note:</strong>&#xA0;Above, we have chosen a&#xA0;<em>preemptible machine</em>. This keeps cost at roughly a quarter of that of a non-preemptible machine, but has an important disadvantage: when Google Cloud needs the compute for other clients, your virtual machine will be shut down. If you are willing to spend more on your virtual machine to avoid random shutdowns, remove the&#xA0;<strong>--provisioning-model</strong>,&#xA0;<strong>--instance-termination-action</strong>, and&#xA0;<strong>--maintenance-policy</strong>&#xA0;lines from the command above before creating the machine.</p><h3 id="monitor-the-startup-process">Monitor the startup process</h3><p>After 5-10 seconds, execute the following code to watch what is happening inside the virtual machine:</p><pre><code class="language-bash">gcloud compute ssh ilum-dev-node --zone $ZONE -- 'sudo tail -f /var/log/startup-script.log'</code></pre><p>Wait until &quot;Startup script completed successfully.&quot; appears; this may take a couple of minutes:</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-6d6890f8-b846-4f7e-afc4-2cd471ec9dde.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="716" height="322" srcset="/blog/content/images/size/w600/2025/10/data-src-image-6d6890f8-b846-4f7e-afc4-2cd471ec9dde.png 600w,/blog/content/images/2025/10/data-src-image-6d6890f8-b846-4f7e-afc4-2cd471ec9dde.png 716w"></figure><h2 id="step-7-access-the-ilum-ui">Step 7: Access the Ilum UI</h2><p>We&apos;re ready to explore Ilum! 
Run the following command to forward the port of the Ilum UI on the remote virtual machine to your own computer:</p><pre><code class="language-bash">gcloud compute ssh ilum-dev-node --zone=$ZONE -- -L 31777:localhost:31777</code></pre><p>Now, open&#xA0;<a href="http://localhost:31777/?ref=blog.ilum.cloud"><u>http://localhost:31777</u></a>&#xA0;in your browser to access Ilum. Access works as long as the above command is running.</p><p>To log in to your new Ilum installation, use the&#xA0;<strong>username &quot;admin&quot;&#xA0;</strong>and the&#xA0;<strong>password &quot;admin&quot;</strong>. Welcome to your new Ilum installation:</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-049389ef-3f2a-4fa2-9538-39dfef6ccfa5.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="771" height="481" srcset="/blog/content/images/size/w600/2025/10/data-src-image-049389ef-3f2a-4fa2-9538-39dfef6ccfa5.png 600w,/blog/content/images/2025/10/data-src-image-049389ef-3f2a-4fa2-9538-39dfef6ccfa5.png 771w" sizes="(min-width: 720px) 720px"></figure><p>Think of the Ilum UI as the control plane of your data platform: run Spark jobs, write SQL, check data lineage, and build dashboards.</p><p><strong>Important:</strong>&#xA0;This is strictly an installation for development (and course taking!) purposes &#x2013; it lacks the load balancing, security, and resilience needed for a production environment.</p><h2 id="step-8-manage-the-virtual-machine">Step 8: Manage the Virtual Machine</h2><p>As long as the virtual machine is running and disk space is reserved for it, it will incur cost. 
You can learn more about this cost&#xA0;<a href="https://cloud.google.com/products/calculator?hl=en&amp;dl=CjhDaVF5WVRBNU9URmtZeTAyWVRsaExUUXpZakl0T0dabE5DMWpPVEZpTm1VMU1HTXpOVE1RQVE9PRAIGiQxNDQ4RUJENy05REE2LTQyQUMtOTFBMC1DRjMxMTI1NjUwRTc&amp;ref=blog.ilum.cloud"><u>here</u></a>.</p><h3 id="shutting-down">Shutting down</h3><p>When you are not using the machine, e.g. when you have finished your learning session, shut down your machine like this:</p><pre><code class="language-bash">gcloud compute instances stop ilum-dev-node --zone=$ZONE</code></pre><h3 id="restarting">Restarting</h3><p>To restart the machine, use this command:</p><pre><code class="language-bash">gcloud compute instances start ilum-dev-node --zone=$ZONE</code></pre><p>When you have restarted the machine, it will take some minutes until Ilum is up and running again. You can check the status of Ilum&apos;s Kubernetes pods with the following command once you have connected to the virtual machine as shown in the last step:</p><pre><code class="language-bash">kubectl get pods -n ilum</code></pre><p>Ilum is ready to connect once all of its core containers are running again:</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-47173976-a23b-4898-92e6-49c33a67d7e7.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="690" height="382" srcset="/blog/content/images/size/w600/2025/10/data-src-image-47173976-a23b-4898-92e6-49c33a67d7e7.png 600w,/blog/content/images/2025/10/data-src-image-47173976-a23b-4898-92e6-49c33a67d7e7.png 690w"></figure><h3 id="removing">Removing</h3><p>To completely remove the machine, incl. 
the attached storage, and stop it from incurring any cost, use this command:</p><pre><code class="language-bash">gcloud compute instances delete ilum-dev-node --zone &quot;$ZONE&quot; --delete-disks=all</code></pre><h1 id="alternative-route-without-google-cloud-cli">Alternative Route Without Google Cloud CLI</h1><p>If you cannot install the Google Cloud CLI on your machine as described above, then setting up a virtual machine to run Ilum is slightly more involved. Follow the instructions below. This assumes that you have followed the tutorial above up to and including &quot;Step 4: Enable the Compute API&quot;.</p><h2 id="set-up-a-firewall-rule">Set up a Firewall Rule</h2><p>Assuming you intend to use the virtual machine for development purposes only, will not upload sensitive data, and rely on Ilum&apos;s own authentication system, the easiest and most practical way to expose Ilum is to open a port in the virtual machine&apos;s firewall.</p><p>We will only open this port to your own IP &#x2013; decreasing the risk of unintended access by a third party. 
However, if you are on a company network, many devices might share the same IP, which means your colleagues might be able to access the Ilum instance and log in if they know the IP of the instance and have Ilum&apos;s access credentials (the default is &quot;admin&quot; as username and &quot;admin&quot; as password).</p><ol><li>Search for&#xA0;<strong>&quot;firewall&quot;</strong>&#xA0;in the search bar and click on it:</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-c064d05d-5513-42ee-b2cf-df0a1f9617e5.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="615" height="195" srcset="/blog/content/images/size/w600/2025/10/data-src-image-c064d05d-5513-42ee-b2cf-df0a1f9617e5.png 600w,/blog/content/images/2025/10/data-src-image-c064d05d-5513-42ee-b2cf-df0a1f9617e5.png 615w"></figure><ol start="2"><li>Click on&#xA0;&quot;Create firewall rule&quot;:</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-4f983e52-f9b9-44dd-a0ae-aa5841ca9a52.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="570" height="94"></figure><ol start="3"><li>Configure the following settings. Make sure to enter your IP, followed by &quot;/32&quot;. 
You can find out your IP via websites like&#xA0;<a href="https://ipinfo.io/what-is-my-ip?ref=blog.ilum.cloud"><u>ipinfo.io</u></a>.</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-02d5d458-6a07-4ad2-9ba7-37ea43ea481a.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="562" height="1182"></figure><h2 id="configure-and-launch-the-virtual-machine">Configure and Launch the Virtual Machine</h2><ol><li>Search for&#xA0;<strong>&quot;compute engine&quot;&#xA0;</strong>(not &quot;compute engine api&quot; as before)&#xA0;in the search bar and click on it:</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-809735c7-05aa-4a1f-bbb7-238034ff0dfd.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="567" height="154"></figure><ol start="2"><li>Click on&#xA0;&quot;Create instance&quot;:</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-3520e546-6cfb-4fba-bc0e-4585122e9bc7.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="534" height="205"></figure><ol start="3"><li>In the&#xA0;<strong>Machine Configuration</strong>&#xA0;tab, make the following settings:</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-df43e0f4-d4dd-4406-bf40-6e92fb3ef3e2.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="697" height="826" srcset="/blog/content/images/size/w600/2025/10/data-src-image-df43e0f4-d4dd-4406-bf40-6e92fb3ef3e2.png 600w,/blog/content/images/2025/10/data-src-image-df43e0f4-d4dd-4406-bf40-6e92fb3ef3e2.png 697w"></figure><ol start="4"><li>Under&#xA0;<strong>OS and Storage</strong>, set up the virtual machine like 
this:</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-adb03b04-872b-41fa-a5cc-e1a3607baed0.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="615" height="433" srcset="/blog/content/images/size/w600/2025/10/data-src-image-adb03b04-872b-41fa-a5cc-e1a3607baed0.png 600w,/blog/content/images/2025/10/data-src-image-adb03b04-872b-41fa-a5cc-e1a3607baed0.png 615w"></figure><ol start="5"><li>In the&#xA0;<strong>&quot;Data Protection&quot;&#xA0;</strong>tab, select&#xA0;<strong>&quot;No backups&quot;&#xA0;</strong>(this will save cost).</li><li>In the&#xA0;<strong>&quot;Networking&quot;&#xA0;</strong>tab, add the&#xA0;<strong>&quot;ilum-ui&quot;</strong>&#xA0;tag:</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-df1141a0-26a9-441a-b451-d220a12bc5a5.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="747" height="344" srcset="/blog/content/images/size/w600/2025/10/data-src-image-df1141a0-26a9-441a-b451-d220a12bc5a5.png 600w,/blog/content/images/2025/10/data-src-image-df1141a0-26a9-441a-b451-d220a12bc5a5.png 747w" sizes="(min-width: 720px) 720px"></figure><ol start="7"><li>Under&#xA0;&quot;Observability&quot;, disable&#xA0;&quot;Install Ops agent for Monitoring and Logging&quot;&#xA0;(this will save cost).</li><li>In the&#xA0;<strong>&quot;Advanced&quot;&#xA0;</strong>tab: Download the startup script below, open it in a text editor, and copy and paste its contents into the &quot;<strong>Startup script</strong>&quot; field. 
Then, select&#xA0;<strong>&quot;Spot&quot;</strong>&#xA0;as VM provisioning model (optional; read more about this in the note under &quot;Step 6: Launch the Virtual Machine&quot; above):</li></ol><p><a href="https://www.headindata.com/download-gcp-vm-startup-script?ref=blog.ilum.cloud"><u>Click here to download the script</u></a></p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-9aaec3b3-c0df-4b1a-afb1-243d4ba3f3cf.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="750" height="550" srcset="/blog/content/images/size/w600/2025/10/data-src-image-9aaec3b3-c0df-4b1a-afb1-243d4ba3f3cf.png 600w,/blog/content/images/2025/10/data-src-image-9aaec3b3-c0df-4b1a-afb1-243d4ba3f3cf.png 750w" sizes="(min-width: 720px) 720px"></figure><ol start="9"><li>Finally, click on&#xA0;<strong>&quot;Create&quot;</strong>&#xA0;to create the virtual machine:</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-73ada6f0-f37e-42fe-ad02-5d0bbcc23811.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="469" height="140"></figure><ol start="10"><li>Startup, incl. starting Ilum, will take a couple of minutes. You can monitor the status of the Ilum initialization via the logs:</li></ol><figure class="kg-card kg-image-card kg-width-wide"><img src="/blog/content/images/2025/10/data-src-image-7774f8cc-f1f5-48c9-a7fb-d5f604863ea6.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="920" height="423" srcset="/blog/content/images/size/w600/2025/10/data-src-image-7774f8cc-f1f5-48c9-a7fb-d5f604863ea6.png 600w,/blog/content/images/2025/10/data-src-image-7774f8cc-f1f5-48c9-a7fb-d5f604863ea6.png 920w"></figure><p>Once you see &quot;Startup script completed successfully&quot;, Ilum is ready to use. 
(To see live logs, you might have to click on the &quot;stream logs&quot; button on the top right.)</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-614f621d-ed31-4505-a083-5d8532206247.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="548" height="303"></figure><h2 id="connect-to-ilum">Connect to Ilum</h2><p>To connect to Ilum, copy the External IP of the virtual machine into your browser bar (you can find it in Google Cloud in the list of VM instances, see below) and append &quot;:31777&quot; to it.&#xA0;</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-37b0f8a4-d776-47f7-932a-af6eafc612cb.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="737" height="161" srcset="/blog/content/images/size/w600/2025/10/data-src-image-37b0f8a4-d776-47f7-932a-af6eafc612cb.png 600w,/blog/content/images/2025/10/data-src-image-37b0f8a4-d776-47f7-932a-af6eafc612cb.png 737w" sizes="(min-width: 720px) 720px"></figure><p>When you connect to this URL, you should see the Ilum login screen:</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2025/10/data-src-image-1f2eee4d-5820-48f6-8e94-e36887d0ecc4.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="550" height="354"></figure><p>Log in with Ilum&apos;s out-of-the-box credentials to get started (Username: admin, Password: admin).</p><h2 id="manage-the-virtual-machine">Manage the Virtual Machine</h2><p>As long as the virtual machine is running and disk space is reserved for it, it will incur cost. 
You can learn more about this cost&#xA0;<a href="https://cloud.google.com/products/calculator?hl=en&amp;dl=CjhDaVF5WVRBNU9URmtZeTAyWVRsaExUUXpZakl0T0dabE5DMWpPVEZpTm1VMU1HTXpOVE1RQVE9PRAIGiQxNDQ4RUJENy05REE2LTQyQUMtOTFBMC1DRjMxMTI1NjUwRTc&amp;ref=blog.ilum.cloud"><u>here</u></a>.</p><p>It is recommended to&#xA0;<strong>stop</strong>&#xA0;the machine when you are not using Ilum. This will bring cost down significantly, as you will only be paying for persistent storage but not for CPUs and RAM. After having stopped the machine, you can&#xA0;<strong>resume</strong>&#xA0;it when you are experimenting with Ilum again.</p><p>Once you are completely done experimenting with Ilum,&#xA0;delete&#xA0;the machine incl. its attached storage &#x2013; after this, you will not incur any cost.</p><p>All machine management options are available in the menu available via the 3 dots in the far right of the machine table:</p><figure class="kg-card kg-image-card kg-width-wide"><img src="/blog/content/images/2025/10/data-src-image-c52f59be-4d0c-47e5-bc1b-c0c0668a7b19.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="858" height="404" srcset="/blog/content/images/size/w600/2025/10/data-src-image-c52f59be-4d0c-47e5-bc1b-c0c0668a7b19.png 600w,/blog/content/images/2025/10/data-src-image-c52f59be-4d0c-47e5-bc1b-c0c0668a7b19.png 858w"></figure><p><strong>Congratulations! </strong>You now have Ilum up and running on Google Cloud and are ready to experiment, build, and explore. But if you want to rapidly go from &#x201C;it works&#x201D; to production-ready pipelines, Spark microservices, and live dashboards, then the&#xA0;<strong>official Ilum Course</strong>&#xA0;is your fast track. 
In just a few hours you&#x2019;ll build real, deployable components (SQL, Spark jobs, Superset dashboards) and get hands-on guidance from the instructor and the Ilum team.</p><figure class="kg-card kg-image-card"><a href="https://ilumcourse.com/?utm_source=ilum_blog&amp;utm_medium=article&amp;utm_campaign=gcp_setup"><img src="/blog/content/images/2025/10/blog_header_hires.png" class="kg-image" alt="How to Run ILUM Data Lakehouse on Google Cloud in Under 30 Minutes" loading="lazy" width="2000" height="783" srcset="/blog/content/images/size/w600/2025/10/blog_header_hires.png 600w,/blog/content/images/size/w1000/2025/10/blog_header_hires.png 1000w,/blog/content/images/size/w1600/2025/10/blog_header_hires.png 1600w,/blog/content/images/2025/10/blog_header_hires.png 2000w" sizes="(min-width: 720px) 720px"></a></figure><p>Join now!<br><a href="https://ilumcourse.com/?utm_source=ilum_blog&amp;utm_medium=article&amp;utm_campaign=gcp_setup" rel="noreferrer"><u>Enroll in the Ilum Course &#x2192;</u></a></p><h2 id="from-dev-to-production-hardening-your-data-lakehouse">From Dev to Production: Hardening Your Data Lakehouse</h2>
<p>This single-VM setup is for development only. To run Ilum as a production-grade <strong>data platform</strong> on Google Cloud, plan to:</p>
<ul>
<li>Deploy on a managed Kubernetes cluster (e.g., GKE) with <strong>multi-node</strong> resilience</li>
<li>Add <strong>load balancing</strong>, <strong>TLS/HTTPS</strong>, and <strong>OIDC/SSO</strong></li>
<li>Configure <strong>backup/restore</strong>, <strong>autoscaling</strong>, and <strong>observability</strong></li>
<li>Separate compute and storage; use <strong>object storage</strong> for data durability</li>
<li>Set up <strong>access control</strong> and <strong>network policies</strong></li>
</ul>
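<p>As a rough sketch (the cluster name, zone, node count, and machine type below are illustrative assumptions, not sized recommendations), a managed setup can start from a small autoscaling GKE cluster and reuse the same Helm chart as the single-VM install:</p><pre><code class="language-bash"># Illustrative: create a small zonal GKE cluster with autoscaling
gcloud container clusters create ilum-prod \
  --zone europe-north2-a \
  --num-nodes 3 \
  --machine-type e2-standard-8 \
  --enable-autoscaling --min-nodes 3 --max-nodes 9

# Point kubectl at the new cluster
gcloud container clusters get-credentials ilum-prod --zone europe-north2-a

# Install Ilum into its own namespace via Helm
helm repo add ilum https://charts.ilum.cloud
helm repo update
helm install ilum ilum/ilum --namespace ilum --create-namespace</code></pre><p>From there, TLS, SSO, backups, and network policies are layered on with your usual Kubernetes tooling and the Ilum docs linked below.</p>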
<h2 id="keep-building-with-ilum">Keep Building with Ilum</h2>
<ul>
<li><strong>Architecture Overview</strong> &#x2192; <a href="https://ilum.cloud/docs/architecture/?ref=blog.ilum.cloud">https://ilum.cloud/docs/architecture/</a></li>
<li><strong>Use Cases</strong> (e.g., Transactions) &#x2192; <a href="https://ilum.cloud/docs/use_cases/transaction/?ref=blog.ilum.cloud">https://ilum.cloud/docs/use_cases/transaction/</a></li>
<li><strong>Docs Home</strong> &#x2192; <a href="https://ilum.cloud/docs/?ref=blog.ilum.cloud">https://ilum.cloud/docs/</a></li>
<li><strong>Ilum Course</strong> &#x2192; <a href="https://ilumcourse.com/?utm_source=ilum_blog&amp;utm_medium=article&amp;utm_campaign=gcp_setup">https://IlumCourse.com</a></li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Running Jupyter on Kubernetes with Spark (and Why Ilum Makes It Easy)]]></title><description><![CDATA[End-to-end guide to Data Science on Kubernetes: launch Jupyter with Spark using Ilum&#x2019;s free data lakehouse, Livy API, and production best practices.]]></description><link>https://blog.ilum.cloud/data-science-on-kubernetes/</link><guid isPermaLink="false">634c949fb1575600013b87dd</guid><dc:creator><![CDATA[Ilum]]></dc:creator><pubDate>Fri, 08 Aug 2025 11:41:00 GMT</pubDate><media:content url="https://blog.ilum.cloud/content/images/2022/12/ilum-ferret2-3.png" medium="image"/><content:encoded><![CDATA[<img src="https://blog.ilum.cloud/content/images/2022/12/ilum-ferret2-3.png" alt="Running Jupyter on Kubernetes with Spark (and Why Ilum Makes It Easy)"><p><br>If you&#x2019;ve done data science long enough, you&#x2019;ve met this moment: you open a Jupyter notebook, hit <strong>Shift+Enter</strong>, and the cell just&#x2026; hangs. Somewhere a Spark driver is starved for memory, or a cluster is &#x201C;almost&#x201D; configured, or the library versions don&#x2019;t quite agree. You&#x2019;re here to explore data, not babysit infrastructure.</p><p>That&#x2019;s where <strong>Kubernetes</strong> shines and where <strong>Ilum</strong>, a free, cloud-native data lakehouse focused on <strong>running and monitoring Apache Spark on Kubernetes</strong>, helps you go from &#x201C;why is this stuck?&#x201D; to &#x201C;let&#x2019;s try that model&#x201D; with far less friction.</p><p>In this post, we&#x2019;ll take a narrative walk from a clean laptop to a working <strong>Jupyter</strong> and <strong>Apache Zeppelin</strong> setup on <strong>Kubernetes</strong>, both backed by <strong>Apache Spark</strong> and connected through a <strong>Livy-compatible API</strong>. 
You&#x2019;ll see how Ilum plugs into the picture, why it matters for day-to-day data work, and how to scale from a single demo box to a real cluster without rewriting your notebooks.</p><h2 id="why-kubernetes-for-data-science">Why Kubernetes for Data Science<br></h2><p>Kubernetes (K8s) gives you three things notebooks secretly crave:</p><ul><li><strong>Elasticity</strong>: Executors scale up when a join explodes, and back down when you&#x2019;re idle.</li><li><strong>Isolation</strong>: Each user or team runs in a clean, containerized environment&#x2014;no shared-conda-env roulette.</li><li><strong>Repeatability</strong>: &#x201C;It works on my machine&#x201D; becomes &#x201C;it works because it&#x2019;s declarative.&#x201D;</li></ul><p>Pair K8s with <strong>Spark</strong> and a notebook front-end, and you get an interactive analytics platform that grows with your data. Pair it with <strong>Ilum</strong>, and you also get the plumbing and observability&#x2014;<strong>logs and metrics</strong>&#x2014;that keep you moving when things go sideways.</p><h2 id="the-missing-piece-speaking-livy">The missing piece: speaking Livy<br></h2><p>Both <strong>Jupyter (via Sparkmagic)</strong> and <strong>Zeppelin (via its Livy interpreter)</strong> speak the <strong>Livy REST API</strong> to manage Spark sessions. Ilum implements that API through an embedded <strong><code>ilum-livy-proxy</code></strong>, so your notebooks can create and use Spark sessions while Ilum handles the Spark-on-K8s lifecycle behind the scenes.</p><p>Think of it this way:</p><blockquote><strong>Notebook &#x2192; Livy API &#x2192; Ilum &#x2192; Spark on Kubernetes</strong><br>You write cells. Ilum speaks Kubernetes. Spark does the heavy lifting.</blockquote><p>No special notebook rewrites, no custom glue.</p><h2 id="a-gentle-from-scratch-run-through">A gentle, from-scratch run-through<br></h2><p>Let&#x2019;s build a tiny playground locally with <strong>Minikube</strong>. 
Later, you can point the exact same setup at GKE/EKS/AKS or an on-prem K8s distribution.</p><h3 id="1-start-a-small-kubernetes-cluster">1) Start a small Kubernetes cluster</h3><p>You don&#x2019;t need production horsepower to try this&#x2014;just a few CPUs and memory.</p><pre><code class="language-bash">minikube start --cpus 6 --memory 12288 --addons metrics-server

kubectl get nodes</code></pre><p>You&#x2019;ll see a node and the metrics add-on come up. That&#x2019;s enough to host Spark drivers and executors for a demo.</p><h3 id="2-install-ilum-with-jupyter-zeppelin-and-the-livy-proxy">2) Install Ilum with Jupyter, Zeppelin, and the Livy proxy</h3><p>Ilum ships Helm charts so you don&#x2019;t have to handcraft manifests.</p><pre><code class="language-bash">helm repo add ilum https://charts.ilum.cloud
helm repo update

helm install ilum ilum/ilum \
  --set ilum-zeppelin.enabled=true \
  --set ilum-jupyter.enabled=true \
  --set ilum-livy-proxy.enabled=true
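
# optionally block until all pods report Ready before moving on
# (sketch: assumes the default namespace; adjust -n as needed)
kubectl wait --for=condition=Ready pod --all --timeout=10m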

kubectl get pods -w</code></pre><p>Grab a coffee while pods settle. When the dust clears, you have Ilum&#x2019;s core plus bundled <strong>Jupyter</strong> and <strong>Zeppelin</strong> ready to talk to Spark through the <strong>Livy-compatible proxy</strong>.</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/12/jupyter_zeppelin_pods.png" class="kg-image" alt="Running Jupyter on Kubernetes with Spark (and Why Ilum Makes It Easy)" loading="lazy" width="367" height="196"></figure><h3 id="accessing-the-ilum-ui-after-installation">Accessing the Ilum UI after installation</h3><p><strong>a) Port-forward (quick local access)</strong></p><pre><code class="language-bash">kubectl port-forward svc/ilum-ui 9777:9777</code></pre><p>Open: <code>http://localhost:9777</code></p><p><strong>b) NodePort (stable for testing)</strong><br>For testing, a <strong>NodePort</strong> is enabled by default (avoids port-forward drops).<br>Open: <code>http://&lt;KUBERNETES_NODE_IP&gt;:31777</code><br><em>Tip:</em> Get the node IP with <code>kubectl get nodes -o wide</code> (or <code>minikube ip</code> on Minikube).</p><p><strong>c) Minikube shortcut</strong></p><pre><code class="language-bash">minikube service ilum-ui</code></pre><p>This opens the service in your browser (or prints the URL) using your Minikube IP.</p><p><strong>d) Production</strong><br>Use an <strong>Ingress</strong> (TLS, domain, auth). 
See: <a href="https://ilum.cloud/docs/configuration/ilum-ui/?ref=blog.ilum.cloud#ilum-ui-ingress-parameters" rel="noopener">https://ilum.cloud/docs/configuration/ilum-ui/#ilum-ui-ingress-parameters</a></p><hr><h2 id="first-contact-jupyter-sparkmagic-via-ilum-ui">First contact: Jupyter + Sparkmagic (via Ilum UI)</h2><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/12/jupyter_logo.png" class="kg-image" alt="Running Jupyter on Kubernetes with Spark (and Why Ilum Makes It Easy)" loading="lazy" width="684" height="437" srcset="/blog/content/images/size/w600/2022/12/jupyter_logo.png 600w,/blog/content/images/2022/12/jupyter_logo.png 684w"></figure><p>Open the <strong>Ilum UI</strong> in your browser. From the left navigation bar, go to <strong>Notebooks</strong>.</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/12/jupyter_ui.png" class="kg-image" alt="Running Jupyter on Kubernetes with Spark (and Why Ilum Makes It Easy)" loading="lazy" width="1274" height="654" srcset="/blog/content/images/size/w600/2022/12/jupyter_ui.png 600w,/blog/content/images/size/w1000/2022/12/jupyter_ui.png 1000w,/blog/content/images/2022/12/jupyter_ui.png 1274w" sizes="(min-width: 720px) 720px"></figure><p>In Jupyter:</p><ol><li>Create a <strong>Python 3</strong> notebook.</li><li>Load Sparkmagic:</li></ol><pre><code class="language-python">%load_ext sparkmagic.magics</code></pre><ol start="3"><li>Open the spark session manager</li></ol><pre><code class="language-python">%manage_spark</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="/blog/content/images/2025/08/Screenshot-from-2025-08-09-16-38-06.png" class="kg-image" alt="Running Jupyter on Kubernetes with Spark (and Why Ilum Makes It Easy)" loading="lazy" width="750" height="955" srcset="/blog/content/images/size/w600/2025/08/Screenshot-from-2025-08-09-16-38-06.png 600w,/blog/content/images/2025/08/Screenshot-from-2025-08-09-16-38-06.png 750w" sizes="(min-width: 
720px) 720px"><figcaption><span style="white-space: pre-wrap;">jupyter ilum spark form</span></figcaption></figure><p><strong>Basic Settings</strong></p><ul><li><strong>Endpoint</strong> &#x2013; Select the preconfigured Livy endpoint (usually <code>http://ilum-livy-proxy:8998</code>). This is how Jupyter/Sparkmagic talks to Ilum.</li><li><strong>Cluster</strong> &#x2013; Choose the target cluster (e.g., <code>default</code>). If you manage multiple K8s clusters in Ilum, pick the one you want your driver/executors to run on.</li><li><strong>Session Name</strong> &#x2013; Any short identifier (e.g., <code>eda-january</code>). You&#x2019;ll see this in Sparkmagic session lists.</li><li><strong>Language</strong> &#x2013; <code>python</code> (PySpark) or <code>scala</code>. Most Jupyter users go with Python.</li><li><strong>Spark Image</strong> &#x2013; The container image for your Spark driver/executors (e.g., <code>ilum/spark:3.5.3-delta</code>). Images tagged with <code>-delta</code> already include Delta Lake.</li><li><strong>Extra Packages</strong> &#x2013; Comma-separated extras to pull into the session (e.g., <code>numpy,delta</code>).<ul><li>Tip: if you selected a <code>-delta</code> image, you usually don&#x2019;t need to add <code>delta</code> again.</li></ul></li><li><strong>Enable autopause</strong> &#x2013; When checked, Ilum will automatically pause the session after it&#x2019;s idle for a while to save resources. 
You can resume it from the UI.</li></ul><p><strong>Resource Settings</strong></p><ul><li><strong>Driver Settings</strong><ul><li><strong>Driver Memory</strong> &#x2013; Start with <code>1g</code> for demos; bump to <code>2&#x2013;4g</code> for heavier notebooks.</li><li><strong>Driver Cores</strong> &#x2013; <code>1</code> is fine for exploratory work; increase if your driver does more coordination/collects.</li></ul></li><li><strong>Executor Settings</strong> (collapsed by default)<ul><li>Configure only if you want to override defaults; many users rely on dynamic allocation (below).</li></ul></li></ul><p><strong>More Advanced Options</strong></p><ul><li><strong>Custom Spark Config</strong> &#x2013; JSON map for <code>spark.*</code> keys (e.g., event logs, S3 creds, serializer). Example: <code>{<br>  &quot;spark.eventLog.enabled&quot;: &quot;true&quot;,<br>  &quot;spark.sql.adaptive.enabled&quot;: &quot;true&quot;<br>}</code><br></li><li><strong>SQL Extension</strong> &#x2013; Pre-fills for Delta Lake: <code>io.delta.sql.DeltaSparkSessionExtension</code>. Leave as-is if you plan to read/write Delta tables.</li><li><strong>Driver Extra Java Options</strong> &#x2013; JVM flags for the driver. The defaults (<code>-Divy.cache.dir=/tmp -Divy.home=/tmp</code>) keep Ivy dependency caches inside the container.</li><li><strong>Executor Extra Java Options</strong> &#x2013; Same idea, but for executors. Leave empty unless you need specific JVM flags.</li><li><strong>Dynamic Allocation</strong> &#x2013; Let Spark scale executors automatically.<ul><li><strong>Min Executors</strong> &#x2013; Floor for scaling (e.g., <code>1</code>).</li><li><strong>Initial Executors</strong> &#x2013; Startup size (e.g., <code>2&#x2013;3</code>).</li><li><strong>Max Executors</strong> &#x2013; Ceiling (e.g., <code>10</code> for demos, higher in prod).</li></ul></li><li><strong>Shuffle Partitions</strong> &#x2013; Number of partitions for wide ops (e.g., <code>200</code>). 
Rule of thumb: start near 2&#x2013;3&#xD7; total executor cores, then tune.</li></ul><p>Click <strong>Create Session</strong>. Ilum will start the Spark driver and executors on Kubernetes; the first pull can take a minute if images are new. When the status flips to <strong>available</strong>, you&#x2019;re ready to run cells:</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/12/spark_session_started.png" class="kg-image" alt="Running Jupyter on Kubernetes with Spark (and Why Ilum Makes It Easy)" loading="lazy" width="1117" height="186" srcset="/blog/content/images/size/w600/2022/12/spark_session_started.png 600w,/blog/content/images/size/w1000/2022/12/spark_session_started.png 1000w,/blog/content/images/2022/12/spark_session_started.png 1117w" sizes="(min-width: 720px) 720px"></figure><pre><code class="language-python">%%spark
spark.range(0, 100000).selectExpr(&quot;count(*) as n&quot;).show()</code></pre><pre><code class="language-python">%%spark
from pyspark.sql import Row
rows = [Row(id=1, city=&quot;Warsaw&quot;), Row(id=2, city=&quot;Riyadh&quot;), Row(id=3, city=&quot;Austin&quot;)]
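# (sketch) the same rows as a distributed DataFrame, built on the cluster
df = spark.createDataFrame(rows)
df.show()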
print(rows)</code></pre><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/12/jupyter_ilum.png" class="kg-image" alt="Running Jupyter on Kubernetes with Spark (and Why Ilum Makes It Easy)" loading="lazy" width="1442" height="811" srcset="/blog/content/images/size/w600/2022/12/jupyter_ilum.png 600w,/blog/content/images/size/w1000/2022/12/jupyter_ilum.png 1000w,/blog/content/images/2022/12/jupyter_ilum.png 1442w" sizes="(min-width: 720px) 720px"></figure><p>It&#x2019;s a small example, but the important part is what just happened: <strong>Jupyter spoke Livy, Ilum created a Spark session on Kubernetes, and Spark did the work</strong>. You didn&#x2019;t touch a single YAML by hand.</p><h2 id="or-if-you-prefer-apache-zeppelin">Or if you prefer: Apache Zeppelin</h2><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/12/zeppelin_classic_logo.png" class="kg-image" alt="Running Jupyter on Kubernetes with Spark (and Why Ilum Makes It Easy)" loading="lazy" width="462" height="313"></figure><p>Some teams love Zeppelin for its multi-language paragraphs and shareable notes. That works here too.</p><ol><li>To execute code, we need to create a note:</li></ol><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/12/zeppelin_step_1.png" class="kg-image" alt="Running Jupyter on Kubernetes with Spark (and Why Ilum Makes It Easy)" loading="lazy" width="968" height="451" srcset="/blog/content/images/size/w600/2022/12/zeppelin_step_1.png 600w,/blog/content/images/2022/12/zeppelin_step_1.png 968w" sizes="(min-width: 720px) 720px"></figure><p>2. 
As the communication with Ilum is handled via livy-proxy, we need to choose livy as the default interpreter.</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/12/create_ilum.png" class="kg-image" alt="Running Jupyter on Kubernetes with Spark (and Why Ilum Makes It Easy)" loading="lazy" width="634" height="391" srcset="/blog/content/images/size/w600/2022/12/create_ilum.png 600w,/blog/content/images/2022/12/create_ilum.png 634w"></figure><p>3. Now let&#x2019;s open the note and put some code into the paragraph:</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/12/zeppelin_step_2.png" class="kg-image" alt="Running Jupyter on Kubernetes with Spark (and Why Ilum Makes It Easy)" loading="lazy" width="901" height="682" srcset="/blog/content/images/size/w600/2022/12/zeppelin_step_2.png 600w,/blog/content/images/2022/12/zeppelin_step_2.png 901w" sizes="(min-width: 720px) 720px"></figure><p><br>Like Jupyter, Zeppelin also comes with a predefined configuration for Ilum. You can customize the settings easily. 
Just open the context menu in the top right corner and click the interpreter button.</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/12/zeppelin_step_3.png" class="kg-image" alt="Running Jupyter on Kubernetes with Spark (and Why Ilum Makes It Easy)" loading="lazy" width="228" height="268"></figure><p>There is a long list of interpreters and their properties that could be customized.</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/12/zeppelin_step_4.png" class="kg-image" alt="Running Jupyter on Kubernetes with Spark (and Why Ilum Makes It Easy)" loading="lazy" width="1426" height="379" srcset="/blog/content/images/size/w600/2022/12/zeppelin_step_4.png 600w,/blog/content/images/size/w1000/2022/12/zeppelin_step_4.png 1000w,/blog/content/images/2022/12/zeppelin_step_4.png 1426w" sizes="(min-width: 720px) 720px"></figure><p>Zeppelin provides 3 different modes to run the interpreter process: shared, scoped, and isolated. You can learn more about the interpreter binding mode <a href="https://zeppelin.apache.org/docs/0.8.0/usage/interpreter/interpreter_binding_mode.html?ref=blog.ilum.cloud">here</a>.</p><p>Same experience: Zeppelin sends code through Livy. Ilum spins up and manages the session on K8s. Spark runs it.</p><hr><h2 id="a-quick-detour-into-production-thinking-without-killing-the-vibe">A quick detour into production thinking (without killing the vibe)</h2><p>Here are the ideas you&#x2019;ll care about when this graduates from demo to team-wide platform:</p><ul><li><strong>Storage</strong>: For a lakehouse, use <strong>object storage</strong> (S3/MinIO/GCS/Azure Blob). 
Keep <strong>Spark event logs</strong> there so the <strong>History Server</strong> can give you post-mortems that aren&#x2019;t just vibes.</li><li><strong>Security</strong>: Put Jupyter/Zeppelin behind SSO (OIDC/Keycloak), scope access with Kubernetes <strong>RBAC</strong>, and keep secrets in a manager, not in a notebook cell.</li><li><strong>Autoscaling</strong>: Let the cluster scale node pools; let <strong>Spark dynamic allocation</strong> manage executors. Your wallet and your patience will thank you.</li><li><strong>Costs</strong>: Spot/preemptible nodes for executors, right-size memory/cores, and avoid tiny files (Parquet/Delta/Iceberg for the win).</li></ul><p>You don&#x2019;t need to implement all of this today. The point is: <strong>you&#x2019;re not stuck</strong>. The notebook you wrote for Minikube is the same notebook you&#x2019;ll run next quarter on EKS.</p><h2 id="field-notes-little-problems-you%E2%80%99ll-actually-hit">Field notes: little problems you&#x2019;ll actually hit</h2><ul><li><strong>&#x201C;Session stuck starting.&#x201D;</strong> Usually resource pressure. Either give Minikube a bit more (<code>--memory 14336</code>) or lower Spark requests/limits for the demo.</li><li><strong><code>ImagePullBackOff</code>.</strong> Your node can&#x2019;t reach the registry, or you need <code>imagePullSecrets</code>. Easy fix, don&#x2019;t overthink it.</li><li><strong>Slow reads on big datasets.</strong> You&#x2019;re paying the <strong>tiny-file tax</strong> or skipping predicate pushdown. Compact to Parquet/Delta and filter early.</li></ul><p>The good news: Ilum&#x2019;s <strong>logs and metrics</strong> make these less mysterious. You&#x2019;ll still debug&#x2014;but with tools, not folklore.</p><h2 id="do-you-actually-need-kubernetes-for-data-science">Do you actually need Kubernetes for data science?</h2><p>Strictly speaking? <strong>No.</strong> Plenty of useful analysis runs on a single machine. 
But as soon as your team grows or your data size stops being cute, <strong>Kubernetes</strong> buys you standardized environments, sane isolation, and predictable scaling. The more people share the same platform, the more those properties matter.</p><p>And the nice part is: with <strong>Ilum</strong>, moving to Spark on K8s doesn&#x2019;t require tearing up your notebooks or learning the entire Kubernetes dictionary on day one. You point <strong>Jupyter/Zeppelin</strong> at a <strong>Livy-compatible</strong> endpoint and keep going.</p><h2 id="where-to-go-next">Where to go next</h2><ul><li>Keep this demo, but try a real dataset: NYC Taxi, clickstream, retail baskets, anything columnar and not tiny.</li><li>Add <strong>Spark event logging</strong> to object storage so the <strong>History Server</strong> can tell you what actually happened.</li><li>If you&#x2019;re on a cloud provider, deploy the same chart to <strong>GKE/EKS/AKS</strong>, add ingress + TLS, and connect SSO.</li></ul><p>If you want a deeper dive, auth, storage classes, GPU pools for deep learning, or an example migration from <strong>YARN to Kubernetes, </strong>say the word and I&#x2019;ll spin up a follow-up with concrete manifests.</p><p>Copy-paste corner</p><pre><code class="language-bash"># Start local Kubernetes
minikube start --cpus 6 --memory 12288 --addons metrics-server

# Install Ilum with Jupyter, Zeppelin, and Livy proxy
helm repo add ilum https://charts.ilum.cloud
helm repo update
helm install ilum ilum/ilum \
  --set ilum-zeppelin.enabled=true \
  --set ilum-jupyter.enabled=true \
  --set ilum-livy-proxy.enabled=true

# Open Jupyter and get the token
kubectl port-forward svc/ilum-jupyter 8888:8888
kubectl logs -l app.kubernetes.io/name=ilum-jupyter

# Open Zeppelin
kubectl port-forward svc/ilum-zeppelin 8080:8080
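# --- optional: speak the Livy REST API directly (what Sparkmagic does for you) ---
# Sketch only. Assumptions: the proxy Service is named ilum-livy-proxy and
# listens on 8998, and session ids start at 0 on a fresh proxy; adjust to your install.
kubectl port-forward svc/ilum-livy-proxy 8998:8998 &

# create a PySpark session (POST /sessions)
curl -s -X POST http://localhost:8998/sessions \
  -H 'Content-Type: application/json' \
  -d '{&quot;kind&quot;: &quot;pyspark&quot;}'

# once the session state is &quot;idle&quot;, submit code (POST /sessions/{id}/statements)
curl -s -X POST http://localhost:8998/sessions/0/statements \
  -H 'Content-Type: application/json' \
  -d '{&quot;code&quot;: &quot;spark.range(100).count()&quot;}'

# fetch the statement result (GET /sessions/{id}/statements/{id})
curl -s http://localhost:8998/sessions/0/statements/0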
</code></pre><pre><code class="language-python"># Jupyter cell to test:
%load_ext sparkmagic.magics
%manage_spark  # choose the predefined Ilum endpoint, then create session

%%spark
spark.range(0, 100000).selectExpr(&quot;count(*) as n&quot;).show()
</code></pre><h2 id="a-gentle-nudge-to-try-ilum">A gentle nudge to try Ilum</h2><p>Ilum is free, <strong>cloud-native</strong>, and built to make <strong>Spark on Kubernetes</strong> practical for actual teams&#x2014;not just demo videos. You get the <strong>Livy-compatible endpoint</strong>, <strong>interactive sessions</strong>, and <strong>logs/metrics</strong> all in one place, so your notebooks feel like notebooks again.</p><ul><li><a href="https://ilum.cloud/resources/getting-started?ref=blog.ilum.cloud">https://ilum.cloud/resources/getting-started</a></li></ul>]]></content:encoded></item><item><title><![CDATA[Data Lakehouse: Transforming Enterprise Data Management]]></title><description><![CDATA[Explore the concept of the data lakehouse, a modern approach to data management that combines the benefits of data lakes and data warehouses.]]></description><link>https://blog.ilum.cloud/understanding-the-data-lakehouse/</link><guid isPermaLink="false">66f4988cebd34a00016645ff</guid><dc:creator><![CDATA[Ilum]]></dc:creator><pubDate>Sat, 23 Nov 2024 21:14:00 GMT</pubDate><media:content url="https://blog.ilum.cloud/content/images/2024/09/data-lakehouse-ferret.webp" medium="image"/><content:encoded><![CDATA[<img src="https://blog.ilum.cloud/content/images/2024/09/data-lakehouse-ferret.webp" alt="Data Lakehouse: Transforming Enterprise Data Management"><p>In recent years,&#xA0;<a href="https://ilum.cloud/data-lakehouse?ref=blog.ilum.cloud">data lakehouses</a>&#xA0;have emerged as an essential component for managing expansive data systems. Acting as the bridge between traditional data warehouses and contemporary data lakes, they bring together the strengths of both. This integration allows us to handle large data volumes efficiently and solve critical challenges faced in the data science landscape.</p><p>By blending the high-performance aspects of data warehouses with the scalability of data lakes, data lakehouses offer a unique solution. 
They address issues relating to data storage, management, and accessibility, making them indispensable in our digital era. As we explore this concept further, we&apos;ll uncover why data lakehouses are superior to the systems we once relied upon and the crucial role they play in ensuring data security and governance.</p><h3 id="key-takeaways">Key Takeaways</h3><ul><li>Data lakehouses combine features of data lakes and data warehouses.</li><li>They address major challenges in data storage and management.</li><li>Effective data governance is essential in data lakehouses.</li></ul>
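<p>To make the &quot;best of both&quot; idea concrete, here is a minimal, hypothetical PySpark sketch of lakehouse-style storage. It assumes a Spark build with the Delta Lake extension available on the classpath; any open table format would illustrate the same point:</p><pre><code class="language-python"># Sketch only: assumes the delta-spark jars are available to this Spark build.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config(&quot;spark.sql.extensions&quot;, &quot;io.delta.sql.DeltaSparkSessionExtension&quot;)
    .config(&quot;spark.sql.catalog.spark_catalog&quot;,
            &quot;org.apache.spark.sql.delta.catalog.DeltaCatalog&quot;)
    .getOrCreate())

# Warehouse-style guarantee: an ACID, schema-enforced write...
spark.range(100).write.format(&quot;delta&quot;).mode(&quot;overwrite&quot;).save(&quot;/tmp/events&quot;)

# ...on lake-style storage: plain files in cheap storage, readable anywhere.
spark.read.format(&quot;delta&quot;).load(&quot;/tmp/events&quot;).count()</code></pre>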
<!--kg-card-begin: html-->
<div id="post-table-of-contents" max-depth="h2"></div>
<!--kg-card-end: html-->
<h2 id="what-is-a-data-lakehouse">What is a Data Lakehouse?</h2><figure class="kg-card kg-embed-card"><iframe width="200" height="113" src="https://www.youtube.com/embed/FAnR4R5JMM8?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen title="What is the Data Lakehouse?"></iframe></figure><h3 id="how-does-a-lakehouse-operate">How Does a Lakehouse Operate?</h3><p>In essence, a lakehouse combines features of data lakes and data warehouses. We gain the scalability and cost advantages of a data lake while benefiting from the management and performance of a warehouse. This design enables us to carry out analytics on both structured and unstructured data within a single framework. By removing isolated data storage, lakehouses facilitate better flow and integration.</p><h3 id="tracing-the-origin-of-relational-databases">Tracing the Origin of Relational Databases</h3><figure class="kg-card kg-image-card"><img src="/blog/content/images/2024/09/The-Rise-of-Relational-Databases.webp" class="kg-image" alt="Data Lakehouse: Transforming Enterprise Data Management" loading="lazy" width="1792" height="1024" srcset="/blog/content/images/size/w600/2024/09/The-Rise-of-Relational-Databases.webp 600w,/blog/content/images/size/w1000/2024/09/The-Rise-of-Relational-Databases.webp 1000w,/blog/content/images/size/w1600/2024/09/The-Rise-of-Relational-Databases.webp 1600w,/blog/content/images/2024/09/The-Rise-of-Relational-Databases.webp 1792w" sizes="(min-width: 720px) 720px"></figure><p>Understanding the significance of a lakehouse requires a look back at the evolution of&#xA0;data management. In the 1980s, as businesses recognized the importance of insights, there emerged a need for systems that could handle extensive data. This transition led to the development of relational databases. 
They revolutionized data management by introducing SQL and ensuring data integrity with ACID properties.</p><h3 id="understanding-transaction-processing">Understanding Transaction Processing</h3><p>At its core, transaction processing manages real-time data alterations. This involves inserting, updating, or removing data swiftly and accurately. Such systems guarantee that changes are executed correctly, or no alterations occur if an error arises. This reliability is vital for critical business applications where data precision must be maintained.</p><h3 id="from-warehouses-to-new-horizons">From Warehouses to New Horizons</h3><p>Initially, data warehouses were tailored for fixed data formats. They excelled at detailed analytics but struggled as diverse data sources emerged. Their rigid structure proved expensive and inefficient for agile&#xA0;<a href="https://ilum.cloud/blog/?ref=blog.ilum.cloud">data analytics</a>&#xA0;needs. As businesses expanded, so did their data requirements, prompting the advent of large-scale data storage solutions.</p><h3 id="the-arrival-of-data-lakes">The Arrival of Data Lakes</h3><figure class="kg-card kg-image-card"><img src="/blog/content/images/2024/09/Introduction-of-Data-Lakes.webp" class="kg-image" alt="Data Lakehouse: Transforming Enterprise Data Management" loading="lazy" width="1792" height="1024" srcset="/blog/content/images/size/w600/2024/09/Introduction-of-Data-Lakes.webp 600w,/blog/content/images/size/w1000/2024/09/Introduction-of-Data-Lakes.webp 1000w,/blog/content/images/size/w1600/2024/09/Introduction-of-Data-Lakes.webp 1600w,/blog/content/images/2024/09/Introduction-of-Data-Lakes.webp 1792w" sizes="(min-width: 720px) 720px"></figure><p>Data lakes transformed how extensive data collections were managed. These solutions allowed organizations to store vast raw data without immediate organization, catering to diverse inputs like web logs and IoT feeds. 
A key advantage was the low cost of storage, although maintaining&#xA0;<a href="https://ilum.cloud/product/features?ref=blog.ilum.cloud">data quality</a>&#xA0;and reliability were challenges that arose.</p><h3 id="what-is-a-data-lake">What is a Data Lake?</h3><p>A data lake serves as a vast repository where raw data is stored until needed. Unlike warehouses requiring pre-organization, data lakes adopt a &quot;schema-on-read&quot; approach. This flexibility is beneficial for data scientists and analysts, allowing examination and interpretation without fixed structures.</p><h3 id="benefits-of-large-data-repositories">Benefits of Large Data Repositories</h3><ul><li><strong>Scalability</strong>: They manage substantial data without significant infrastructure changes.</li><li><strong>Cost Efficiency</strong>: Storage in data lakes is more affordable, reducing operational expenses.</li><li><strong>Diverse Data Support</strong>: They accommodate structured, semi-structured, and unstructured data effectively, making them versatile for various analytics needs.</li></ul><p>By evolving from traditional systems while incorporating the versatility of lakes, the lakehouse concept provides a modern approach to managing and analyzing data, merging the best of both foundational methods.</p><h2 id="recap-from-data-lake-to-data-swamp">Recap: From Data Lake to Data Swamp</h2><figure class="kg-card kg-image-card"><img src="/blog/content/images/2024/09/From-Data-Lake-to-Data-Swamp.webp" class="kg-image" alt="Data Lakehouse: Transforming Enterprise Data Management" loading="lazy" width="1792" height="1024" srcset="/blog/content/images/size/w600/2024/09/From-Data-Lake-to-Data-Swamp.webp 600w,/blog/content/images/size/w1000/2024/09/From-Data-Lake-to-Data-Swamp.webp 1000w,/blog/content/images/size/w1600/2024/09/From-Data-Lake-to-Data-Swamp.webp 1600w,/blog/content/images/2024/09/From-Data-Lake-to-Data-Swamp.webp 1792w" sizes="(min-width: 720px) 720px"></figure><p>Building a good data 
lakehouse definitely has its challenges. In the beginning, businesses were all in on data lakes, thinking they&#x2019;d be the magic solution to all their storage problems. But without proper management, these lakes can turn into data swamps, where it&#x2019;s way harder to dig out anything useful.</p><h3 id="what-exactly-is-a-data-swamp">What Exactly is a Data Swamp?</h3><p>When businesses first embraced data lakes, they hoped for an ideal solution to their storage issues. But without proper structure and oversight, these data lakes can become chaotic data collections, or swamps. In such a state, finding useful information becomes a challenge. Here are some of the problems:</p><ul><li><strong>Duplicate Data</strong>: Copies of data can accumulate, leading to confusion and higher storage costs.</li><li><strong>Poor Data Quality</strong>: Inaccurate data leads to wrong decisions, impacting overall business performance.</li><li><strong>Regulatory Issues</strong>: Mismanaged data can mean failing to meet legal&#xA0;<a href="https://ilum.cloud/resources/support?ref=blog.ilum.cloud">data protection</a>&#xA0;standards.</li></ul><p>Data silos and data staleness often emerge from these disorganized repositories, leading to isolated datasets and outdated information which further hamper our ability to make timely decisions.</p><h3 id="characteristics-of-a-data-lakehouse">Characteristics of a Data Lakehouse</h3><figure class="kg-card kg-image-card"><img src="/blog/content/images/2024/09/The-Significance-of-Data-Lakehouse.webp" class="kg-image" alt="Data Lakehouse: Transforming Enterprise Data Management" loading="lazy" width="1792" height="1024" srcset="/blog/content/images/size/w600/2024/09/The-Significance-of-Data-Lakehouse.webp 600w,/blog/content/images/size/w1000/2024/09/The-Significance-of-Data-Lakehouse.webp 1000w,/blog/content/images/size/w1600/2024/09/The-Significance-of-Data-Lakehouse.webp 1600w,/blog/content/images/2024/09/The-Significance-of-Data-Lakehouse.webp 
1792w" sizes="(min-width: 720px) 720px"></figure><p>To counter these issues, the data lakehouse concept emerged, offering a more balanced approach to data management. This system allows us to store vast amounts of raw data, providing flexibility for analysts and data scientists. Unlike older systems, it aligns with modern data science and machine learning needs, facilitating advanced analytics.</p><p>The data lakehouse combines elements from both data lakes and warehouses. Let&#x2019;s explore its features:</p><ul><li><strong>Reliable Transactions</strong>: Supports transactions, ensuring data is accurate and dependable.</li><li><strong>Structured Data</strong>: Uses schema enforcement to keep data organized and reliable.</li><li><strong>Separate Storage and Processing</strong>: Decouples storage and compute, optimizing efficiency.</li><li><strong>Flexible Formats</strong>: Compatible with open table formats like Delta, Iceberg, and Hudi.</li><li><strong>Versatile Data Handling</strong>: Handles structured, semi-structured, and unstructured data.</li><li><strong>Real-Time Streaming</strong>: Fully supports streaming, enabling up-to-date analytics.</li></ul><p>These features address the limitations of traditional systems, allowing us to work with data more effectively. 
By capitalizing on these strengths, we can position ourselves well in an increasingly data-driven world.</p><h2 id="data-governance-in-data-lakehouses">Data Governance in Data Lakehouses</h2><figure class="kg-card kg-image-card"><img src="/blog/content/images/2024/09/Data-Governance-in-Data-Lakehouses.webp" class="kg-image" alt="Data Lakehouse: Transforming Enterprise Data Management" loading="lazy" width="1792" height="1024" srcset="/blog/content/images/size/w600/2024/09/Data-Governance-in-Data-Lakehouses.webp 600w,/blog/content/images/size/w1000/2024/09/Data-Governance-in-Data-Lakehouses.webp 1000w,/blog/content/images/size/w1600/2024/09/Data-Governance-in-Data-Lakehouses.webp 1600w,/blog/content/images/2024/09/Data-Governance-in-Data-Lakehouses.webp 1792w" sizes="(min-width: 720px) 720px"></figure><p>Data governance in a lakehouse setup is crucial for maintaining accuracy, accessibility, and security, while also complying with regulations. We ensure that our data remains reliable by focusing on several aspects:</p><ul><li><strong>Data Catalog</strong>: We organize all data and metadata, allowing for easy discovery and retrieval.</li><li><strong>Accountability and Quality</strong>: Our&#xA0;<a href="https://ilum.cloud/product/about-us?ref=blog.ilum.cloud">data stewards</a>&#xA0;are responsible for maintaining data quality and consistency.</li><li><strong>Controlled Access</strong>: By implementing role-based access, we make sure only authorized individuals can view sensitive information.</li></ul><p>These practices help us maintain a flexible and interoperable data environment, ensuring privacy and consistency.</p><h2 id="comparing-data-lakehouses-and-data-warehouses">Comparing Data Lakehouses and Data Warehouses</h2><p>The architecture of a data lakehouse offers unique advantages over traditional data warehouses. 
While warehouses are tailored for structured data and excel in analytics, lakehouses provide flexibility by allowing both structured and unstructured data to coexist. This approach gives organizations the ability to leverage diverse data types efficiently.</p><p><strong>Key Differences:</strong></p><ul><li><strong>Data Storage:</strong>&#xA0;Warehouses require data to be structured before storage, while lakehouses can keep raw data, processing it as needed.</li><li><strong>Query Performance:</strong>&#xA0;Warehouses excel in complex structured data queries, whereas lakehouses support varied data types with faster queries using tools like Apache Spark.</li><li><strong>Cost:</strong>&#xA0;Lakehouses often use economical storage, reducing costs compared to the high-performance storage required by warehouses.</li><li><strong>Scalability:</strong>&#xA0;Lakehouses scale easily with additional storage nodes, unlike warehouses that have scalability limits as data sizes increase.</li></ul><h3 id="schema-evolution-in-data-lakehouses">Schema Evolution in Data Lakehouses</h3><figure class="kg-card kg-image-card kg-card-hascaption"><img src="/blog/content/images/2024/09/Schema-Evolution-in-Data-Lakehouses.webp" class="kg-image" alt="Data Lakehouse: Transforming Enterprise Data Management" loading="lazy" width="1792" height="1024" srcset="/blog/content/images/size/w600/2024/09/Schema-Evolution-in-Data-Lakehouses.webp 600w,/blog/content/images/size/w1000/2024/09/Schema-Evolution-in-Data-Lakehouses.webp 1000w,/blog/content/images/size/w1600/2024/09/Schema-Evolution-in-Data-Lakehouses.webp 1600w,/blog/content/images/2024/09/Schema-Evolution-in-Data-Lakehouses.webp 1792w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Ilum - Schema Evolution in Data Lakehouses</span></figcaption></figure><p>Schema evolution is very important because it lets businesses adjust their data setup without messing up their current workflows. 
And honestly, in today&#x2019;s fast-moving data world, that kind of flexibility is a must.</p><h3 id="embracing-new-standards">Embracing New Standards</h3><p>Previously, changing database schemas, such as adding columns or altering structures, was complicated and could lead to downtime. With lakehouses, schema changes are straightforward and built into the system. This enables our teams to adapt quickly to new data requirements, maintaining efficient operations.</p><h3 id="making-the-system-effective">Making the System Effective</h3><ul><li><strong>Version Control:</strong>&#xA0;We track dataset versions to accommodate changes while supporting older formats.</li><li><strong>Automated Schema Recognition:</strong>&#xA0;Employing tools that detect schema alterations ensures our&#xA0;<a href="https://ilum.cloud/hadoop-migration?ref=blog.ilum.cloud">data processing</a>&#xA0;workflows remain fluid.</li><li><strong>Data Scrutiny:</strong>&#xA0;By implementing validation rules, we ensure any incoming data conforms to expected formats, preventing processing issues.</li></ul><p>Using these strategies, we can make our data systems more responsive and robust, handling the evolving demands of data management effectively.</p><h2 id="keeping-your-data-secure-and-ready-why-its-important">Keeping Your Data Secure and Ready: Why It&apos;s Important</h2><figure class="kg-card kg-image-card"><img src="/blog/content/images/2024/10/ilum-data-safety.webp" class="kg-image" alt="Data Lakehouse: Transforming Enterprise Data Management" loading="lazy" width="1792" height="1024" srcset="/blog/content/images/size/w600/2024/10/ilum-data-safety.webp 600w,/blog/content/images/size/w1000/2024/10/ilum-data-safety.webp 1000w,/blog/content/images/size/w1600/2024/10/ilum-data-safety.webp 1600w,/blog/content/images/2024/10/ilum-data-safety.webp 1792w" sizes="(min-width: 720px) 720px"></figure><h3 id="the-role-of-cloud-storage">The Role of Cloud Storage</h3><p>Cloud object storage plays a vital role in 
ensuring our data stays safe and accessible. This type of storage keeps our digital assets&#x2014;whether structured business data or varied media files&#x2014;well-organized and secure. Features such as backups and versioning are essential because they offer peace of mind. If any data becomes corrupted or lost, we can swiftly restore it, helping us avoid potential disruptions.</p><h3 id="flexible-open-data-formats">Flexible Open Data Formats</h3><p>Open data standards are crucial for data flexibility. By using formats like Parquet or ORC, we ensure our data remains adaptable. This way, we&apos;re not tied to a single tool or provider, which means we can adjust our systems as needed. This flexibility is key to making sure our data can be utilized efficiently across different platforms and tools.</p><h3 id="business-benefits-of-reliable-data-management">Business Benefits of Reliable Data Management</h3><p>A well-structured data environment using cloud object storage and open formats is advantageous for any business. It guarantees our business data is both secure and accessible when needed. Whether we manage structured data sets or varied media content, we gain the flexibility and reliability necessary for our operations. As our business evolves or the volume of data grows, having a setup that adapts to these changes is essential. This approach ensures we can keep pace with our data needs and maintain smooth business operations.</p><h2 id="the-future-of-data-lakehouses">The Future of Data Lakehouses</h2><p>Data architecture is continuing to grow and adapt to the increasing demands of data analytics and data science. 
As more companies dive into AI and machine learning, having a solid and flexible data setup is going to be crucial.</p><h3 id="connecting-with-ai-and-machine-learning">Connecting with AI and Machine Learning</h3><figure class="kg-card kg-image-card"><img src="/blog/content/images/2024/09/The-Future-of-Data-Lakehouses.webp" class="kg-image" alt="Data Lakehouse: Transforming Enterprise Data Management" loading="lazy" width="1792" height="1024" srcset="/blog/content/images/size/w600/2024/09/The-Future-of-Data-Lakehouses.webp 600w,/blog/content/images/size/w1000/2024/09/The-Future-of-Data-Lakehouses.webp 1000w,/blog/content/images/size/w1600/2024/09/The-Future-of-Data-Lakehouses.webp 1600w,/blog/content/images/2024/09/The-Future-of-Data-Lakehouses.webp 1792w" sizes="(min-width: 720px) 720px"></figure><p>Data lakehouses provide a strong foundation for tasks like&#xA0;<a href="https://ilum.cloud/get-access?ref=blog.ilum.cloud">machine learning</a>. By merging structured and unstructured data on a single platform, we can streamline the workflow of data scientists. This setup helps in both developing and deploying machine learning models effectively, enhancing our&#xA0;data science&#xA0;capabilities.</p><h3 id="what-lies-ahead">What Lies Ahead?</h3><p>With ongoing tech progress, data lakehouses will continue to evolve. We anticipate enhancements such as automated data governance, improved security measures, and performance-boosting tools. 
These updates will reinforce the role of data lakehouses in&#xA0;<a href="https://www.dataversity.net/data-lakehouses-the-future-of-data-migration/?ref=blog.ilum.cloud">modern data strategies</a>, ensuring they remain integral to our efforts in managing and analyzing data efficiently.</p><figure class="kg-card kg-image-card"><a href="https://ilum.cloud/?ref=blog.ilum.cloud"><img src="/blog/content/images/2024/09/ilum-logo2-2.svg" class="kg-image" alt="Data Lakehouse: Transforming Enterprise Data Management" loading="lazy" width="147" height="58"></a></figure><h2 id="why-ilum-is-a-perfect-example-of-a-well-defined-data-lakehouse">Why Ilum is a Perfect Example of a Well-Defined Data Lakehouse</h2><figure class="kg-card kg-embed-card"><iframe width="200" height="113" src="https://www.youtube.com/embed/6d27js-FNHU?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen title="Ilum - Modular Data Lakehouse for a Cloud Native World"></iframe></figure><p>Ilum embodies what a data lakehouse should be, harmonizing the versatility of data lakes with the comprehensive control of data warehouses. Let&apos;s delve into the reasons why Ilum stands out in this space.</p><ul><li><strong>Unified Multi-Cluster Management</strong><br>Our platform simplifies the management of multiple Spark clusters whether they are cloud-based or on-premise. This feature ensures seamless data handling across different environments.</li><li><strong>Kubernetes and Hadoop Flexibility</strong><br>Ilum supports both Kubernetes and Hadoop Yarn, offering businesses the choice to manage their Spark clusters in a way that suits them best. 
This flexibility empowers companies to transition from traditional Hadoop setups to modern, cloud-native environments, adapting to today&apos;s technology-driven landscape.</li><li><strong>Interactive Spark Sessions and&#xA0;</strong><a href="https://ilum.cloud/docs/api/?ref=blog.ilum.cloud"><strong>REST API</strong></a><br>By utilizing our REST API for Spark jobs, Ilum enhances interactivity, allowing for real-time data operations. This not only elevates the data platform experience but also enables the creation of dynamic applications that respond instantly to user requests&#x2014;an essential feature for advanced data lakehouses.</li><li><strong>Open-Source and Free Accessibility</strong><br>A remarkable trait of Ilum is its&#xA0;<a href="https://ilum.cloud/pricing?ref=blog.ilum.cloud">cost-efficiency</a>, as it is available at no expense. Utilizing open-source tools such as Apache Spark, Jupyter, and Apache Ranger, Ilum avoids vendor lock-in, making it an attractive option for startups and enterprises alike to explore data lakehouse architecture without hefty costs.</li></ul><p>The strengths of Ilum lie in its scalability, flexibility, real-time interactivity, and affordability. It caters to those who seek a well-architected data lakehouse that doesn&apos;t compromise performance or governance. Embracing Ilum&apos;s advanced features empowers us to fully leverage the potential of a modern data lakehouse solution, truly blending the benefits of both data lakes and warehouses.</p><h2 id="frequently-asked-questions">Frequently Asked Questions</h2><h3 id="what-are-the-main-components-of-a-data-lakehouse">What are the Main Components of a Data Lakehouse?</h3><p>Data lakehouses combine elements of both data lakes and data warehouses. 
Key components include a storage layer that handles large volumes of structured and unstructured data, a processing layer for executing data queries and transformations, and a management layer to maintain data organization and governance.</p><h3 id="how-does-data-lakehouse-performance-compare-to-traditional-data-warehouses">How Does Data Lakehouse Performance Compare to Traditional Data Warehouses?</h3><p>Data lakehouses often have enhanced performance due to their capability to handle diverse data types and perform complex queries. They integrate the flexible storage from data lakes with the efficient query performance of data warehouses, offering a balanced approach to data storage and computation.</p><h3 id="what-are-the-advantages-of-using-a-data-lakehouse-for-data-analysis">What are the Advantages of Using a Data Lakehouse for Data Analysis?</h3><p>Using a data lakehouse can streamline data analytics by providing a single platform that supports both storage and analytics. This integration reduces data movement and duplication, enabling faster insights and more efficient data management. Moreover, data lakehouses offer scalability and flexibility, essential for handling large data sets.</p><h3 id="what-tools-and-technologies-are-common-in-building-a-data-lakehouse">What Tools and Technologies Are Common in Building a Data Lakehouse?</h3><p>Common tools include Apache Spark for processing large data sets and Delta Lake for offering reliable data indexing and version control. Technologies like cloud storage services and data governance tools are integral in managing large-scale data lakehouses efficiently.</p><h3 id="how-do-data-lakehouses-manage-data-security-and-governance">How Do Data Lakehouses Manage Data Security and Governance?</h3><p>Data governance and security are managed by implementing robust authentication protocols, encryption techniques, and data masking. 
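As a toy illustration of the masking part only (an invented helper, not how Ilum or governance tools such as Apache Ranger actually apply policies), masking typically hides most of a sensitive value while keeping a recognizable suffix:

```python
def mask(value: str, visible: int = 4, fill: str = "*") -> str:
    """Hide all but the last `visible` characters of a sensitive value.

    Illustrative only: production systems enforce masking through
    governance policies, not ad-hoc helper functions.
    """
    if len(value) <= visible:
        return fill * len(value)  # too short to reveal anything safely
    return fill * (len(value) - visible) + value[-visible:]
```

For example, `mask("4111111111111111")` yields `************1111`, enough to confirm a card number without exposing it.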
This ensures that only authorized users can access sensitive information, safeguarding the data integrity and privacy within the lakehouse environment.</p><h3 id="when-is-a-data-lakehouse-preferred-over-a-data-lake">When is a Data Lakehouse Preferred Over a Data Lake?</h3><p>A data lakehouse is preferred when there is a need to support both analytics workloads and traditional operational query workloads on diverse data types. It is ideal for organizations requiring a unified system that reduces data silos and simplifies data management processes.</p>]]></content:encoded></item><item><title><![CDATA[Deploying PySpark Microservice on Kubernetes: Revolutionizing Data Lakes with Ilum.]]></title><description><![CDATA[<p></p><p>Greetings Ilum enthusiasts and Python fans! We&apos;re thrilled to unveil a new, eagerly expected feature that&apos;s set to empower your data science journey - full Python support in Ilum. For those in the data world, Python and Apache Spark have long been an iconic duo, seamlessly</p>]]></description><link>https://blog.ilum.cloud/deploying-pyspark-microservices-on-kubernetes-revolutionizing-data-lakes/</link><guid isPermaLink="false">64bee5dba5976e0001aaa6f4</guid><category><![CDATA[News]]></category><dc:creator><![CDATA[Ilum]]></dc:creator><pubDate>Thu, 27 Jul 2023 13:00:55 GMT</pubDate><media:content url="https://blog.ilum.cloud/content/images/2023/07/ilum-ferret3.png" medium="image"/><content:encoded><![CDATA[<img src="https://blog.ilum.cloud/content/images/2023/07/ilum-ferret3.png" alt="Deploying PySpark Microservice on Kubernetes: Revolutionizing Data Lakes with Ilum."><p></p><p>Greetings Ilum enthusiasts and Python fans! We&apos;re thrilled to unveil a new, eagerly expected feature that&apos;s set to empower your data science journey - full Python support in Ilum. For those in the data world, Python and Apache Spark have long been an iconic duo, seamlessly handling vast volumes of data and complex computations. 
And now, with Ilum&apos;s latest upgrade, you can harness the power of Python right inside your favourite data lake environment.</p><p>This blog post is your guided tour to exploring this feature. We&apos;ll kick things off with a simple Apache Spark job written in Python, run it on Ilum, and then dive deeper. We&apos;ll transform the initial code to support an interactive mode, offering you direct access to the Spark job via Ilum&apos;s API. By the end of this journey, you&apos;ll have a Python-based microservice responding to API calls, all running smoothly on Ilum.</p><p>So, are you ready to enhance your data game with Python and Ilum? Let&apos;s get started.</p><p>All examples are available in our <a href="https://github.com/ilum-cloud/ilum-python-examples?ref=blog.ilum.cloud">GitHub repository</a>.</p><h2 id="step-1-writing-a-simple-apache-spark-job-in-python">Step 1: Writing a Simple Apache Spark Job in Python.</h2><p>Before we embark on our Python journey with Ilum, we need to ensure our environment is well-equipped. To run a Spark job, you need to have Ilum and PySpark installed. You can use pip, the Python package installer, to set up PySpark. Make sure you&apos;re using Python &gt;=3.9.</p><pre><code class="language-bash">pip install pyspark</code></pre><p>For setting up and accessing Ilum, please follow the guidelines provided <a href="https://ilum.cloud/blog/spark-on-kubernetes/?ref=blog.ilum.cloud">here</a>.</p><h3 id="11-sparkpi-example">1.1 SparkPi example.</h3><p>Now, let&apos;s dive into writing our Spark job. We&apos;ll start with a simple SparkPi example:</p><pre><code class="language-python">import sys
from random import random
from operator import add

from pyspark.sql import SparkSession

if __name__ == &quot;__main__&quot;:
    spark = SparkSession \
        .builder \
        .appName(&quot;PythonPi&quot;) \
        .getOrCreate()

    partitions = int(sys.argv[1]) if len(sys.argv) &gt; 1 else 2
    n = 100000 * partitions

    def f(_: int) -&gt; float:
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 &lt;= 1 else 0

    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print(&quot;Pi is roughly %f&quot; % (4.0 * count / n))

    spark.stop()
</code></pre><p>Save this script as <strong>ilum_python_simple.py</strong></p><p>With our Spark job ready, it&apos;s time to run it on Ilum. Ilum offers the capability to submit jobs using the Ilum UI or through the REST API.</p><p>Let&apos;s start with the UI, using the <strong>single job feature</strong>.</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2023/07/ilum_python_single.png" class="kg-image" alt="Deploying PySpark Microservice on Kubernetes: Revolutionizing Data Lakes with Ilum." loading="lazy" width="490" height="810"></figure><p>We can achieve the same thing with the <a href="https://ilum.cloud/docs/api/?ref=blog.ilum.cloud#tag/Jobs/operation/stop%20job">API</a>, but first, we need to expose the ilum-core API with a port forward.</p><pre><code class="language-bash">kubectl port-forward svc/ilum-core 9888:9888</code></pre><p>With the port exposed, we can make an API call.</p><figure class="kg-card kg-code-card"><pre><code class="language-bash">curl -X POST &apos;localhost:9888/api/v1/job/submit&apos; \
        --form &apos;name=&quot;ilumSimplePythonJob&quot;&apos; \
        --form &apos;clusterName=&quot;default&quot;&apos; \
        --form &apos;jobClass=&quot;ilum_python_simple&quot;&apos; \
        --form &apos;args=&quot;10&quot;&apos; \
        --form &apos;pyFiles=@&quot;/path/to/ilum_python_simple.py&quot;&apos; \
        --form &apos;language=&quot;PYTHON&quot;&apos;
</code></pre><figcaption><p><span style="white-space: pre-wrap;">API call</span></p></figcaption></figure><p>As a result, we will receive the ID of the created job.</p><figure class="kg-card kg-code-card"><pre><code class="language-json">{&quot;jobId&quot;:&quot;20230724-1154-m78f3gmlo5j&quot;}</code></pre><figcaption><p><span style="white-space: pre-wrap;">Result</span></p></figcaption></figure><p>To check the logs of the job, we can make an API call to</p><figure class="kg-card kg-code-card"><pre><code class="language-bash">curl localhost:9888/api/v1/job/20230724-1154-m78f3gmlo5j/logs</code></pre><figcaption><p><span style="white-space: pre-wrap;">API call</span></p></figcaption></figure><p>And that&apos;s it! You&apos;ve written and run a simple Python Spark job on Ilum. Let&apos;s look at a slightly more advanced example that needs additional Python libraries.</p><h3 id="12-job-example-with-numpy">1.2 Job example with numpy.</h3><p>In this section, we&apos;ll go over a practical example of a Spark job written in Python. This job involves reading a dataset, processing it, training a machine learning model on it, and saving the predictions. We&apos;re going to use a <strong>Tel-churn.csv</strong> file, which you can find in our <a href="https://github.com/ilum-cloud/ilum-python-examples/blob/main/Tel-churn.csv?ref=blog.ilum.cloud">GitHub repository</a>. To make things easy, we&apos;ve uploaded this file to a bucket named ilum-files in the built-in instance of MinIO, which is automatically accessible from the Ilum instance. This means you won&apos;t have to worry about configuring any access for this example - Ilum has got it covered. However, if you ever want to fetch data from a different bucket or use Amazon S3 in your own projects, you&apos;ll need to configure access accordingly.</p><p>Now that we&apos;ve got our data ready, let&apos;s get started with writing our Spark job in Python. 
Here is the full code example:</p><pre><code class="language-python">from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

if __name__ == &quot;__main__&quot;:

    spark = SparkSession \
        .builder \
        .appName(&quot;IlumAdvancedPythonExample&quot;) \
        .getOrCreate()
    
    df = spark.read.csv(&apos;s3a://ilum-files/Tel-churn.csv&apos;, header=True, inferSchema=True)

    categoricalColumns = [&apos;gender&apos;, &apos;Partner&apos;, &apos;Dependents&apos;, &apos;PhoneService&apos;, &apos;MultipleLines&apos;, &apos;InternetService&apos;,
                          &apos;OnlineSecurity&apos;, &apos;OnlineBackup&apos;, &apos;DeviceProtection&apos;, &apos;TechSupport&apos;, &apos;StreamingTV&apos;,
                          &apos;StreamingMovies&apos;, &apos;Contract&apos;, &apos;PaperlessBilling&apos;, &apos;PaymentMethod&apos;]

    stages = []

    for categoricalCol in categoricalColumns:
        stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + &quot;Index&quot;)
        stages += [stringIndexer]

    label_stringIdx = StringIndexer(inputCol=&quot;Churn&quot;, outputCol=&quot;label&quot;)
    stages += [label_stringIdx]

    numericCols = [&apos;SeniorCitizen&apos;, &apos;tenure&apos;, &apos;MonthlyCharges&apos;]

    assemblerInputs = [c + &quot;Index&quot; for c in categoricalColumns] + numericCols
    assembler = VectorAssembler(inputCols=assemblerInputs, outputCol=&quot;features&quot;)
    stages += [assembler]

    pipeline = Pipeline(stages=stages)
    pipelineModel = pipeline.fit(df)
    df = pipelineModel.transform(df)

    train, test = df.randomSplit([0.7, 0.3], seed=42)

    lr = LogisticRegression(featuresCol=&quot;features&quot;, labelCol=&quot;label&quot;, maxIter=10)
    lrModel = lr.fit(train)

    predictions = lrModel.transform(test)

    predictions.select(&quot;customerID&quot;, &quot;label&quot;, &quot;prediction&quot;).show(5)
    predictions.select(&quot;customerID&quot;, &quot;label&quot;, &quot;prediction&quot;).write.option(&quot;header&quot;, &quot;true&quot;) \
        .csv(&apos;s3a://ilum-files/predictions&apos;)

    spark.stop()
</code></pre><p>Let&apos;s dive into the code:</p><pre><code class="language-python">from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
</code></pre><p>Here, we&apos;re importing the necessary PySpark modules to create a Spark session, build a machine learning pipeline, preprocess the data, and run a Logistic Regression model.</p><pre><code class="language-python">spark = SparkSession \
    .builder \
    .appName(&quot;IlumAdvancedPythonExample&quot;) \
    .getOrCreate()
</code></pre><p>We initialize a <code>SparkSession</code>, which is the entry point to any functionality in Spark. This is where we set the application name that will appear on the Spark web UI.</p><pre><code class="language-python">df = spark.read.csv(&apos;s3a://ilum-files/Tel-churn.csv&apos;, header=True, inferSchema=True)
</code></pre><p>We&apos;re reading a CSV file stored in a MinIO bucket. The <code>header=True</code> option tells Spark to use the first row of the CSV file as headers, while <code>inferSchema=True</code> makes Spark automatically determine the data type of each column.</p><pre><code class="language-python">categoricalColumns = [&apos;gender&apos;, &apos;Partner&apos;, &apos;Dependents&apos;, &apos;PhoneService&apos;, &apos;MultipleLines&apos;, &apos;InternetService&apos;,
                      &apos;OnlineSecurity&apos;, &apos;OnlineBackup&apos;, &apos;DeviceProtection&apos;, &apos;TechSupport&apos;, &apos;StreamingTV&apos;,
                      &apos;StreamingMovies&apos;, &apos;Contract&apos;, &apos;PaperlessBilling&apos;, &apos;PaymentMethod&apos;]
</code></pre><p>We specify the columns in our data that are categorical. These will be transformed later using a StringIndexer.</p><pre><code class="language-python">stages = []

for categoricalCol in categoricalColumns:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + &quot;Index&quot;)
    stages += [stringIndexer]
</code></pre><p>Here, we&apos;re iterating over our list of categorical columns and creating a StringIndexer for each. StringIndexers encode categorical string columns into a column of indices. The transformed column will be named after the original column, with &quot;Index&quot; appended.</p><pre><code class="language-python">numericCols = [&apos;SeniorCitizen&apos;, &apos;tenure&apos;, &apos;MonthlyCharges&apos;]

assemblerInputs = [c + &quot;Index&quot; for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol=&quot;features&quot;)
stages += [assembler]
</code></pre><p>Here we prepare the data for our machine learning model. We create a VectorAssembler which will take all our feature columns (both categorical and numerical) and assemble them into a single vector column. This is a requirement for most machine learning algorithms in Spark.</p><pre><code class="language-python">train, test = df.randomSplit([0.7, 0.3], seed=42)
</code></pre><p>We split our data into a training set and a test set, with 70% of the data for training and the remaining 30% for testing.</p><pre><code class="language-python">lr = LogisticRegression(featuresCol=&quot;features&quot;, labelCol=&quot;label&quot;, maxIter=10)
lrModel = lr.fit(train)
</code></pre><p>We train a Logistic Regression model on our training data.</p><pre><code class="language-python">predictions = lrModel.transform(test)

predictions.select(&quot;customerID&quot;, &quot;label&quot;, &quot;prediction&quot;).show(5)
predictions.select(&quot;customerID&quot;, &quot;label&quot;, &quot;prediction&quot;).write.option(&quot;header&quot;, &quot;true&quot;) \
    .csv(&apos;s3a://ilum-files/predictions&apos;)</code></pre><p>Lastly, we use our trained model to make predictions on our test set, displaying the first 5 predictions. Then we write these predictions back to our MinIO bucket.</p><p>Save this script as <strong>ilum_python_advanced.py</strong></p><p>pyspark.ml uses numpy as a dependency, which is not installed by default, so we need to specify it as a requirement.</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2023/07/ilum-advanced-python-single.png" class="kg-image" alt="Deploying PySpark Microservice on Kubernetes: Revolutionizing Data Lakes with Ilum." loading="lazy" width="488" height="806"></figure><p>And the same thing can be done through the API.</p><figure class="kg-card kg-code-card"><pre><code class="language-bash">curl -X POST &apos;localhost:9888/api/v1/job/submit&apos; \
        --form &apos;name=&quot;IlumAdvancedPythonExample&quot;&apos; \
        --form &apos;clusterName=&quot;default&quot;&apos; \
        --form &apos;jobClass=&quot;ilum_python_advanced&quot;&apos; \
        --form &apos;pyRequirements=&quot;numpy&quot;&apos; \
        --form &apos;pyFiles=@&quot;/path/to/ilum_python_advanced.py&quot;&apos; \
        --form &apos;language=&quot;PYTHON&quot;&apos;
</code></pre><figcaption><p><span style="white-space: pre-wrap;">API call</span></p></figcaption></figure><p>In the next sections, we&apos;ll transform both Python scripts into interactive Spark jobs, taking full advantage of Ilum&apos;s capabilities.</p><h2 id="step-2-transitioning-to-interactive-mode">Step 2: Transitioning to Interactive Mode</h2><p>Interactive mode is an exciting feature that makes Spark development more dynamic, giving you the capability to run, interact with, and control your Spark jobs in real time. It&apos;s designed for those who seek more direct control over their Spark applications.</p><p>Think of Interactive mode as having a direct conversation with your Spark job. You can feed in data, request transformations, and fetch results - all in real time. This drastically enhances the agility and capability of your data processing pipeline, making it more adaptable and responsive to changing requirements.</p><p>Now that we&apos;re familiar with creating a basic Spark job in Python, let&apos;s take things a step further by transforming our job into an interactive one that can take advantage of Ilum&apos;s real-time capabilities.</p><h3 id="21-sparkpi-example">2.1 SparkPi example.</h3><p>To illustrate how to transition our job to Interactive mode, we will adjust our earlier <strong>ilum_python_simple.py</strong> script.</p><pre><code class="language-python">from random import random
from operator import add

from ilum.api import IlumJob


class SparkPiInteractiveExample(IlumJob):

    def run(self, spark, config):
        partitions = int(config.get(&apos;partitions&apos;, &apos;5&apos;))
        n = 100000 * partitions

        def f(_: int) -&gt; float:
            x = random() * 2 - 1
            y = random() * 2 - 1
            return 1 if x ** 2 + y ** 2 &lt;= 1 else 0

        count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)

        return &quot;Pi is roughly %f&quot; % (4.0 * count / n)
</code></pre><p>Save this as <strong>ilum_python_simple_interactive.py</strong></p><p>There are just a few differences from the original SparkPi.</p><p>1.<strong>	Ilum package</strong></p><p>To start off, we import the <code>IlumJob</code> class from the ilum package, which serves as a base class for our interactive job.</p><p>We can install the ilum package with:</p><pre><code class="language-bash">pip install ilum</code></pre><p>2.	<strong>Spark job in a class</strong></p><p>The Spark job logic is encapsulated in a class that extends <code>IlumJob</code>, particularly within its <code>run</code> method.</p><pre><code class="language-python">class SparkPiInteractiveExample(IlumJob):
    def run(self, spark, config):
        # Job logic here
</code></pre><p>Wrapping the job logic in a class is essential for the Ilum framework to handle the job and its resources. This also makes the job stateless and reusable.</p><p>3.<strong>	Parameters are handled differently:</strong></p><p>We take all arguments from the config dictionary:</p><pre><code class="language-python">partitions = int(config.get(&apos;partitions&apos;, &apos;5&apos;))
</code></pre><p>This shift allows for more dynamic parameter passing and integrates with Ilum&apos;s configuration handling.</p><p>4.	<strong>The result is returned instead of printed:</strong></p><p>The result is returned from the <code>run</code> method.</p><pre><code class="language-python">return &quot;Pi is roughly %f&quot; % (4.0 * count / n)
</code></pre><p>By returning the result, Ilum can handle it in a more flexible way. For instance, Ilum could serialize the result and make it accessible via an API call.</p><p>5.	<strong>No need to manually manage the Spark session</strong></p><p>Ilum manages the Spark session for us. It&apos;s automatically injected into the <code>run</code> method, and we don&apos;t need to stop it manually.</p><pre><code class="language-python">def run(self, spark, config):
</code></pre><p>These changes highlight the transition from a standalone Spark job to an interactive Ilum job. The goal is to improve the flexibility and reusability of the job, making it more suited for dynamic, interactive, and on-the-fly computations.</p><p>Adding an interactive Spark job is handled with the &apos;new group&apos; function.</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2023/07/SparkPiInteractiveExample.png" class="kg-image" alt="Deploying PySpark Microservice on Kubernetes: Revolutionizing Data Lakes with Ilum." loading="lazy" width="490" height="863"></figure><p>The job is then executed with the interactive job function in the UI.<br>The class name should be specified in the form <code>pythonFileName.PythonClassImplementingIlumJob</code>.</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2023/07/interactive-pi.png" class="kg-image" alt="Deploying PySpark Microservice on Kubernetes: Revolutionizing Data Lakes with Ilum." loading="lazy" width="1123" height="826" srcset="/blog/content/images/size/w600/2023/07/interactive-pi.png 600w,/blog/content/images/size/w1000/2023/07/interactive-pi.png 1000w,/blog/content/images/2023/07/interactive-pi.png 1123w" sizes="(min-width: 720px) 720px"></figure><p>We can achieve the same thing with the <a href="https://ilum.cloud/docs/api/?ref=blog.ilum.cloud#tag/Jobs/operation/stop%20job">API</a>.<br><br>1.	Creating a group</p><figure class="kg-card kg-code-card"><pre><code class="language-bash">curl -X POST &apos;localhost:9888/api/v1/group&apos; \
        --form &apos;name=&quot;SparkPiInteractiveExample&quot;&apos; \
        --form &apos;kind=&quot;JOB&quot;&apos; \
        --form &apos;clusterName=&quot;default&quot;&apos; \
        --form &apos;pyFiles=@&quot;/path/to/ilum_python_simple_interactive.py&quot;&apos; \
        --form &apos;language=&quot;PYTHON&quot;&apos;
</code></pre><figcaption><p><span style="white-space: pre-wrap;">API call</span></p></figcaption></figure><figure class="kg-card kg-code-card"><pre><code class="language-json">{&quot;groupId&quot;:&quot;20230726-1638-mjrw3&quot;}</code></pre><figcaption><p><span style="white-space: pre-wrap;">Result</span></p></figcaption></figure><p>2.	Job execution</p><figure class="kg-card kg-code-card"><pre><code class="language-bash">curl -X POST &apos;localhost:9888/api/v1/group/20230726-1638-mjrw3/job/execute&apos; \
	-H &apos;Content-Type: application/json&apos; \
	-d &apos;{ &quot;jobClass&quot;:&quot;ilum_python_simple_interactive.SparkPiInteractiveExample&quot;, &quot;jobConfig&quot;: {&quot;partitions&quot;:&quot;10&quot;}, &quot;type&quot;:&quot;interactive_job_execute&quot;}&apos;</code></pre><figcaption><p><span style="white-space: pre-wrap;">API call</span></p></figcaption></figure><figure class="kg-card kg-code-card"><pre><code class="language-json">{
   &quot;jobInstanceId&quot;:&quot;20230726-1638-mjrw3-a1srahhu&quot;,
   &quot;jobId&quot;:&quot;20230726-1638-mjrw3-wwt5a&quot;,
   &quot;groupId&quot;:&quot;20230726-1638-mjrw3&quot;,
   &quot;startTime&quot;:1690390323154,
   &quot;endTime&quot;:1690390325200,
   &quot;jobClass&quot;:&quot;ilum_python_simple_interactive.SparkPiInteractiveExample&quot;,
   &quot;jobConfig&quot;:{
      &quot;partitions&quot;:&quot;10&quot;
   },
   &quot;result&quot;:&quot;Pi is roughly 3.149400&quot;,
   &quot;error&quot;:null
}</code></pre><figcaption><p><span style="white-space: pre-wrap;">Result</span></p></figcaption></figure><h3 id="22-job-example-with-numpy">2.2 Job example with numpy.</h3><p>Let&apos;s look at our second example.</p><pre><code class="language-python">from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

from ilum.api import IlumJob


class LogisticRegressionJobExample(IlumJob):

    def run(self, spark_session: SparkSession, config: dict) -&gt; str:
        df = spark_session.read.csv(config.get(&apos;inputFilePath&apos;, &apos;s3a://ilum-files/Tel-churn.csv&apos;), header=True,
                                    inferSchema=True)

        categoricalColumns = [&apos;gender&apos;, &apos;Partner&apos;, &apos;Dependents&apos;, &apos;PhoneService&apos;, &apos;MultipleLines&apos;, &apos;InternetService&apos;,
                              &apos;OnlineSecurity&apos;, &apos;OnlineBackup&apos;, &apos;DeviceProtection&apos;, &apos;TechSupport&apos;, &apos;StreamingTV&apos;,
                              &apos;StreamingMovies&apos;, &apos;Contract&apos;, &apos;PaperlessBilling&apos;, &apos;PaymentMethod&apos;]

        stages = []

        for categoricalCol in categoricalColumns:
            stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + &quot;Index&quot;)
            stages += [stringIndexer]

        label_stringIdx = StringIndexer(inputCol=&quot;Churn&quot;, outputCol=&quot;label&quot;)
        stages += [label_stringIdx]

        numericCols = [&apos;SeniorCitizen&apos;, &apos;tenure&apos;, &apos;MonthlyCharges&apos;]

        assemblerInputs = [c + &quot;Index&quot; for c in categoricalColumns] + numericCols
        assembler = VectorAssembler(inputCols=assemblerInputs, outputCol=&quot;features&quot;)
        stages += [assembler]

        pipeline = Pipeline(stages=stages)
        pipelineModel = pipeline.fit(df)
        df = pipelineModel.transform(df)

        train, test = df.randomSplit([float(config.get(&apos;splitX&apos;, &apos;0.7&apos;)), float(config.get(&apos;splitY&apos;, &apos;0.3&apos;))],
                                     seed=int(config.get(&apos;seed&apos;, &apos;42&apos;)))

        lr = LogisticRegression(featuresCol=&quot;features&quot;, labelCol=&quot;label&quot;, maxIter=int(config.get(&apos;maxIter&apos;, &apos;5&apos;)))
        lrModel = lr.fit(train)

        predictions = lrModel.transform(test)

        return &apos;{}&apos;.format(predictions.select(&quot;customerID&quot;, &quot;label&quot;, &quot;prediction&quot;).limit(
            int(config.get(&apos;rowLimit&apos;, &apos;5&apos;))).toJSON().collect())
</code></pre><p>1.	<strong>We wrap the job in a class, just like in the previous example:</strong></p><pre><code class="language-python">class LogisticRegressionJobExample(IlumJob):
    def run(self, spark_session: SparkSession, config: dict) -&gt; str:
        # Job logic here
</code></pre><p>Again, the job logic is encapsulated in the <code>run</code> method of a class extending <code>IlumJob</code>, helping Ilum to handle the job efficiently.</p><p>2.	<strong>All parameters, including those for the data pipeline (like file paths and Logistic Regression hyperparameters), are obtained from the <code>config</code> dictionary:</strong></p><pre><code class="language-python">df = spark_session.read.csv(config.get(&apos;inputFilePath&apos;, &apos;s3a://ilum-files/Tel-churn.csv&apos;), header=True, inferSchema=True)
train, test = df.randomSplit([float(config.get(&apos;splitX&apos;, &apos;0.7&apos;)), float(config.get(&apos;splitY&apos;, &apos;0.3&apos;))], seed=int(config.get(&apos;seed&apos;, &apos;42&apos;)))
lr = LogisticRegression(featuresCol=&quot;features&quot;, labelCol=&quot;label&quot;, maxIter=int(config.get(&apos;maxIter&apos;, &apos;5&apos;)))
</code></pre><p>By centralizing all parameters in one place, Ilum provides a uniform, consistent way of configuring and tuning the job.</p><p>The result of the job, rather than being written to a specific location, is returned as a JSON string:</p><pre><code class="language-python">return &apos;{}&apos;.format(predictions.select(&quot;customerID&quot;, &quot;label&quot;, &quot;prediction&quot;).limit(int(config.get(&apos;rowLimit&apos;, &apos;5&apos;))).toJSON().collect())
</code></pre><p>This allows for more dynamic and flexible handling of the job result, which could then be processed further or exposed via an API, depending on the needs of the application.</p><p>This code perfectly showcases how we can seamlessly integrate PySpark jobs with Ilum to enable interactive, API-driven data processing pipelines. From simple examples like Pi approximation to more complex cases like Logistic Regression, Ilum&apos;s interactive jobs are versatile, adaptable, and efficient.</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2023/07/LogisticRegressionJobExample.png" class="kg-image" alt="Deploying PySpark Microservice on Kubernetes: Revolutionizing Data Lakes with Ilum." loading="lazy" width="1126" height="827" srcset="/blog/content/images/size/w600/2023/07/LogisticRegressionJobExample.png 600w,/blog/content/images/size/w1000/2023/07/LogisticRegressionJobExample.png 1000w,/blog/content/images/2023/07/LogisticRegressionJobExample.png 1126w" sizes="(min-width: 720px) 720px"></figure><h2 id="step-3-making-your-spark-job-a-microservice"><br>Step 3: Making Your Spark Job a Microservice</h2><p>Microservices bring in a paradigm shift from the traditional monolithic application structure to a more modular and agile approach. By breaking down a complex application into small, loosely coupled services, it becomes easier to build, maintain, and scale each service independently based on specific requirements. When applied to our Spark job, this means we could create a robust data processing service that could be scaled, managed, and updated without affecting other parts of our application stack.</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2023/07/ilum_spark_microservice.gif" class="kg-image" alt="Deploying PySpark Microservice on Kubernetes: Revolutionizing Data Lakes with Ilum." 
loading="lazy" width="1200" height="560" srcset="/blog/content/images/size/w600/2023/07/ilum_spark_microservice.gif 600w,/blog/content/images/size/w1000/2023/07/ilum_spark_microservice.gif 1000w,/blog/content/images/2023/07/ilum_spark_microservice.gif 1200w" sizes="(min-width: 720px) 720px"></figure><p>The power of turning your Spark job into a microservice lies in its versatility, scalability, and real-time interaction capabilities. A microservice is an independently deployable component of an application that runs as a separate process. It communicates with other components via well-defined APIs, giving you the freedom to design, develop, deploy, and scale each microservice independently.</p><p>In the context of Ilum, an interactive Spark job can be treated as a microservice. The job&apos;s &apos;run&apos; method acts as an API endpoint. Each time you call this method via Ilum&apos;s API, you&apos;re making a request to this microservice. This opens up the potential for real-time interactions with your Spark job.</p><p>You can make requests to your microservice from various applications or scripts, fetching data, and processing results on the fly. Moreover, it opens up an opportunity to build more complex, service-oriented architectures around your data processing pipelines.</p><p>One key advantage of this setup is scalability. Through the Ilum UI or API, you can scale your job (microservice) up or down based on the load or the computational complexity. You don&apos;t need to worry about manually managing resources or load balancing. Ilum&#x2019;s internal load balancer will distribute API calls between instances of your Spark job, ensuring efficient resource utilization.</p><figure class="kg-card kg-image-card kg-width-wide"><img src="/blog/content/images/2023/07/ilum_scale_interactive_group-1.png" class="kg-image" alt="Deploying PySpark Microservice on Kubernetes: Revolutionizing Data Lakes with Ilum." 
loading="lazy" width="1770" height="541" srcset="/blog/content/images/size/w600/2023/07/ilum_scale_interactive_group-1.png 600w,/blog/content/images/size/w1000/2023/07/ilum_scale_interactive_group-1.png 1000w,/blog/content/images/size/w1600/2023/07/ilum_scale_interactive_group-1.png 1600w,/blog/content/images/2023/07/ilum_scale_interactive_group-1.png 1770w" sizes="(min-width: 1200px) 1200px"></figure><p>Keep in mind that the actual processing time of the job depends on the complexity of the Spark job and the resources allocated to it. However, with the scalability provided by Kubernetes, you can easily scale up your resources as your job&apos;s requirements grow.</p><p>This combination of Ilum, Apache Spark, and microservices brings about a new, agile way to process your data - efficiently, scalably, and responsively!</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2023/07/ilum-ferret-5-1.png" class="kg-image" alt="Deploying PySpark Microservice on Kubernetes: Revolutionizing Data Lakes with Ilum." loading="lazy" width="500" height="491"></figure><h2 id="the-game-changer-in-data-microservice-architecture">The Game-Changer in Data Microservice Architecture</h2><p>We&apos;ve come a long way since we started this journey of transforming a simple Python Apache Spark job into a full-blown microservice using Ilum. We saw how easy it was to write a Spark job, adapt it to work in interactive mode, and ultimately expose it as a microservice with the help of Ilum&apos;s robust API. Along the way, we leveraged the power of Python, the capabilities of Apache Spark, and the flexibility and scalability of Ilum. This combination has not only transformed our data processing capabilities but also changed the way we think about data architecture.</p><p>The journey doesn&apos;t stop here. With full Python support on Ilum, a new world of possibilities opens up for data processing and analytics. 
As we continue to build and improve on Ilum, we&apos;re excited about the future possibilities that Python brings to our platform. We believe that with Python and Ilum together, we&apos;re just at the beginning of redefining what&apos;s possible in the world of data microservice architecture.</p><p>Join us on this exciting journey, and let&apos;s shape the future of data processing together!</p>]]></content:encoded></item><item><title><![CDATA[How to optimize your Spark Cluster with Interactive Spark Jobs]]></title><description><![CDATA[This post describes how to decrease your Apache Spark job execution time. It covers the Ilum key feature of building real-time job interaction.]]></description><link>https://blog.ilum.cloud/how-to-optimize-your-spark-cluster-with-interactive-spark-jobs/</link><guid isPermaLink="false">631c7cecb1575600013b8617</guid><category><![CDATA[News]]></category><dc:creator><![CDATA[Ilum]]></dc:creator><pubDate>Wed, 14 Sep 2022 12:24:13 GMT</pubDate><media:content url="https://blog.ilum.cloud/content/images/2022/09/ilum_spark_on_kubernetes.png" medium="image"/><content:encoded><![CDATA[<img src="https://blog.ilum.cloud/content/images/2022/09/ilum_spark_on_kubernetes.png" alt="How to optimize your Spark Cluster with Interactive Spark Jobs"><p>In this article, you will learn:</p><ul><li>How to decrease your spark job execution time</li><li>What is an interactive job in Ilum</li><li>How to run an interactive spark job</li><li>Differences between running a spark job using Ilum API and Spark API</li></ul><h3 id="ilum-job-types">Ilum job types</h3><p>There are three types of jobs you can run in Ilum: <strong>single job</strong>, <strong>interactive job</strong> and <strong>interactive code</strong>. In this article, we&apos;ll focus on the <strong>interactive job</strong> type. 
However, it&apos;s important to know the differences between the three types of jobs, so let&apos;s take a quick overview of each one.</p><p>With <strong>single jobs</strong>, you submit pre-compiled programs. They allow you to run a Spark application on the cluster without any interaction during runtime. In this mode, you have to send a compiled jar to Ilum, which is used to launch a single job. You can either send it directly, or you can use AWS credentials to get it from an S3 bucket. A typical example of a single job would be a data preparation task.</p><p>Ilum also provides an <strong>interactive</strong> <strong>code mode</strong>, which allows you to submit commands at runtime. This is useful for tasks where you need to interact with the data, such as exploratory data analysis.</p><h3 id="interactive-job">Interactive job</h3><p>Interactive jobs have long-running sessions, where you can send job instance data to be executed right away. The killer feature of this mode is that you don&#x2019;t have to wait for the Spark context to be initialized. Users pointing to the same job ID interact with the same Spark context. 
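<p>In practice, this means you can send repeated calculation requests to one warm group over Ilum&apos;s REST API. The following is a minimal Python sketch of that pattern, assuming the API is port-forwarded on <code>localhost:9888</code> and using the <code>/api/v1/group/{groupId}/job/execute</code> endpoint from our PySpark tutorial; the helper names (<code>build_execute_payload</code>, <code>execute_interactive</code>) are ours, not part of Ilum:</p>

```python
import json
from urllib import request

ILUM_URL = "http://localhost:9888"  # assumes ilum-ui/API is port-forwarded here


def build_execute_payload(job_class: str, job_config: dict) -> dict:
    """Request body for the interactive execute endpoint."""
    return {
        "jobClass": job_class,
        "jobConfig": job_config,
        "type": "interactive_job_execute",
    }


def execute_interactive(group_id: str, job_class: str, job_config: dict) -> dict:
    """POST one calculation request to an already-running interactive group."""
    req = request.Request(
        f"{ILUM_URL}/api/v1/group/{group_id}/job/execute",
        data=json.dumps(build_execute_payload(job_class, job_config)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())


# Repeated calls against the same group ID hit the same warm Spark context,
# so only the first request pays the context start-up cost, e.g.:
# for slices in ("5", "10", "20"):
#     result = execute_interactive("20230726-1638-mjrw3",
#                                  "interactive.job.example.InteractiveJobExample",
#                                  {"slices": slices})
#     print(result["result"])
```

<p>Each call returns the same kind of JSON document the curl example produces, including the <code>result</code> and <code>error</code> fields.</p>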
Ilum wraps Spark application logic into a long-running Spark job which is able to handle calculation requests immediately, without the need to wait for Spark context initialization.</p><figure class="kg-card kg-image-card kg-width-wide"><img src="/blog/content/images/2022/11/spark-job-metrics.png" class="kg-image" alt="How to optimize your Spark Cluster with Interactive Spark Jobs" loading="lazy" width="1763" height="957" srcset="/blog/content/images/size/w600/2022/11/spark-job-metrics.png 600w,/blog/content/images/size/w1000/2022/11/spark-job-metrics.png 1000w,/blog/content/images/size/w1600/2022/11/spark-job-metrics.png 1600w,/blog/content/images/2022/11/spark-job-metrics.png 1763w" sizes="(min-width: 1200px) 1200px"></figure><h3 id="starting-an-interactive-job">Starting an interactive job</h3><p>Let&#x2019;s take a look at how Ilum&#x2019;s interactive session can be started. The first thing we have to do is to set up Ilum. You can do it easily with minikube. A tutorial covering Ilum installation is available at this <a href="https://ilum.cloud/blog/spark-on-kubernetes/?ref=blog.ilum.cloud">link</a>. In the next step, we have to create a jar file which contains an implementation of Ilum&apos;s job interface. To use the Ilum job API, we have to add it to the project with a dependency manager such as Maven or Gradle. In this example, we will use Scala code, built with Gradle, to calculate Pi. </p><p><u>The full example is available on our </u><a href="https://github.com/ilum-cloud/interactive-job-example?ref=blog.ilum.cloud"><u>GitHub</u></a><u>.</u></p><p><u>If you prefer not to build it yourself, you can find the compiled jar file </u><a href="https://ilum.cloud/release/latest/ilum-interactive-spark-pi.jar?ref=blog.ilum.cloud" rel="noreferrer"><u>here</u></a><u>.</u></p><p>The first step is to create a folder for our project and change the directory into it.</p><pre><code>$ mkdir interactive-job-example
$ cd interactive-job-example</code></pre><p>If you don&#x2019;t have the newest version of Gradle installed on your computer, you can check how to do it <a href="https://docs.gradle.org/current/userguide/installation.html?ref=blog.ilum.cloud">here</a>. Then run the following command in a terminal from inside the project directory:</p><pre><code>$ gradle init</code></pre><p>Choose a Scala application with Groovy as DSL. The output should look like this:<br></p><pre><code>Starting a Gradle Daemon (subsequent builds will be faster)

Select type of project to generate:
  1: basic
  2: application
  3: library
  4: Gradle plugin
Enter selection (default: basic) [1..4] 2

Select implementation language:
  1: C++
  2: Groovy
  3: Java
  4: Kotlin
  5: Scala
  6: Swift
Enter selection (default: Java) [1..6] 5

Split functionality across multiple subprojects?:
  1: no - only one application project
  2: yes - application and library projects
Enter selection (default: no - only one application project) [1..2] 1

Select build script DSL:
  1: Groovy
  2: Kotlin
Enter selection (default: Groovy) [1..2] 1

Generate build using new APIs and behavior (some features may change in the next minor release)? (default: no) [yes, no] no                           
Project name (default: interactive-job-example): 
Source package (default: interactive.job.example): 

&gt; Task :init
Get more help with your project: https://docs.gradle.org/7.5.1/samples/sample_building_scala_applications_multi_project.html

BUILD SUCCESSFUL in 30s
2 actionable tasks: 2 executed</code></pre><p>Now we have to add the Ilum repository and necessary dependencies into your <strong>build.gradle</strong> file. In this tutorial, we will use Scala 2.12.</p><pre><code>
dependencies {
    implementation &apos;org.scala-lang:scala-library:2.12.16&apos;
    implementation &apos;cloud.ilum:ilum-job-api:5.0.1&apos;
    compileOnly &apos;org.apache.spark:spark-sql_2.12:3.1.2&apos;
}</code></pre><p>Now we can create a Scala class that extends Ilum&#x2019;s Job and which calculates PI:</p><pre><code class="language-Scala">package interactive.job.example

import cloud.ilum.job.Job
import org.apache.spark.sql.SparkSession
import scala.math.random

class InteractiveJobExample extends Job {

  override def run(sparkSession: SparkSession, config: Map[String, Any]): Option[String] = {

    val slices = config.getOrElse(&quot;slices&quot;, &quot;2&quot;).toString.toInt
    val n = math.min(100000L * slices, Int.MaxValue).toInt
    val count = sparkSession.sparkContext.parallelize(1 until n, slices).map { i =&gt;
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x * x + y * y &lt;= 1) 1 else 0
    }.reduce(_ + _)
    Some(s&quot;Pi is roughly ${4.0 * count / (n - 1)}&quot;)
  }
}
</code></pre><p>If Gradle has generated some main or test classes, just remove them from the project and run the build:</p><pre><code>$ gradle build</code></pre><p>The generated jar file should be in &apos;<strong>./interactive-job-example/app/build/libs/app.jar</strong>&apos;. We can then switch back to Ilum. Once all pods are running, set up a port forward for ilum-ui:</p><pre><code>kubectl port-forward svc/ilum-ui 9777:9777</code></pre><p>Open the Ilum UI in your browser and create a new group:</p><figure class="kg-card kg-image-card kg-width-wide"><img src="/blog/content/images/2022/12/ilum_spark_ui_5_0.png" class="kg-image" alt="How to optimize your Spark Cluster with Interactive Spark Jobs" loading="lazy" width="1898" height="763" srcset="/blog/content/images/size/w600/2022/12/ilum_spark_ui_5_0.png 600w,/blog/content/images/size/w1000/2022/12/ilum_spark_ui_5_0.png 1000w,/blog/content/images/size/w1600/2022/12/ilum_spark_ui_5_0.png 1600w,/blog/content/images/2022/12/ilum_spark_ui_5_0.png 1898w" sizes="(min-width: 1200px) 1200px"></figure><p>Enter a name for the group, choose or create a cluster, upload your jar file, and apply the changes:</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/11/ilum-create-interactive-spark-job.png" class="kg-image" alt="How to optimize your Spark Cluster with Interactive Spark Jobs" loading="lazy" width="490" height="733"></figure><p>Ilum will create a Spark driver pod, and you can control the number of Spark executor pods by scaling them. 
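<p>Before executing on the cluster, the Monte Carlo logic inside the Scala class above can be sanity-checked locally. This plain-Python mirror of it is only an illustrative sketch (the function name and point count are ours, and nothing Ilum requires):</p>

```python
import random


def estimate_pi(slices: int, points_per_slice: int = 100_000) -> float:
    """Mirror of the Scala job's logic: sample random points in the unit
    square and count the fraction falling inside the unit circle."""
    n = slices * points_per_slice
    count = sum(
        1
        for _ in range(n)
        if random.uniform(-1, 1) ** 2 + random.uniform(-1, 1) ** 2 <= 1
    )
    # The hit ratio approximates (circle area) / (square area) = pi / 4.
    return 4.0 * count / n


print(estimate_pi(10))  # prints a value close to 3.14
```

<p>More slices means more sampled points, which is exactly why passing a larger <code>slices</code> value to the interactive job tightens the estimate.</p>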
After the Spark container is ready, let&#x2019;s execute the job:</p><figure class="kg-card kg-image-card kg-width-wide"><img src="/blog/content/images/2022/12/ilum-interactive-spark-execute-job.png" class="kg-image" alt="How to optimize your Spark Cluster with Interactive Spark Jobs" loading="lazy" width="1788" height="563" srcset="/blog/content/images/size/w600/2022/12/ilum-interactive-spark-execute-job.png 600w,/blog/content/images/size/w1000/2022/12/ilum-interactive-spark-execute-job.png 1000w,/blog/content/images/size/w1600/2022/12/ilum-interactive-spark-execute-job.png 1600w,/blog/content/images/2022/12/ilum-interactive-spark-execute-job.png 1788w" sizes="(min-width: 1200px) 1200px"></figure><p>Now we have to put the canonical name of our Scala class</p><pre><code>interactive.job.example.InteractiveJobExample</code></pre><p> and define the slices parameter in JSON format:</p><pre><code>{
  &quot;config&quot;: {
    &quot;slices&quot;: &quot;10&quot;
  }
}</code></pre><p>You should see the outcome right after the job starts.</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/11/ilum-spark-interactive-job-result.png" class="kg-image" alt="How to optimize your Spark Cluster with Interactive Spark Jobs" loading="lazy" width="1124" height="749" srcset="/blog/content/images/size/w600/2022/11/ilum-spark-interactive-job-result.png 600w,/blog/content/images/size/w1000/2022/11/ilum-spark-interactive-job-result.png 1000w,/blog/content/images/2022/11/ilum-spark-interactive-job-result.png 1124w" sizes="(min-width: 720px) 720px"></figure><p>You can change the parameters and rerun the job, and your calculations will occur on the spot.</p><h3 id="interactive-and-single-job-comparison">Interactive and single job comparison</h3><p>In Ilum you can also run a single job. The most important difference compared to interactive mode is that you don&#x2019;t have to implement the Job API. We can use the SparkPi jar from Spark examples:</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/11/ilum-spark-ui-simple-job.png" class="kg-image" alt="How to optimize your Spark Cluster with Interactive Spark Jobs" loading="lazy" width="480" height="649"></figure><p>Running a job like this is also quick, but interactive jobs are <strong>20 times faster (4s vs 200ms)</strong>. If you would like to start a similar job with other parameters, you will have to prepare a new job and upload the jar again.<br></p><h3 id="ilum-and-plain-apache-spark-comparison">Ilum and plain Apache Spark comparison</h3><p><br>I&apos;ve set up Apache Spark locally with a <a href="https://hub.docker.com/r/bitnami/spark?ref=blog.ilum.cloud">bitnami/spark</a> docker image. If you would also like to run Spark on your machine, you can use docker-compose:</p><pre><code>$ curl -LO https://raw.githubusercontent.com/bitnami/containers/main/bitnami/spark/docker-compose.yml
$ docker-compose up</code></pre><p>Once Spark is running, you should be able to go to localhost:8080 and see the admin UI. We need to get the Spark URL from the browser:</p><figure class="kg-card kg-image-card"><img src="https://lh3.googleusercontent.com/mNjqIQKqLj9y5aFGmRpWGCBq-61UjUPnXqlySHTvxqUbLkAfNGuDES1wTdZ05rcQ4wX6Bvsq5ZxOkkgspyM-Ibx0ps59u8OliNT15coAtYRZwnd4hnlldcKpGh377vQQ2dGYvL_CgympKf1lYzhCOOBMHxznAVfoM2u4zAgIW3TMerDAB0ixhSE-1g" class="kg-image" alt="How to optimize your Spark Cluster with Interactive Spark Jobs" loading="lazy" width="642" height="58"></figure><p>Then, we have to open the Spark container in interactive mode:</p><pre><code>$ docker exec -it &lt;containerid&gt; bash</code></pre><figure class="kg-card kg-image-card"><img src="https://lh4.googleusercontent.com/bsDA4-3VFNJ_gn_KcgPC7kZ1LwpugblM-I7JDeLjciQFSzdzUrRntgLidVK9uIADUxH-bgz9puxGdRKA-0BwNpqlsD7iVljrw-4BIoZNUBrRO-0Nw-8kWX7XIwDzsZPHg9NRhbiYqS9Be12fli5e2hQy8Wqd6wNUOYKlGR-GxUhqwohZzp7hWq4P2g" class="kg-image" alt="How to optimize your Spark Cluster with Interactive Spark Jobs" loading="lazy" width="432" height="17"></figure><p>Now, inside the container, we can submit the SparkPi job. In this case, we will use SparkPi from the examples jar and, as the master parameter, put the URL from the browser:</p><pre><code>$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi\
  --master spark://78c84485d233:7077 \
  /opt/bitnami/spark/examples/jars/spark-examples_2.12-3.3.0.jar\
  10</code></pre><h3 id="summary">Summary</h3><p>As you can see in the example above, you can avoid the complicated configuration and installation of your Spark client by using Ilum. Ilum takes over the work and provides you with a simple and convenient interface. Moreover, it allows you to overcome the limitations of Apache Spark, which can take a very long time to initialize. If you have to do many job executions with similar logic but different parameters and would like to have calculations done immediately, you should definitely use interactive job mode.</p><figure class="kg-card kg-image-card"><img src="/blog/content/images/2022/09/ilum-spark-ferret-1.png" class="kg-image" alt="How to optimize your Spark Cluster with Interactive Spark Jobs" loading="lazy" width="1024" height="1007" srcset="/blog/content/images/size/w600/2022/09/ilum-spark-ferret-1.png 600w,/blog/content/images/size/w1000/2022/09/ilum-spark-ferret-1.png 1000w,/blog/content/images/2022/09/ilum-spark-ferret-1.png 1024w" sizes="(min-width: 720px) 720px"></figure><h3 id="similarities-with-apache-livy">Similarities with Apache Livy</h3><p>Ilum is a cloud-native tool for managing Apache Spark deployments on Kubernetes. It is similar to Apache Livy in terms of functionality - it can control a Spark Session over REST API and build a real-time interaction with a Spark Cluster. However, Ilum is designed specifically for modern, cloud-native environments.</p><p>We used Apache Livy in the past, but we have reached the point where Livy was just not suitable for modern environments. <strong>Livy is obsolete</strong> compared to Ilum. In 2018, we started moving all our environments to Kubernetes, and we had to find a way to deploy, monitor and maintain Apache Spark on Kubernetes. 
This was the perfect occasion to build Ilum.</p><div class="kg-card kg-button-card kg-align-center"><a href="https://ilum.cloud/?ref=blog.ilum.cloud" class="kg-btn kg-btn-accent">Try it, it&apos;s free</a></div><p><br></p>]]></content:encoded></item></channel></rss>