How to run Apache Spark on Kubernetes in less than 5 minutes

You can have Apache Spark up and running on Kubernetes in under 5 minutes.

This article is the first in a series called “Kubernetes: The perfect platform for Spark applications”. Over the next several weeks, we will cover installing, running, managing, and monitoring Spark on K8s.

Today we will show how to get up and running with Apache Spark on Kubernetes. There are several ways to do this, but most of them are complex and require a lot of configuration. We will use Ilum because it sets up the whole cluster for us. In a future article, we will compare this approach with the Spark operator.

Spark on Kubernetes with Ilum

Ilum is a Kubernetes-native solution for deploying and managing Apache Spark clusters. It provides a simple API for defining and managing Spark clusters, and it handles the dependencies for you.

With Ilum, you can deploy Spark clusters in minutes, and get started running Spark applications immediately. Ilum makes it easy to scale Spark clusters up or down and to manage multiple Spark clusters from a single UI.

If you're new to Apache Spark or Kubernetes, Ilum is a great way to get started.

Quick start

We'll assume that you have a Kubernetes cluster up and running. If you don't, you can set up a local cluster with minikube. Check how to install minikube first.

Set up a Kubernetes cluster

minikube start --cpus 4 --memory 8192 --addons metrics-server

Once you have a Kubernetes cluster up and running, you can install Ilum with just a few commands:

Spark on K8s with Ilum

helm repo add ilum https://charts.ilum.cloud
helm install ilum ilum/ilum

This will install Ilum into your Kubernetes cluster. It should take around 2 minutes to initialize.
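
To check that initialization has finished, you can wait for the pods to become ready. The following is a minimal sketch, assuming Ilum was installed into the default namespace (the helm install command above does not set one) and that every pod in that namespace is long-running. The function name wait_for_pods is illustrative and not part of Ilum.

```shell
# Block until every pod in a namespace reports Ready, or fail after a timeout.
# Assumption: all pods in the namespace are long-running. Pods that belong to
# completed Jobs never become Ready and would make this wait time out.
wait_for_pods() {
  namespace="${1:-default}"      # namespace to watch (assumed: default)
  timeout_seconds="${2:-300}"    # how long to wait before giving up
  kubectl wait --for=condition=Ready pods --all \
    --namespace "$namespace" --timeout="${timeout_seconds}s"
}
```

Calling wait_for_pods default 300 returns once all pods are ready. At any time, kubectl get pods shows the current state of each pod.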

Once Ilum is installed, you can reach the UI by port-forwarding the service and opening localhost:9777.

kubectl port-forward svc/ilum-ui 9777:9777

That's all. Your Kubernetes cluster is now configured to run Spark jobs. Ilum provides a simple API and UI for submitting Spark applications. You can also use the good old spark-submit.
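
For reference, a plain spark-submit against a Kubernetes cluster looks roughly like the sketch below. It only prints the command so that you can review it before running it. The API server address, container image, and jar path are placeholders you must supply (none of them come from Ilum), and print_spark_pi_submit is an illustrative helper name. The flags themselves are standard spark-submit options for a Kubernetes master.

```shell
# Print (do not run) a spark-submit command for the SparkPi example.
#   $1: Kubernetes API server URL, as reported by "kubectl cluster-info"
#   $2: Spark container image used for the driver and executors
#   $3: absolute path of the Spark examples jar inside that image
print_spark_pi_submit() {
  cat <<EOF
spark-submit \\
  --master k8s://$1 \\
  --deploy-mode cluster \\
  --name spark-pi \\
  --class org.apache.spark.examples.SparkPi \\
  --conf spark.executor.instances=2 \\
  --conf spark.kubernetes.container.image=$2 \\
  local://$3
EOF
}

# Example with placeholder values; replace all three before using the output.
print_spark_pi_submit "https://127.0.0.1:6443" "example/spark:3.x" \
  "/opt/spark/examples/jars/spark-examples.jar"
```

The local:// scheme tells Spark that the jar is already present inside the container image, so nothing has to be uploaded at submit time.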

Submit a Spark application on Kubernetes

Let's now start a simple Spark job. We'll use the "SparkPi" example from the Spark documentation. You can use the jar file from this link.

[Image: Ilum simple job]

Ilum will create a Spark driver pod from a Spark 3.x Docker image. You can control the number of Spark executor pods and scale them across multiple nodes.
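
To see the driver and executor pods that a job creates, you can filter on the spark-role label that Spark attaches to the pods it launches on Kubernetes. This is a minimal sketch, assuming the job runs in the default namespace. The helper list_spark_pods is illustrative and not an Ilum command.

```shell
# List Spark pods of one role ("driver" or "executor") in a namespace.
# The wide output includes the node each pod was scheduled on.
list_spark_pods() {
  role="${1:-executor}"        # value of the spark-role label
  namespace="${2:-default}"    # namespace the job runs in (assumed: default)
  kubectl get pods --selector "spark-role=${role}" \
    --namespace "$namespace" --output wide
}
```

For example, list_spark_pods executor shows how many executors are running and on which nodes they landed.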

Ilum is software that makes running Spark on Kubernetes a breeze. It configures the entire cluster for you and provides an interface where you can manage and monitor your Spark cluster. We believe that Spark applications on Kubernetes will shape the future of big data. Running on Kubernetes lets Spark applications handle large amounts of data more efficiently and reliably, which in turn supports more accurate insights and better decision-making.

Migrating from Apache Hadoop YARN

As Apache Hadoop YARN is in deep stagnation, more and more organizations are looking to migrate from YARN to Kubernetes. There are a number of reasons for this, but the most common is that Kubernetes provides a more robust and flexible platform for managing big data workloads.

Migrating from Apache Hadoop YARN to another data processing platform can be a daunting task. There are many factors to consider when making such a switch, including data compatibility, processing speed, and cost. However, with careful planning and execution, the process can be smooth and successful.

Kubernetes is a natural fit for big data workloads because of its ability to scale horizontally. With Hadoop YARN, capacity is typically bound to a fixed pool of cluster nodes. With Kubernetes, you can add nodes to your cluster on demand and remove them when they are no longer needed.

Kubernetes also provides built-in features, such as self-healing and horizontal autoscaling, that YARN does not offer in the same form.

Time to make the switch to Kubernetes?

As the world of big data continues to evolve, so do the tools and technologies used to manage it. For years, Apache Hadoop YARN has been the de facto standard for resource management in big data environments. But with the rise of containerization and orchestration technologies like Kubernetes, is it time to make the switch?

Kubernetes has been gaining popularity as a container orchestration platform, and for good reason. It's flexible, scalable, and relatively easy to use. If you're still using traditional VM-based infrastructure, now might be the time to make the switch to Kubernetes.

If you're working with containers, then you should definitely care about Kubernetes. It can help you manage and deploy your containers more effectively, and it's especially useful if you're working with a lot of containers or if you're deploying your containers to a cloud platform.

Kubernetes is also a strong choice if you want an orchestration tool with major industry backing. It was originally developed at Google, drawing on the company's years of experience running containerized applications, and it is now maintained under the Cloud Native Computing Foundation.

There is no clear winner in the YARN vs. Kubernetes debate. The best solution for your organization will depend on your specific needs and use cases. If you are looking for a more flexible and scalable resource management solution, Kubernetes is worth considering. If you need better support for legacy applications, YARN may be a better option.

Whichever platform you choose, Ilum can help you get the most out of it. Our platform is designed to work with both YARN and Kubernetes, and our team of experts can help you choose and implement the right solution for your organization.

Managed Spark cluster

A managed Spark cluster is a cloud-based solution that makes it easy to provision and manage Spark clusters. It provides a web-based interface for creating and managing Spark clusters, as well as a set of APIs for automating cluster management tasks. Managed Spark clusters are often used by data scientists and developers who want to quickly provision and manage Spark clusters without having to worry about the underlying infrastructure.

Ilum lets you create and manage your own Spark cluster in any environment: cloud, on-premises, or a mixture of both.

The Pros of Apache Spark on Kubernetes

There has been some debate about whether Apache Spark should run on Kubernetes.

Some people argue that Kubernetes is too complex and that Spark should continue to run on its own dedicated cluster manager or on a managed cloud service. Others argue that Kubernetes is the future of big data processing and that Spark should embrace it.

We are in the latter camp: we believe Apache Spark should run on Kubernetes.

The biggest benefit of running Spark on Kubernetes is that scaling Spark applications becomes much easier, because Kubernetes is designed to run large numbers of concurrent containers. If a Spark application needs to process a lot of data, you can deploy more executor containers to the cluster and process the data in parallel. This is much easier than setting up a new Spark cluster on EMR each time you need to scale up your processing.

Kubernetes is also portable. It runs on any major cloud platform (AWS, Google Cloud, Azure, etc.) and on-premises, so you can move your Spark applications from one environment to another without changing your cluster manager.
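
One way to let Spark itself add and remove executor pods as load changes is dynamic allocation. The sketch below prints the relevant spark-submit --conf flags. The helper name and the executor bounds are illustrative. Because Spark on Kubernetes usually runs without an external shuffle service, shuffle tracking is enabled as well, which requires Spark 3.0 or later.

```shell
# Print spark-submit flags that enable dynamic allocation between
# a minimum ($1) and a maximum ($2) number of executors.
dynamic_allocation_flags() {
  min_executors="${1:-1}"     # illustrative default
  max_executors="${2:-10}"    # illustrative default
  printf '%s\n' \
    "--conf spark.dynamicAllocation.enabled=true" \
    "--conf spark.dynamicAllocation.shuffleTracking.enabled=true" \
    "--conf spark.dynamicAllocation.minExecutors=${min_executors}" \
    "--conf spark.dynamicAllocation.maxExecutors=${max_executors}"
}

# Example: allow between 2 and 20 executors.
dynamic_allocation_flags 2 20
```

With these flags, Spark requests more executor pods when tasks queue up and releases idle ones, so the number of containers follows the workload instead of being fixed at submit time.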

Another major benefit is more flexible workflows. For example, if you need to process data from multiple sources, you can deploy different containers for each source and have them all processed in parallel. This is much easier than trying to manage a complex workflow on a single Spark cluster.

Kubernetes has several security features that make it a more attractive option for running Spark applications. For example, Kubernetes supports role-based access control, which allows you to fine-tune who has access to your Spark cluster.
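
As a concrete example of role-based access control, the Spark driver needs permission to create and delete executor pods. The sketch below creates a dedicated service account and binds it to the built-in edit ClusterRole within a single namespace. The account name spark, the derived binding name, and the default namespace are assumptions, and Ilum may already configure equivalent permissions for you.

```shell
# Create a service account for Spark drivers and grant it edit rights
# in one namespace only (a RoleBinding, not a ClusterRoleBinding).
create_spark_service_account() {
  namespace="${1:-default}"    # namespace to grant access in (assumed)
  account="${2:-spark}"        # service account name (assumed)
  kubectl create serviceaccount "$account" --namespace "$namespace" &&
  kubectl create rolebinding "${account}-edit" \
    --clusterrole=edit \
    --serviceaccount="${namespace}:${account}" \
    --namespace "$namespace"
}
```

A job submitted with spark-submit can then use the account through --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark. Scoping the binding to one namespace keeps the driver from touching workloads elsewhere in the cluster.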

So there you have it. These are just some of the reasons why we believe that Apache Spark should run on Kubernetes. If you're not convinced, we encourage you to try it out for yourself. We think you'll be surprised at how well it works.