Chapter 6 Clusters

Previous chapters focused on using Spark over a single computing instance, your personal computer. In this chapter we will introduce techniques to run Spark over multiple computing instances, also known as a computing cluster. This chapter and subsequent ones will introduce and make use of concepts applicable to computing clusters; however, it’s not required to use a computing cluster to follow along, you can still use your personal computer. It’s worth mentioning that while previous chapters focused on single computing instances, all the data analysis and modeling techniques we presented, can also be used in a computing cluster without changing any code.

For those of you who already have a Spark cluster in your organization, you could consider skipping to the next chapter, Connections, which will teach you how to connect to an existing cluster. Otherwise, if you don’t have a cluster or are considering improvements to your existing infrastructure, this chapter will introduce the cluster trends, managers, and providers available today.

6.1 Overview

There are three major trends in cluster computing worth discussing: On-Premise, Cloud computing, and Kubernetes. Framing these trends over time will help us understand how they came to be, what they are, and what their future might be. To illustrate this, Figure 6.1 plots these trends over time using data from Google trends.

Google trends for on-premise (mainframe), cloud computing and Kubernetes

FIGURE 6.1: Google trends for on-premise (mainframe), cloud computing and Kubernetes

For on-premise clusters, yourself or someone in your organization purchased physical computers that were intended to be used for cluster computing. The computers in this cluster are made of off-the-shelf hardware, meaning that someone placed an order to purchase computers usually found in stores shelves or, high-performance hardware, meaning that a computing vendor provided highly customized computing hardware which also comes optimized for high-performance network connectivity, power consumption, etc. When purchasing hundreds or thousands of computing instances, it doesn’t make sense to keep them in the usual computing case that we are all familiar with, instead, it makes sense to stack them as efficiently as possible on top of each other to minimize the space the use. This group of efficiently stacked computing instances is known as a rack. Once a cluster grows to thousands of computers, you will also need to host hundreds of racks of computing devices, at this scale, you would also need significant physical space to hosts those racks. A building that provides racks of computing instances is usually known as a data center. At the scale of a data center, you would also need to find ways to make the building more efficient, specially the cooling system, power supplies, network connectivity, and so on. Since this is time consuming, a few organization have come together to open source their infrastructure under the Open Compute Project initiative, which provides a set of data center blueprints free for anyone to use.

There is nothing preventing you from building our own data center and in fact, many organizations have followed this path. For instance, Amazon started as an online book store, over the years Amazon grew to sell much more than just books and, with its online store growth, their data centers also grew in size. In 2002, Amazon considered renting servers in their data centers to the public, two year laters, Amazon Web Services launched as a way to let anyone rent servers in their data centers on-demand, meaning that, one did not have to purchase, configure, maintain nor teardown it’s own clusters but could rather rent them from Amazon directly.

This on-demand compute model is what we know today as Cloud Computing. In the cloud, the cluster you use is not owned by you and it’s neither in your physical building, but rather, it’s a data center owned and managed by someone else. Today, there are many cloud providers in this space ranging from Amazon, Databricks, IBM, Google, Microsoft and many others. Most cloud computing platforms provide a user interface either through a web application and command line to request and manage resources.

While the benefits of processing data in the cloud were obvious for many years, picking a cloud provider had the unintended side-effect of locking organizations with one particular provider, making it hard to switch between providers or back to on-premise clusters. Kubernetes, announced by Google in 2014, is an open source system for managing containerized applications across multiple hosts. In practice, it makes it easier to deploy across multiple cloud providers and on-premise as well.

In summary, we have seen a transition from on-premise, to cloud computing and more recently Kubernetes which have been also described as the private cloud, the public cloud and the hybrid cloud respectevely. This chapter will walk you through each cluster computing trend in the context of Spark and R.

6.2 On-Premise

As mentioned in the overview section, on-premise clusters represent a set of computing instances procured and managed by staff members from your organization. These clusters can be highly customized and controlled; however, they can also incur higher initial expenses and maintenance costs.

When using On-Premise Spark clusters, there are two concepts you should consider:

  • Cluster Manager: In a similar way as to how an Operating Systems (like Windows os OS X) allows you to run multiple applications in the same computer; a cluster manager allows multiple applications to be run in the same cluster. You will have to choose one yourself when working with On-Premise clusters.
  • Spark Distribution: While you can install Spark from the Apache Spark site, many companies partner with companies that can provide support and enhancements to Apache Spark which we often reffer as, Spark distributions.

6.2.1 Managers

In order to run Spark within a computing cluster, you will need to run software capable of initializing Spark over each pyshical machine and register all the available computing nodes, this software is known as a cluster manager. The available cluster managers in Spark are: Spark Standalone, YARN, Mesos and Kubernetes.

Note: In distributed systems and clusters literature, we often refer to each physical machine as a compute instance, compute node, instance or node.

6.2.1.1 Standalone

In Spark Standalone, Spark uses itself as its own cluster manager, which allows you to use Spark without installing additional software in your cluster. This can be useful if you are planning to use your cluster to only run Spark applications; if this cluster is not dedicated to Spark, a generic cluster manager like YARN, Mesos or Kubernetes would be more suitable. The Spark Standalone documentation is available under spark.apache.org (“Spark Standalone Mode” 2019) and contains detailed information on configuring, launching, monitoring and enabling high-availability, see Figure 6.2.

Spark Standalone Site

FIGURE 6.2: Spark Standalone Site

However, since Spark Standalone is contained within a Spark installation; then, by completing the Getting Started chapter, you have now a Spark installation available that you can use to initialize a local Spark Standalone cluster in your own machine. In practice, you would want to start the worker nodes in different machines but, for simplicity, we will present the code to start a standalone cluster in a single machine.

First, retrieve the SPARK_HOME directory by running spark_home_dir() then, run start-master.sh and start-slave.sh as follows:

# Retrieve the Spark installation directory
spark_home <- spark_home_dir()

# Build path to start-master.sh
start_master <- file.path(spark_home, "sbin", "start-master.sh")

# Execute start-master.sh to start the cluster manager master node
system2(start_master)

# Build path to start-slave
start_slave <- file.path(spark_home, "sbin", "start-slave.sh")

# Execute start-slave.sh to start a worker and register in master node
system2(start_slave, paste0("spark://", system2("hostname", stdout = TRUE), ":7077"))

The previous command initialized the master node and a worker node, the master node interface can be accessed under localhost:8080 as captured in Figure 6.3:

Spark Standalone Web Interface.

FIGURE 6.3: Spark Standalone Web Interface.

Notice that there is one worker register in Spark standalone, you can follow the link to this worker node to see, Figure 6.4, details for this particular worker like available memory and cores.

Spark Standalone Worker Web Interface

FIGURE 6.4: Spark Standalone Worker Web Interface

Once you are done performing computations in this cluster, you can simply stop all the running nodes in this local cluster by running:

# Build path to stop-all
stop_all <- file.path(spark_home, "sbin", "stop-all.sh")

# Execute stop-all.sh to stop the workers and master nodes
system2(stop_all)

A similar approach can be followed to configure a cluster by running each start-slave.sh command over each machine in the cluster.

Note: When running on a Mac, if you hit: ssh: connect to host localhost port 22: Connection refused, you will need to manually turn off the workers using system2("jps") to list the running Java process and then, system2("kill", c("-9", "<process id>")) to stop the specific workers.

6.2.1.2 Yarn

YARN for short, or Hadoop YARN, is the resource manager of the Hadoop project. It was originally developed in the Hadoop project but, refactored into it’s own project in Hadoop 2. As we mentioned in in the introduction chapter, Spark was built to speed up computation over Hadoop and therefore, it’s very common to find Spark intalled on Hadoop clusters.

One advantage of YARN, is that it is likely to be already installed in many existing clusters that support Hadoop; which means that you can easily use Spark with many existing Hadoop clusters without requesting any major changes to the existing cluster infrastructure. It is also very common to find Spark deployed in YARN clusters since many started out as Hadoop clusters that were eventually upgraded to also support Spark.

YARN applications can be submitted in two modes: yarn-client and yarn-cluster. In yarn-cluster mode the driver is running remotely (potentially), while in yarn-client mode, the driver is running locally, both modes are supported and are explained further in the connections chapter.

YARN provides a resource management user interface useful to access logs, monitor available resources, terminate applications, etc. Once connecting to Spark from R, you will be able to manage the running application in YARN, this is shown in Figure 6.5.

YARN's Resource Manager running sparklyr application

FIGURE 6.5: YARN’s Resource Manager running sparklyr application

Since YARN is the cluster manager from the Hadoop project, YARN’s documentation can be found under the hadoop.apache.org (“Apache Hadoop Yarn” 2019), you can also reference the “Running Spark on YARN” guide from spark.apache.org (???).

6.2.1.3 Mesos

Apache Mesos is an open-source project to manage computer clusters. Mesos began as a research project in the UC Berkeley RAD Lab and makes use of Linux Cgroups to provide isolation for CPU, memory, I/O and file system access.

Mesos, like YARN, supports executing many cluster frameworks, including Spark. However, one advantage particular to Mesos is that, it allows cluster framework like Spark to implement custom task schedulers. An scheduler is the component that coordinates in a cluster which applications get execution time and which resources are assign to them. Spark uses a coarse-grained scheduler (“Spark on Mesos” 2018) which schedules resources for the duration of the application; however, other frameworks might use Mesos’ fine-grained scheduler, which can increase the overall efficiency in the cluster by scheduling tasks in shorter intervals allowing them to share resources between them.

Mesos provides a web interface to manage your running applications, resources, and so on. After connecting to Spark from R, your application will be registered like any other application running in Mesos, Figure 6.6 shows a successful connection to Spark from R.

Mesos web interface running Spark from R

FIGURE 6.6: Mesos web interface running Spark from R

Mesos is an Apache project with its documentation available under mesos.apache.org. The “Running Spark on Mesos” guide from spark.apache.org is also a great resource if you choose to use Mesos as your cluster manager.

6.2.2 Distributions

One can use a cluster manager in on-premise clusters as described in the previous section; however, many organizations choose to partner with companies providing additional management software, services and resources to help manage applications in their cluster; including, but not limited to, Apache Spark. Some of the on-premise cluster providers include: Cloudera, Hortonworks and MapR to mention a few which we will be briefly introduce next.

Cloudera, Inc. is a United States-based software company that provides Apache Hadoop and Apache Spark-based software, support and services, and training to business customers. Cloudera’s hybrid open-source Apache Hadoop distribution, CDH (Cloudera Distribution Including Apache Hadoop), targets enterprise-class deployments of that technology. Cloudera donates more than 50% of its engineering output to the various Apache-licensed open source projects (Apache Hive, Apache Avro, Apache HBase, and so on) that combine to form the Apache Hadoop platform. Cloudera is also a sponsor of the Apache Software Foundation (“Cloudera Wikipedia” 2018).

Cloudera clusters make use of parcels, which are binary distributions containing program files and metadata (“Cloudera Documentation” 2018), Spark happens to be installed as a parcel in Cloudera. It’s beyond the scope of this book to present how to configure Cloudera clusters, resources and documentation can be found under cloudera.com, and “Introducing sparklyr, an R Interface for Apache Spark” (“Cloudera Engineering” 2016) under Cloudera’s Engineering Blog.

Cloudera provides the Cloudera Manager web interface to manage resource, services, parcels, diagnostics, etc. Figure 6.7 shows a Spark parcel running in Cloduera Manager which you can later use to connect from R.

Cloudera Manager running Spark parcel

FIGURE 6.7: Cloudera Manager running Spark parcel

sparklyr is certified with Cloudera (“Cloudera Partners” 2017); meaning that, Cloudera’s support is aware of sparklyr and can be effective helping organizations that are using Spark and R, the following table summarizes the versions currently certified.

Cloudera Version Product Version Components Kerberos
CDH5.9 sparklyr 0.5 HDFS, Spark Yes
CDH5.9 sparklyr 0.6 HDFS, Spark Yes
CDH5.9 sparklyr 0.7 HDFS, Spark Yes

Hortonworks is a big data software company based in Santa Clara, California. The company develops, supports, and provides expertise on an expansive set of entirely open source software designed to manage data and processing for everything from IOT, to advanced analytics and machine learning. Hortonworks believes it is a data management company bridging the cloud and the datacenter (“Hortonworks Wikipedia” 2018).

Hortonworks partnered with Microsoft (“Hortonworks Microsoft” 2018) to improve support in Microsoft Windows for Hadoop and Spark, this used to be a differentiation point; however, comparing Hortonworks and Cloudera is less relevant today since both companies are merging in 2019 (“Hortonworks Cloudera” 2018). While the companies are merging, support for the Cloudera and Hortonworks Spark distributions are still available. Additional resources to configure Spark under Hortonworks are available under hortonworks.com.

MapR is a business software company headquartered in Santa Clara, California. MapR provides access to a variety of data sources from a single computer cluster, including big data workloads such as Apache Hadoop and Apache Spark, a distributed file system, a multi-model database management system, and event stream processing, combining analytics in real-time with operational applications. Its technology runs on both commodity hardware and public cloud computing services (“MapR Wikipedia” 2018).

6.3 Cloud

If you don’t have an on-prem cluster nor spare machines to reuse, starting with a cloud cluster can be quite convenient since it will allow you to access a proper cluster in a matter of minutes. This section will briefly mention some of the major cloud infrastructure providers and give you resources to help you get started if you choose to use a cloud provider.

In cloud services, the compute instances are billed for as long the Spark cluster runs; you start getting billed when the cluster launches and stops when the cluster stops. This cost needs to be multiplied by the number of instances reserved for your cluster. SO for instance, if a cloud provider chargets $1.00USD per compute instance per hour and you start a three node cluster that you use for one hour and 10 minutes; it is likely that you’ll get billed $1.00 * 2 hours * 3 nodes = $6.00. Some cloud providers charge per minute but, at least, you can rely on all of them charging per compute hour.

Please be aware that, while compute costs can be quite low for small clusters, accidentally leaving a cluster running can cause significant billing expenses. Therefore, is is worth taking the extra time to check twice that your cluster is terminated when you no longer need it. It’s also a good practice to monitor costs daily while using clusters to make sure your expectations match the daily bill.

From past experience, you should also plan to request compute resources in advance while dealing with large-scale projects; various cloud providers will not allow you to start a cluster with hundreds of machines before requesting them explicitly through a support request. While this can be cumbersome, it’s also a way to help you controll costs in your organization.

Since the cluster size is flexible, it is a good practice to start with small clusters and scale compute resources as needed. Even if you know in advance that a cluster of significant size will be required, starting small provides an opportunity to troubleshoot issues at a lower cost since it’s unlikely that your data analysis will run at scale flawlessly on the first try. As a rule oh thumb, grow the instances exponentially; if you need to run a computation over an eight node cluster, start with one node and an eighth of the entire dataset, then two nodes with a fourth, then four nodes with a half the dataset and then, finally, eight nodes and the entire dataset. As you become more experienced, you’ll develop a good sense of how to troubleshoot issues, the size of the required cluster and you’ll be able to skip intermediate steps, but for starters, this is a good practice to follow.

One can also use a cloud provider to acquire bare computing resources and then, install the on-premise distributions presented in the previous section yourself; for instance, you can run the Cloudera distribution on Amazon Elastic Compute Cloud (EC2). This model would avoid procuring colocated hardware, but still allow you to closely manage and customize your cluster. This book presents an overview of only the fully-managed Spark services available by cloud providers; however, you can usually find with ease instructions online on how to install on-premise distributions in the cloud.

Some of the major providers of cloud computing infrastructure are: Amazon, Databricks, Google, IBM and Microsoft that this section will briefly introduce.

6.3.1 Amazon

Amazon provides cloud services through Amazon Web Services(Amazon AWS); more specifically, provides an on-demand Spark cluster through Amazon Elastic MapReduce or EMR for short,

Detailed instructions on using R with Amazon EMR was published under Amazon’s Big Data Blog: “Running sparklyr on Amazon EMR” (“AWS Blog” 2016), this post introduced the launch of sparklyr and instructions to configure EMR clusters with sparklyr. For instance, it suggests you can use the Amazon Command Line Interface to launch a cluster with three nodes as follows:

aws emr create-cluster --applications Name=Hadoop Name=Spark Name=Hive \
  --release-label emr-5.8.0 --service-role EMR_DefaultRole --instance-groups \
  InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.2xlarge \
  InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.2xlarge \ 
  --bootstrap-action Path=s3://aws-bigdata-blog/artifacts/aws-blog-emr-\
rstudio-sparklyr/rstudio_sparklyr_emr5.sh,Args=["--user-pw", "<password>", \
  "--rstudio", "--arrow"] --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole

You can then see the cluster launching, and eventually running under the AWS portal, see Figure 6.8.

Launching an Amazon EMR Cluster

FIGURE 6.8: Launching an Amazon EMR Cluster

You can then navigate to the Master Public DNS and find RStudio under port 8787, for instance: ec2-12-34-567-890.us-west-1.compute.amazonaws.com:8787, and then login with user hadoop and password <password>.

It is also possible to launch the EMR cluster using the web interface, the same introductory post contains additional details and walkthroughs specifically designed for EMR.

Please remember to turn off your cluster to avoid unnecessary charges and use appropriate security restrictions when starting EMR clusters for sensitive data analysis.

Regarding cost, the most up to date information can be found under aws.amazon.com/emr/pricing. As of this writing, these are some of the instance types available in the us-west-1 region, it is meant to provide a glimpse of the resources and costs associated with cloud processing. Notice that the “EMR price is in addition to the Amazon EC2 price (the price for the underlying servers)”.

Instance CPUs Memory Storage EC2 Cost EMR Cost
c1.medium 2 1.7GB 350GB $0.148 USD/hr $0.030 USD/hr
m3.2xlarge 8 30GB 160GB $0.616 USD/hr $0.140 USD/hr
i2.8xlarge 32 244GB 6400GB $7.502 USD/hr $0.270 USD/hr

Note: We are only presentring a subset of the available compute instances for Amazon and subsequent cloud providers during 2019; however, please note that hardware (CPU speed, hard drive speed, etc.) varies between vendors and locations; therefore, you can’t use these hardware tables as an accurate price comparison, an accurate comparison would require running your particular workloads and considering other aspects beyond compute instance cost.

6.3.2 Databricks

Databricks is a company founded by the creators of Apache Spark, that aims to help clients with cloud-based big data processing using Spark. Databricks grew out of the AMPLab project at University of California, Berkeley (“Databricks Wikipedia” 2018).

Databricks provides enterprise-level cluster computing plans, while also providing a free/community tear to explore functionality and get familiar with their environment.

Once a cluster is launched, R and sparklyr can be used from Databricks notebooks following the steps from the Getting Started chapter or, by installing RStudio on Databricks (“Databricks Rstudio” 2018). Figure 6.9 shows a Databricks notebook using Spark through sparkylr.

Databricks community notebook running sparklyr

FIGURE 6.9: Databricks community notebook running sparklyr

Additional resources are available under the Databricks Engineering Blog post: “Using sparklyr in Databricks” (“Databricks Blog” 2017) and the “Databricks Documentation for sparklyr” (“Databricks Documentation” 2018).

The latest pricing information can be found under databricks.com/product/pricing, as of this writing, available plans

Plan Basic Data Engineering Data Analytics
AWS Standard $0.07 USD/DBU $0.20 USD/DBU $0.40 USD/DBU
Azure Standard $0.20 USD/DBU $0.40 USD/DBU
Azure Premium $0.35 USD/DBU $0.55 USD/DBU

Notice that pricing is based on cost of DBU/hr. From Databricks, “A Databricks Unit (DBU) is a unit of Apache Spark processing capability per hour. For a varied set of instances, DBUs are a more transparent way to view usage instead of the node-hour” (“Databricks Units” 2018).

6.3.3 Google

Google provides Gooble Cloud Dataproc as a cloud-based managed Spark and Hadoop service offered on Google Cloud Platform. Dataproc utilizes many Google Cloud Platform technologies such as Google Compute Engine and Google Cloud Storage to offer fully managed clusters running popular data processing frameworks such as Apache Hadoop and Apache Spark (“Dataproc Wikipedia” 2018).

A cluster can be easily created from the Google Cloud console or the Google Cloud command line interface as illustrated in Figure 6.10.

Launching a Dataproc cluster

FIGURE 6.10: Launching a Dataproc cluster

Once created, ports can be forwarded to allow you to access this cluster from your machine; for instance, by launching Chrome to make use of this proxy and securely connect to the Dataproc cluster. Configuring this connection looks as follows:

gcloud compute ssh sparklyr-m --project=<project> --zone=<region> -- -D 1080 \
  -N "<path to chrome>" --proxy-server="socks5://localhost:1080" \
  --user-data-dir="/tmp/sparklyr-m" http://sparklyr-m:8088

There are various tutorials available under cloud.google.com/dataproc/docs/tutorials, including, a comprehensive tutorial to configure RStudio and sparklyr (“Dataproc Sparklyr” 2018).

The latest pricing information can be found under cloud.google.com/dataproc/pricing. Notice that the cost is split between Compute Engine and a Dataproc Premium.

Instance CPUs Memory Compute Engine Dataproc Premium
n1-standard-1 1 3.75GB $0.0475 USD/hr $0.010 USD/hr
n1-standard-8 8 30GB $0.3800 USD/hr $0.080 USD/hr
n1-standard-64 64 244GB $3.0400 USD/hr $0.640 USD/hr

6.3.4 IBM

IBM cloud computing is a set of cloud computing services for business offered by the information technology company IBM. IBM cloud includes infrastructure as a service (IaaS), software as a service (SaaS) and platform as a service (PaaS) offered through public, private and hybrid cloud delivery models, in addition to the components that make up those clouds (“IBM Cloud Wikipedia” 2018).

From within IBM Cloud, open Watson Studio and create a Data Science project, add a Spark cluster under the project settings and launch RStudio from the Launch IDE menu. Please note that, as of this writting, the provided version of sparklyr was not the latest version available in CRAN, since sparklyr was modified to run under the IBM Cloud. In any case, please follow IBMs documentation as an authoritative reference to run R and Spark on the IBM Cloud and particularily, on how to upgrade sparklyr appropiately. Figure 6.11 captures IBM’s Cloud portal launching a Spark cluster.

IBM Watson Studio launching Spark with R support

FIGURE 6.11: IBM Watson Studio launching Spark with R support

The most up to date pricing information is available under ibm.com/cloud/pricing. In the following table, compute cost was normalized using 31 days from the per-month costs.

Instance CPUs Memory Storage Cost
C1.1x1x25 1 1GB 25GB $0.033 USD/hr
C1.4x4x25 4 4GB 25GB $0.133 USD/hr
C1.32x32x25 32 25GB 25GB $0.962 USD/hr

6.3.5 Microsoft

Microsoft Azure is a cloud computing service created by Microsoft for building, testing, deploying, and managing applications and services through a global network of Microsoft-managed data centers. It provides software as a service (SaaS), platform as a service (PaaS) and infrastructure as a service (IaaS) and supports many different programming languages, tools and frameworks, including both Microsoft-specific and third-party software and systems (“Azure Wikipedia” 2018).

From the Azure portal, the Azure HDInsight service provides support for on-demand Spark clusters. An HDInsight cluster with support for Spark and RStudio can be easily created by selecting the ML Services cluster type. Please note that the provided version of sparklyr might not be the latest version available in CRAN since the default package repo seems to be initialized using an MRAN (Microsoft R Application Network) snapshot, not directly from CRAN. Figure 6.12 shows the Azure portal launching an Spark cluster with support for R.

Creating an Azure HDInsight Spark Cluster

FIGURE 6.12: Creating an Azure HDInsight Spark Cluster

Up to date pricing for HDInsight is available under azure.microsoft.com/en-us/pricing/details/hdinsight.

Instance CPUs Memory Total Cost
D1 v2 1 3.5 GB $0.074/hour
D4 v2 8 28 GB $0.59/hour
G5 64 448 GB $9.298/hour

6.4 Kubernetes

Kubernetes is an open-source container-orchestration system for automating deployment, scaling and management of containerized applications that was originally designed by Google and now maintained by the Cloud Native Computing Foundation. Kubernetes was originally based on Docker while, like Mesos, it’s also based on Linux Cgroups.

Kubernetes can execute many cluster applications and frameworks that can be highly customized by using container images with specific resources and libraries. This allows a single Kubernetes cluster to be used for many different purposes beyond data analysis, which in turn helps organizations manage their compute resources with ease. One trade off from using custom images is that they add additional configuration overhead but make kubernetes clusters extremely flexible. Nevertheless, this flexibility has proven to be instrumental to administrate with ease cluster resources in many organizations and, as shown in the overview section, it’s becoming a very popular cluster framework.

Kubernetes is supported across all major cloud providers. They all provide extensive documentation as to how to launch, manage and teard down Kubernetes clusters; Figure 6.13 shows Google Gloud’s console while creating a Kubernetes cluster. Spark can be deployed over any Kubernetes cluster and R used to connect, analyze, model and so on.

Creating a Kubernetes cluster for Spark and R using Google Cloud

FIGURE 6.13: Creating a Kubernetes cluster for Spark and R using Google Cloud

You can learn more about kubernetes.io, and the “Running Spark on Kubernetes” guide from spark.apache.org.

Strictly speaking, Kubernetes is a cluster technology not an specific cluster architecture. However, Kubernetes represents a larger trend often refered as a hybrid cloud. A hybrid cloud is a computing environment that makes use of on-premises and public cloud services with orchestration between the various platforms. It’s still early to precesily categorize the leading technologies that will form a hybrid approach to cluster computing; while Kubernetes is the leading one, many more are likely to form to complement or even replace existing technologies.

6.5 Tools

While using only R and Spark can be sufficient for some clusters, it is common to install complementary tools in your cluster to improve: monitoring, sql analysis, workflow coordination, etc. with applications like Ganglia, Hue and Oozie respectively. This section is not meant to cover all, but rather mention the ones that are commonly to use.

6.5.1 RStudio

From reading the Introduction chapter, you are aware that RStudio is a well known, free, desktop development environment for R; therefore, it is likely that you are following the examples in this book using RStudio Desktop; however, you might not be aware that RStudio can also be run as a web service inside an Spark cluster, this version of RStudio is known as RStudio Server. You can see RStudio Server running in Figure 6.14. In the same way that the Spark UI runs in the cluster, RStudio Server can be installed inside the cluster, then you can connect to RStudio Server and use RStudio in exactly the same way you use RStudio Desktop but with the ability to run code against the Spark cluster. As you can see on the following image, RStudio Server is running on a web browser inside a Spark cluster; it looks and feels just like RStudio Desktop, but adds support to run commands efficiently by being located within the cluster.

RStudio Server Pro running inside Apache Spark

FIGURE 6.14: RStudio Server Pro running inside Apache Spark

For those familiar with R, Shiny is a very popular tool for building interactive web applications from R; which it is also recommended you install directly in your Spark cluster.

RStudio Server and Shiny Server are a free and open source; however, RStudio also provides professional producs, like: RStudio Server, RStudio Server Pro (“RStudio Server Pro” 2019), Shiny Server Pro (“Shiny Server Pro” 2019) and RStudio Connect (“RStudio Connect” 2019) which can be installed within the cluster to support additional R workflows, while sparklyr does not require any additional tools, they provide significant productivity gains worth considering. You can learn more about them at rstudio.com/products/.

6.5.2 Jupyter

Project Jupyter exists to develop open-source software, open-standards, and services for interactive computing across dozens of programming languages. A Jupyter notebook, provide support for various programming languages, including R. sparklyr can be used with Jupyter notebooks using the R Kernel. Figure 6.15 shows sparklyr running inside a local Jupyter notebook.

Jupyter notebook running sparklyr

FIGURE 6.15: Jupyter notebook running sparklyr

6.5.3 Livy

Apache Livy is an incubation project in Apache providing support to use Spark clusters remotely through a web interface. It is ideal to connect directly into the Spark cluster; however, there are times where connecting directly to the cluster is not feasible. When facing those constraints, one can consider installing Livy in their cluster and secure it properly to enable remote use over web protocols. However, there is a significant performance overhead from using Livy in sparklyr.

To help test Livy locally, sparklyr provides support to list, install, start and stop a local Livy instance by executing:

##    livy
## 1 0.2.0
## 2 0.3.0
## 3 0.4.0
## 4 0.5.0

Which lists the versions that you can install, we recommend installing the latest version and verifying the installed version as follows

# Install default Livy version
livy_install()

# List installed Livy services
livy_installed_versions()

# Start the Livy service
livy_service_start()

You can then navigate to this local Livy session under http://localhost:8998, the Livy Connections section will detail how to connect to this local instance and also proper clusters with Livy enabled, once connected, you can navigate to the Livy web application as captured by Figure 6.16.

Apache Livy running as a local service

FIGURE 6.16: Apache Livy running as a local service

Make sure you also stop the Livy service when working with local Livy instances, for proper Livy services running in a cluster, you won’t have to.

# Stops the Livy service
livy_service_stop()

6.6 Recap

This chapter explained the history and tradeoffs of on-premise, cloud computing and presented Kubernetes as a promising framework to provide flexibility across on-premise and cloud providers. It also introduced cluster managers (Spark Standalone, YARN, Mesos and Kubernetes) as the software needed to run Spark as a cluster application. This chapter briefly mentioned on-premise cluster providers like Cloudera, Hortonworks and MapR as well as the major cloud providers: Amazon, Google and Microsoft.

While this chapter provided a solid foundation to understand current cluster computing trends, tools and providers useful to perform data science at scale; it did not provide a comprehensive framework to decide which cluster technologies to choose. Instead, use this chapter as an overview and a starting point to reach out to additional resources to help you find the cluster stack that best fits your organization needs.

The next chapter, connections, will focus on understanding how to connect to existing clusters; therefore, it assumes a Spark cluster like the ones presented in this chapter, is already available to you.

References

“Cloudera Wikipedia.” 2018. https://en.wikipedia.org/wiki/Cloudera.

“Hortonworks Wikipedia.” 2018. https://en.wikipedia.org/wiki/Hortonworks.

“Hortonworks Microsoft.” 2018. https://hortonworks.com/partner/microsoft/.

“MapR Wikipedia.” 2018. https://en.wikipedia.org/wiki/MapR.

“Databricks Wikipedia.” 2018. https://en.wikipedia.org/wiki/Databricks.

“Databricks Documentation.” 2018. https://docs.databricks.com/spark/latest/sparkr/sparklyr.html.

“Dataproc Wikipedia.” 2018. https://en.wikipedia.org/wiki/Google_Cloud_Dataproc.

“IBM Cloud Wikipedia.” 2018. https://en.wikipedia.org/wiki/IBM_cloud_computing.

“Azure Wikipedia.” 2018. https://en.wikipedia.org/wiki/Microsoft_Azure.

“RStudio Server Pro.” 2019. https://www.rstudio.com/products/rstudio-server-pro/.

“RStudio Connect.” 2019. https://www.rstudio.com/products/connect/.