Chapter 1 Introduction

“You know nothing, Jon Snow.”

— Ygritte

With information growing at exponential rates, it’s no surprise that historians are referring to this period of history as the Information Age. The increasing speed at which data is being collected has created new opportunities and is certainly staged to create even more. This chapter presents the tools that have been used to solve large-scale data challenges and introduces Apache Spark as a leading tool that is democratizing our ability to process large datasets. We will then introduce the R computing language, which was specifically designed to simplify data analysis. It is then natural to ask what the outcome would be from combining the ease of use provided by R, with the compute power available through Apache Spark. This will lead us to introduce sparklyr, a project merging R and Spark into a powerful tool that is easily accessible to all.

The next chapter, Getting Started, will present the prerequisites, tools and steps you will need to perform to get Spark and R working in your personal computer. You will learn how to install Spark, initialize Spark, get introduced to common operations and help you get your very first data processing and modeling task done. It is the goal of that chapter to help anyone grasp the concepts and tools required to start tackling large-scale data challenges which, until recently, were only accessible to just a few organizations.

You will then move on into learning how to analyze large-scale data, followed by building models capable of predicting trends and discover information hidden in vast amounts of information. At which point, you will have the tools required to perform data analysis and modeling at scale. Subsequent chapters will help you move away from your local computer into computing clusters required to solve many real world problems. The last chapters will present additional topics, like real time data processing and graph analysis, which you will need to truly master the art of analyzing data at any scale. The last chapter of this book will give you tools and inspiration to consider contributing back to the Spark and R communities.

We hope this is a journey you will enjoy, that will help you solve problems in your professional career to nudge the world into taking better decisions that can benefit us all.

1.1 Overview

As humans, we have been storing, retrieving, manipulating, and communicating information since the Sumerians in Mesopotamia developed writing in about 3000 BC. Based on the storage and processing technologies employed, it is possible to distinguish four distinct phases of development: pre-mechanical (3000 BC – 1450 AD), mechanical (1450–1840), electromechanical (1840–1940), and electronic (1940–present). (Laudon, Traver, and Laudon 1996)

Mathematician George Stibitz used the word digital to describe fast electric pulses back in 1942 (Ceruzzi 2012) and, to this day, we describe information stored electronically as digital information. In contrast, analog information represents everything we have stored by any non-electronic means such as hand written notes, books, newspapers, and so on. (Webster 2006)

The world bank report on digital development provides an estimate of digital and analog information stored over the last decades. (Group 2016) This report noted that digital information surpassed analog information around 2003. At that time, there were about 10 million terabytes of digital information, which is roughly about 10 million storage drives today. However, a more relevant finding from this report was that our footprint of digital information is growing at exponential rates. Figure ?? shows the findings of this report, notice that every other year, the world’s information has grown tenfold.

World’s capacity to store information||intro-store-capacity

World’s capacity to store information||intro-store-capacity

With the ambition to provide tools capable of searching all this new digital information, many companies attempted to provide such functionality with what we know today as search engines, used when searching the web. Given the vast amount of digital information, managing information at this scale was a challenging problem. Search engines were unable to store all the web page information required to support web searches in a single computer. This meant that they had to split information into several files and store them across many machines. This approach became known as the Google File System, a research paper published in 2003 by Google. (Ghemawat, Gobioff, and Leung 2003)

1.2 Hadoop

One year later, Google published a new paper describing how to perform operations across the Google File System, this approach came to be known as MapReduce. (Dean and Ghemawat 2004) As you would expect, there are two operations in MapReduce: Map and Reduce. The map operation provides an arbitrary way to transform each file into a new file, while the reduce operation combines two files. Both operations require custom computer code, but the MapReduce framework takes care of automatically executing them across many computers at once. These two operations are sufficient to process all the data available in the web, while also providing enough flexibility to extract meaningful information from it.

For example, as illustrated in Figure ??, we can use MapReduce to count words in two different text files stored in different machines. The mapping operation splits each word in the original file and outputs a new word-counting file with a mapping of words and counts. The reduce operation can be defined to take two word-counting files and combine them by aggregating the totals for each word, this last file will contain a list of word counts across all the original files.

MapReduce example counting words across files||intro-mapreduce-example

MapReduce example counting words across files||intro-mapreduce-example

Counting words is often the most basic MapReduce example, but it can be also used for much more sophisticated and interesting applications. For instance, MapReduce can be used to rank web pages in Google’s PageRank algorithm, which assigns ranks to web pages based on the count of hyperlinks linking to a web page and the rank of the page linking to it.

After these papers were released by Google, a team in Yahoo worked on implementing the Google File System and MapReduce as a single open source project. This project was released in 2006 as Hadoop with the Google File System implemented as the Hadoop File System, or HDFS for short. The Hadoop project made distributed file-based computing accessible to a wider range of users and organizations, making MapReduce useful beyond web data processing.

While Hadoop provided support to perform MapReduce operations over a distributed file system, it still required MapReduce operations to be written with code every time a data analysis was run. To improve over this tedious process, the Hive project released in 2008 by Facebook, brought Structured Query Language (SQL) support to Hadoop. This meant that data analysis could now be performed at large-scale without the need to write code for each MapReduce operation; instead, one could write generic data analysis statements in SQL that are much easier to understand and write.

1.3 Spark

In 2009, Apache Spark began as a research project at the UC Berkeley’s AMPLab to improve on MapReduce. Specifically, by providing a richer set of verbs beyond MapReduce that facilitate optimizing code running in multiple machines, and by loading data in-memory making operations much faster than Hadoop’s on-disk storage. One of the earliest results showed that running logistic regression, a data modeling technique that will be introduced under the Modeling chapter, allowed Spark to run 10 times faster than Hadoop by making use of in-memory datasets (Zaharia et al. 2010), a chart similar to Figure ?? was presented in the original research publication.

Logistic regression performance in Hadoop and Spark||intro-spark-logistic-regression

Logistic regression performance in Hadoop and Spark||intro-spark-logistic-regression

While Spark is well known for its in-memory performance, Spark was designed to be a general execution engine that works both in-memory and on-disk. For instance, Spark has set new records in large-scale sorting (“Sort Benchmark” 2014), where data was not loaded in-memory; but rather, Spark made use of improvements in network serialization, network shuffling and efficient use of the CPU’s cache to dramatically improve performance. If you needed to sort large amounts of data, there was no other system in the world faster than Spark.

To give you a sense of how much faster and efficient Spark is, you can sort 100 terabytes of data in 72min and 2100 computers using Hadoop, but only 206 computers in 23 minutes using Spark (“Apache Spark Officially Sets a New Record in Large-Scale Sorting” 2014). In addition, Spark holds the cloud sorting benchmark (“Spark Wins Cloudsort Benchmark as the Most Efficient Engine” 2016), which made Spark the most cost effective solution for sort large-datasets in the cloud.

Hadoop Record Spark Record
Data Size 102.5 TB 100 TB
Elapsed Time 72 mins 23 mins
Nodes 2100 206
Cores 50400 6592
Disk 3150 GB/s 618 GB/s
Network 10Gbps 10Gbps
Sort rate 1.42 TB/min 4.27 TB/min
Sort rate / node 0.67 GB/min 20.7 GB/min

Spark is also easier to use than Hadoop; for instance, the word-count MapReduce example takes about fifty lines of code in Hadoop, but it only takes two lines of code in Spark – Spark is much faster, efficient and easier to use than Hadoop.

In 2010, Spark was released as an open source project and then donated to the Apache Software Foundation in 2013. Spark is licensed under the Apache 2.0, which allows you to freely use, modify, and distribute it. Spark then reached over 1000 contributors, making it one of the most active projects in the Apache Software Foundation.

This gives an overview of how Spark came to be, which we can now use to formally introduce Apache Spark as follows:

“Apache Spark is a unified analytics engine for large-scale data processing.”

To help us understand this definition of Apache Spark, we will break it down as follows:

Spark supports many libraries, clusters technologies and storage systems.
Analytics is the discovery and interpretation of data to produce and communicate information.
Spark is expected to be efficient and generic.
One can interpret large-scale as cluster-scale, a set of connected computers working together.

Spark is described as an engine since it’s generic and efficient, it optimizes and executes generic code; there are no restrictions as to what type of code one can write in Spark. It is also quite efficient, as we mentioned, Spark is much faster than other technologies by making efficient use of memory, network and CPUs to speed data processing algorithms in computing clusters.

This makes Spark ideal in many analytics projects like ranking movies at Netflix (“Netflix at Spark” 2018), aligning protein sequences (“Bioinformatics Applications on Apache Spark” 2018) or analyzing high energy physics at CERN. (“Apache Spark and Cern Open Data Analysis, an Example” 2017)

As a unified platform, Spark is expected to support many cluster technologies and multiple data sources, which you will learn about in the Clusters and Data chapters. It is also expected to support many different libraries like Spark SQL, MLlib, GraphX and Spark Streaming; libraries which you can use for analysis, modeling, graph processing and real-time data processing respectively. You can think of Spark as a platform providing access to clusters, data sources and libraries for large-scale computing as illustrated in Figure ??.

Spark as a unified analytics engine for large-scale data processing||intro-spark-unified

Spark as a unified analytics engine for large-scale data processing||intro-spark-unified

Describing Spark as large scale implies that a good use case for Spark is tackling problems that can be solved with multiple machines. For instance, when data does not fit in a single disk driver or does not fit into memory, Spark is a good candidate to consider. However, it can also be considered for problems that may not be large-scale, but where using multiple computers could speed up computation. For instance, CPU intensive models or scientific simulations also benefit from running in Spark.

Therefore, Spark is good at tackling large-scale data processing problems, this usually known as big data (datasets that are more voluminous and complex that traditional ones (“Big Data Wikipedia” 2019)) but also is good at tackling large-scale computation problems, known as big compute (tools and approaches using a large amount of CPU and memory resources in a coordinated way (“Big Compute Nimbix” 2019)). Big data often requires big compute but big compute does not necessarily requires big data. (“Big Compute Vs Big Data” 2013)

Big data and big compute problems are usually easy to spot – if the data does not fit into a single machine, you might have a big data problem; if the data fits into a single machine but processing it takes days, weeks or even months to compute, you might have a big compute problem.

However, there is also a third problem space where neither data nor compute are necessarily large-scale and yet, there are significant benefits to using cluster computing frameworks like Spark. For this third problem space, there are a few use cases this breaks to:

Suppose you have a dataset of 10 gigabytes in size and a process that takes 30 minutes to run over this data – this is by no means big compute nor big data. However, if you happen to be researching ways to improve the accuracy of your models, reducing the runtime down to 3 minutes is a significant improvement, which can lead to significant advances and productivity gains by increasing the velocity at which you can analyze data. Alternatively, you might need to process data faster, for stock trading for instance, while 3 minutes could seem fast enough; it can be way too slow for real-time data processing, where you might need to process data in a few seconds – or even down to a few milliseconds.
You could also have an efficient process to collect data from many sources into a single location, usually a database, this process could be already running efficiently and close to real-time. Such processes are known at ETL (Extract-Transform-Load); data is extracted from multiple sources, transformed to the required format and loaded into a single data store. While this has worked for years, the tradeoff from this approach is that adding a new data source is expensive. Since the system is centralized and tightly controlled, making changes could cause the entire process to halt; therefore, adding new data source usually takes too long to be implemented. Instead, one can store all data its natural format and process it as needed using cluster computing, this architecture is currently known as a data lake. In addition, storing data in its raw format allows you to process a variety of new file formats like images, audio and video; without having to figure out how to fit them into conventional structured storage systems.
When using many data sources, their data quality might vary greatly between them; which requires special analysis methods to improve its accuracy. For instance, suppose you have a table of cities with values like San Francisco, Seattle and Boston, what happens when data contains a misspelled entry like “Bston”? In a relational database, this invalid entry might get dropped; however, dropping values is not necessarily the best approach in all cases, you might want to correct this field by making use of geocodes, cross referencing data sources or attempting a best-effort match. Therefore, understanding and improving the veracity of the original data source can lead to more accurate results.

If we include “Volume” as a synonym for big data, you get the mnemonics people refer as the four ’V’s of big data; others have gone as far as expanding this to five or even as the 10 Vs of Big Data. Mnemonics aside, cluster computing is being used today in more innovative ways and and is not uncommon to see organizations experimenting with new workflows and a variety of tasks that were traditionally uncommon for cluster computing. Much of the hype attributed to big data falls into this space where, strictly speaking, one is not handling big data but there are still benefits from using tools designed for big data and big compute.

Our hope is that this book will help you understand the opportunities and limitations of cluster computing, and specifically, the opportunities and limitations from using Apache Spark with R.

1.4 R

The R computing language has its origins in the S language, created at Bell Laboratories. R was not created at Bell Labs, but its predecessor, the S computing language was. Rick Becker explained in useR 2016 (“The History of R’s Predecessor, S, from Co-Creator Rick Becker” 2016) that at that time in Bell Labs, computing was done by calling subroutines written in the Fortran language which, apparently, were not pleasant to deal with. The S computing language was designed as an interface language to solve particular problems without having to worry about other languages, such as Fortran. The creator of S, John Chambers, describes in Figure ?? how S was designed to provide an interface that simplifies data processing, this was presented during useR! 2016 as the original diagram that inspired the creation of S.

Interface language diagram by John Chambers - Rick Becker useR 2016||intro-r-diagram

Interface language diagram by John Chambers - Rick Becker useR 2016||intro-r-diagram

R is a modern and free implementation of S, specifically:

R is a programming language and free software environment for statistical computing and graphics.

The R Project for Statistical Computing

While working with data, we believe there are two strong arguments for using R:

R Language
R was designed by statisticians for statisticians, meaning, this is one of the few successful languages designed for non-programmers; so learning R will probably feel more natural. Additionally, since the R language was designed to be an interface to other tools and languages, R allows you to focus more on understanding data and less on peculiarities of computer science and engineering.
R Community
The R community provides a rich package archive provided by CRAN (The Comprehensive R Archive Network) which allows you to install ready-to-use packages to perform many tasks; most notably, high-quality data manipulation, visualization and statistical models, many of which are only available in R. In addition, the R community is a welcoming and active group of talented individuals motivated to help you succeed. Many packages provided by the R community make R, by far, the best option for statistical computing. Some of the most downloaded R packages include: dplyr to manipulate data, cluster to analyze clusters and ggplot2 to visualize data. Figure ?? quantifies the growth of the R community by plotting daily downloads of R packages in CRAN.
Daily downloads of CRAN packages||intro-cran-downloads

Daily downloads of CRAN packages||intro-cran-downloads

Aside from statistics, R is also used in many other fields. The following ones are particularly relevant to this book:

Data Science
Data science is based on knowledge and practices from statistics and computer science that turns raw data into understanding (Wickham and Grolemund 2016) by using data analysis and modeling techniques. Statistical methods provide a solid foundation to understand the world and perform predictions, while the automation provided by computing methods allows us to simplify statistical analysis and make it much more accessible. Some have advocated that statistics should be renamed data science; (Wu 1997) however, data science goes beyond statistics by also incorporating advances in computing. (Cleveland 2001) This book presents analysis and modeling techniques common in statistics, but applied to large datasets which requires incorporating advances in distributed computing.
Machine Learning
Machine learning uses practices from statistics and computer science; however, it is heavily focused on automation and prediction. For instance, the term “machine learning” was coined by Arthur Samuel while automating a computer program to play checkers. (Samuel 1959) While we could perform data science on particular games, writing a program to play checkers requires us to automate the entire process. Therefore, this falls in the realm of machine learning, not data science. Machine learning makes it possible for many users to take advantage of statistical methods without being aware of the statistical methods that are being used. One of the first important applications of machine learning was to filter spam emails; in this case, it’s just not feasible to perform data analysis and modeling over each email account; therefore, machine learning automates the entire process of finding spam and filtering it out without having to involve users at all. This book will present the methods to transition data science workflows into fully-automated machine learning methods; for instance, by providing support to build and export Spark pipelines that can be easily reused in automated environments.
Deep Learning
Deep learning builds on knowledge of statistics, data science and machine learning to define models vaguely inspired on biological nervous systems. Deep learning models evolved from neural network models after the vanishing-gradient-problem was resolved by training one layer at a time (Hinton, Osindero, and Teh 2006) and have proven useful in image and speech recognition tasks. For instance, when using voice assistants like Siri, Alexa, Cortana or Google, the model performing the audio to text conversion is most likely to be based on deep learning models. While GPUs (Graphic Processing Units) have been successfully used to train deep learning models; (Krizhevsky, Sutskever, and Hinton 2012) some datasets can not be processed in a single GPU. It is also the case that deep learning models require huge amounts of data, which needs to be preprocessed across many machines before they can be fed into a single GPU for training. This book won’t make any direct references to deep learning models; however, the methods presented in this book can be used to prepare data for deep learning and, in the years to come, using deep learning with large-scale computing will become a common practice. In fact, recent versions of Spark have already introduced execution models optimized for training deep learning in Spark.

While working in any of the previous fields, you will be faced with increasingly large datasets or increasingly complex computations that are slow to execute or at times, even impossible to process in a single computer. However, it is important to understand that Spark does not need to be the answer to all our computations problems; instead, when faced with computing challenges in R, the following techniques can be as effective:

A first approach to try is reduce the amount of data being handled, through sampling. However, data must be sampled properly by applying sound statistical principles. For instance, selecting the top results is not sufficient in sorted datasets; with simple random sampling, there might be underrepresented groups, which we could overcome with stratified sampling, which in turn adds complexity to properly select categories. It is out of the scope of this book to teach how to properly perform statistical sampling, but many resources are available on this topic.
One can try to understand why a computation is slow and make the necessary improvements. A profiler, is a tool capable of inspecting code execution to help identify bottlenecks. In R, the R profiler, the profvis R package (“Profvis” 2018) and RStudio profiler feature (“RStudio Profiler” 2018), allow you to easily to retrieve and visualize a profile; however, it’s not always trivial to optimize.
Scaling Up
Speeding up computation is usually possible by buying faster or more capable hardware, say, increasing your machine memory, hard drive or procuring a machine with many more CPUs, this approach is known as “scaling up”. However, there are usually hard limits as to how much a single computer can scale up and even with significant CPUs, one needs to find frameworks that parallelize computation efficiently.
Scaling Out
Finally, we can consider spreading computation and storage across multiple machines; this approach provides the highest degree of scalability since one can potentially use an arbitrary number of machines to perform a computation, this approach is commonly known as “scaling out”. However, spreading computation effectively across many machines is a complex endeavour, specially without using specialized tools and frameworks like Apache Spark.

This last point brings us closer to the purpose of this book, which is to bring the power of distributed computing systems provided by Apache Spark, to solve meaningful computation problems in Data Science and related fields, using R.

1.5 sparklyr

When you think of the computation power that Spark provides and the ease of use of the R language, it is natural to want them to work together – seamlessly. This is also what the R community expected, an R package that would provide an interface to Spark that was: easy to use, compatible with other R packages and, available in CRAN; with this goal, we started developing sparklyr. The first version, sparklyr 0.4, was released during the useR! 2016 conference, this first version included support for dplyr, DBI, modeling with MLlib and an extensible API that enabled extensions like H2O’s rsparkling package. Since then, many new features and improvements have been made available through sparklyr 0.5, 0.6, 0.7, 0.8, 0.9 and 1.0.

Officially, sparklyr is an R interface for Apache Spark. It’s available in CRAN and works like any other CRAN package, meaning that: it’s agnostic to Spark versions, it’s easy to install, it serves the R community, it embraces other packages and practices from the R community and so on. It’s hosted in GitHub under and licensed under Apache 2.0 which allows you to clone, modify and contribute back to this project.

While thinking of who should use sparklyr, the following roles come to mind:

  • New Users: For new users, it is our belief that sparklyr provides the easiest way to get started with Spark. Our hope is that the early chapters of this book will get you up running with ease and set you up for long term success.
  • Data Scientists: For data scientists that already use and love R, sparklyr integrates with many other R practices and packages like dplyr, magrittr, broom, DBI, tibble, rlang and many others that will make you feel at home while working with Spark. For those new to R and Spark, the combination of high-level workflows available in sparklyr and low-level extensibility mechanisms make it a productive environment to match the needs and skills of every data scientist.
  • Expert Users: For those users that are already immersed in Spark and can write code natively in Scala, consider making your Spark libraries available as an R package to the R community, a diverse and skilled community that can put your contributions to good use while moving open science forward.

We wrote this book to describe and teach the exciting overlap between Apache Spark and R. sparklyr is the R package that materializes this overlap of communities, expectations, future directions, packages, and package extensions as well. We believe there is an opportunity to use this book to bridge the R and Spark communities, to present to the R community why Spark is exciting and to the Spark community what makes R great. Both communities are solving very similar problems with a set of different skills and backgrounds; therefore, it is our hope that sparklyr can be a fertile ground for innovation, a welcoming place for newcomers, a productive place for experienced data scientists and an open community where cluster computing, data science and machine learning can come together.

1.6 Recap

This chapter presented Spark as a modern and powerful computing platform, R as an easy-to-use computing language with solid foundations in statistical methods and, sparklyr, as a project bridging both technologies and communities together. In a world where the total amount of information is growing exponentially, learning how to analyze data at scale will help you tackle the problems and opportunities humanity is facing today. However, before we start analyzing data, the Getting Started chapter will equip you with the tools you will need through the rest of this book. We recommend you follow each step carefully and take the time to install the recommended tools which, we hope will become familiar tools that you use and love.


Laudon, Kenneth C, Carol Guercio Traver, and Jane P Laudon. 1996. “Information Technology and Systems.” Cambridge, MA: Course Technology.

Ceruzzi, Paul E. 2012. Computing: A Concise History. MIT Press.

Webster, Merriam. 2006. “Merriam-Webster Online Dictionary.” Webster, Merriam.

Group, World Bank. 2016. The Data Revolution. World Bank Publications.

Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. 2003. “The Google File System.” In Proceedings of the Nineteenth Acm Symposium on Operating Systems Principles. New York, NY, USA: ACM.

Dean, Jeffrey, and Sanjay Ghemawat. 2004. “MapReduce: Simplified Data Processing on Large Clusters.” In USENIX Symposium on Operating System Design and Implementation (Osdi).

Zaharia, Matei, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica. 2010. “Spark: Cluster Computing with Working Sets.” HotCloud 10 (10-10): 95.

“Sort Benchmark.” 2014.

“Apache Spark Officially Sets a New Record in Large-Scale Sorting.” 2014.

“Spark Wins Cloudsort Benchmark as the Most Efficient Engine.” 2016.

“Netflix at Spark.” 2018.

“Bioinformatics Applications on Apache Spark.” 2018.

“Apache Spark and Cern Open Data Analysis, an Example.” 2017.

“Big Data Wikipedia.” 2019.

“Big Compute Nimbix.” 2019.

“Big Compute Vs Big Data.” 2013.

“The History of R’s Predecessor, S, from Co-Creator Rick Becker.” 2016.

Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, Inc.

Wu, C.F. Jeff. 1997. “Statistics = Data Science?”

Cleveland, William S. 2001. “Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics?”

Samuel, Arthur L. 1959. “Some Studies in Machine Learning Using the Game of Checkers.” IBM Journal of Research and Development 3 (3): 210–29.

Hinton, Geoffrey E, Simon Osindero, and Yee-Whye Teh. 2006. “A Fast Learning Algorithm for Deep Belief Nets.” Neural Computation 18 (7): 1527–54.

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton. 2012. “Imagenet Classification with Deep Convolutional Neural Networks.” In Advances in Neural Information Processing Systems, 1097–1105.

“Profvis.” 2018.

“RStudio Profiler.” 2018.