Preface

In a world where information is growing exponentially, leading tools like Apache Spark help solve many of the relevant problems we face today. From companies looking for ways to improve based on data-driven decisions, to research organizations solving problems in healthcare, finance, education, energy, and so on, Spark enables analyzing much more information, faster and more reliably, than ever before.

Various books have been written for learning Apache Spark; for instance, “Spark: The Definitive Guide: Big Data Processing Made Simple” (Chambers and Zaharia 2018) is a comprehensive resource, while “Learning Spark: Lightning-Fast Big Data Analysis” (Karau et al. 2015) is an introductory book meant to help users get up and running. However, as of this writing, there is no book for learning Apache Spark using the R computing language, nor a book specifically designed for the R user or the aspiring R user.

There are some resources online to learn Apache Spark with R, most notably the spark.rstudio.com site and the Spark documentation site under spark.apache.org. Both are great online resources; however, their content is not intended to be read from start to finish, and it assumes the reader has some knowledge of Apache Spark, R, and cluster computing.

The goal of this book is to help anyone get started with Apache Spark using R. Additionally, since the R programming language was created to simplify data analysis, it is also our belief that this book provides the easiest path for anyone to learn the tools used to solve data analysis problems with Spark. The first chapters provide an introduction to help anyone get up to speed with these concepts and present the tools required to work on these problems on your own computer. We will then quickly ramp up to relevant data science topics, cluster computing, and advanced topics that should interest even the most experienced users.

Therefore, this book is intended to be a useful resource for a wide range of users, from those of you curious to learn Apache Spark to the experienced reader seeking to understand why and how to use Apache Spark from R.

This book has the following general outline:

  1. Introduction: In the first two chapters, Introduction and Getting Started, you will learn about Apache Spark, R, and the tools to perform data analysis with Spark and R.
  2. Analysis: In the Analysis chapter, you will learn how to analyze, explore, transform, and visualize data in Apache Spark with R.
  3. Modeling: In the Modeling chapter, you will learn how to create statistical models with the purpose of extracting information and predicting outcomes.
  4. Scaling: Up to this point, chapters will have focused on performing operations on your personal computer; the Clusters, Connections, Data, and Tuning chapters introduce the distributed computing techniques required to perform analysis and modeling across many machines, tackling the large-scale data and computation problems that Apache Spark was designed for.
  5. Extensions: The Extensions chapter describes optional components and extended functionality applicable to specific, yet relevant, use cases. You will learn about alternative modeling frameworks, graph processing at scale, and model deployment topics that will be relevant to many readers at some point in time.
  6. Advanced Topics: This book closes with a set of advanced chapters, Distributed R, Streaming, and Contributing, which advanced users will be most interested in. However, by the time you reach this section, these chapters won’t seem as intimidating; instead, they will be as relevant, useful, and interesting as the previous chapters.

Authors

Javier Luraschi

Javier is a software engineer with experience in technologies ranging from desktop, web, mobile, and backend development to augmented reality and deep learning applications. He previously worked for Microsoft Research and SAP and holds a double degree in mathematics and software engineering.

Kevin Kuo

Kevin is a software engineer working on open source packages for big data analytics and machine learning. He has held data science positions in a variety of industries and was a credentialed actuary. He likes mixing cocktails and studying wine.

Edgar Ruiz

Edgar has a background in deploying enterprise reporting and business intelligence solutions. He has written multiple articles and blog posts sharing insights on analytics and server infrastructure for data science. He lives with his family near Biloxi, MS.

Formatting

Tables generated from code are formatted as follows:

## # A tibble: 3 x 2
##   numbers text 
##     <dbl> <chr>
## 1       1 one  
## 2       2 two  
## 3       3 three

The dimensions of the table (number of rows and columns) are described in the first row, followed by column names in the second row and column types in the third row. There are also various subtle visual improvements provided by the tibble package that we make use of throughout this book.
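For instance, output like the table above could be produced with code along these lines (a minimal sketch using the tibble package; the exact code used in the book may differ):

library(tibble)

# Create a small tibble; printing it yields the format shown above,
# with dimensions, column names, and column types in the header rows
tibble(
  numbers = c(1, 2, 3),
  text = c("one", "two", "three")
)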

Plots will be rendered using the ggplot2 package and a custom ggplot theme available in the Appendix; however, since this book is not focused on data visualization, some examples make use of R’s plot() function, while the figures themselves were rendered using ggplot2. If you are interested in learning more about visualization in R, consider specialized books like “R Graphics Cookbook: Practical Recipes for Visualizing Data” (Chang 2012).
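To illustrate the distinction, here is a minimal sketch contrasting both approaches using R’s built-in mtcars dataset (our choice of example data here, not one from the book):

library(ggplot2)

# Base R graphics, as used in some of the book's examples
plot(mtcars$wt, mtcars$mpg)

# An equivalent scatterplot rendered with ggplot2, as used for the figures
ggplot(mtcars, aes(wt, mpg)) + geom_point()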

Acknowledgments

This project would not have been possible without the work put into building sparklyr by Javier Luraschi, Kevin Kuo, Kevin Ushey, and JJ Allaire; dplyr by Romain François and Hadley Wickham; DBI by Kirill Müller; the colleagues and friends in the R community; and the Apache Spark project itself.

References

Chambers, Bill, and Matei Zaharia. 2018. Spark: The Definitive Guide: Big Data Processing Made Simple. 1st ed. O’Reilly Media, Inc.

Chang, Winston. 2012. R Graphics Cookbook: Practical Recipes for Visualizing Data. O’Reilly Media, Inc.

Karau, Holden, Andy Konwinski, Patrick Wendell, and Matei Zaharia. 2015. Learning Spark: Lightning-Fast Big Data Analysis. O’Reilly Media, Inc.