Chapter 3 Analysis

Previous chapters introduced Spark and R and helped you get started with the tools you will need throughout this book. In this chapter, you will learn how to perform data analysis in Spark from R.

Data analysis is likely to become the most common task when working with Spark, and this chapter serves as a foundation for later chapters. The concepts covered here also apply to preparing data properly for modeling, graph processing, streaming, and other related topics that might not be strictly considered data analysis.

The sparklyr package implements the data analysis principles of R, and specifically those of a collection of R packages called the tidyverse. If you are new to R, data science, or the tidyverse, it is recommended that you read “R for Data Science” (Wickham and Grolemund 2016); this chapter can only briefly introduce the concepts that book presents.

3.1 Search for an answer

In a data analysis project, the main goal is to understand what the data is trying to “tell us”, in the hope that it provides an answer to a specific question. That question is typically posed by the stakeholders of the analysis. Most data analysis projects follow the set of steps outlined in Figure 3.1.

FIGURE 3.1: General steps of a data analysis

As the diagram illustrates, the data is first imported into our analysis system. It is then wrangled by trying different data transformations, such as aggregations, and visualized to help us perceive relationships and trends. To gain deeper insight, one or more statistical models can be fitted against sample data; this helps determine whether the patterns hold when new data is applied to them. Lastly, the results are communicated to the stakeholders.

Commonly, in R all of these steps are performed in local memory. That approach has to change when the analysis incorporates Spark. The next section introduces the main concept behind integrating R and Spark for data analysis.

3.2 R as an interface to Spark

For data analysis, the ideal approach is to let Spark do what it is good at: being a parallel computation engine that works at large scale. Spark goes beyond offering generic calculations; out of the box, it includes libraries that can do much of what analysts usually do in R, but for large amounts of data. Figure 3.2 summarizes the four main capabilities available to data analysts in Spark.

FIGURE 3.2: Spark capabilities

Thanks to Spark’s libraries, most of the steps of a data analysis project can be completed inside Spark; for example, selecting, transforming, and modeling can all be done there. The idea is to use R to tell Spark which data operations to run, and to bring back into R only the results of those operations.

FIGURE 3.3: R as an interface for Spark

The sparklyr package focuses on implementing the principle mentioned in the previous section: most of its functions are wrappers on top of Spark API calls.

The idea is to take advantage of Spark’s analysis components instead of R’s. For example, if the analyst needs to fit a linear regression model to data available via Spark, instead of using the familiar lm() function, the analyst would use the ml_linear_regression() function. The R function calls Scala code, which in turn invokes the corresponding model in Spark’s machine learning API.

FIGURE 3.4: R functions call Scala functions
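As a minimal sketch of the difference, assuming a live Spark connection and the cars table reference created later in this chapter (the formula is only illustrative):

# Local R: fit a linear regression with the familiar lm() function
lm(mpg ~ wt + cyl, data = mtcars)

# Spark via sparklyr: the same model specification, but the computation
# runs inside the Spark session through ml_linear_regression()
cars %>%
  ml_linear_regression(mpg ~ wt + cyl)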

For more common data manipulation tasks, sparklyr provides a back-end for dplyr. This means that familiar dplyr verbs can be used in R, and sparklyr and dplyr will translate those actions into Spark SQL statements, as shown in Figure 3.5.

FIGURE 3.5: dplyr writes SQL

An example of a practical implementation of this concept will be covered in the following sections.

3.3 Exercise

To practice as you learn, the rest of this chapter’s code uses a single exercise that runs against a local Spark master, which allows the code to be replicated on your laptop. Please make sure you already have sparklyr and a local copy of Spark installed. Spark can be installed using the utility that comes with the package; for more information, see the Local section of the Connections chapter.
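For example, a minimal sketch of installing a local copy of Spark with sparklyr’s built-in utility (the version number is only illustrative; use one supported by your sparklyr release):

library(sparklyr)

# Download and install a local copy of Spark
spark_install(version = "2.3")

# List the Spark versions already installed locally
spark_installed_versions()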

First, load the sparklyr package and open a new local connection.

library(sparklyr)
sc <- spark_connect(master = "local")

The environment is ready to be used. The next step is to add data that we can analyze.

3.4 Import / Access

The step of importing data needs to be approached differently when using Spark with R, as opposed to using R alone. When R is used alone, importing data means that R reads files and loads the information into memory. When R is used with Spark, however, it is important to focus on importing only results into R; the data itself is either imported or accessed by Spark. This way, the actual analysis takes place inside the Spark session.

FIGURE 3.6: Import Data to Spark not R

An example of accessing versus importing is found in an enterprise environment. Most likely, Spark sessions will be created on top of Hadoop clusters, so the data will already be available to Spark, either through a Hive table or through the Hadoop Distributed File System (HDFS).

The decision to have Spark either access the data source directly or import the data into memory is mostly a decision about speed and performance; it is covered in the Tuning chapter.
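As a hedged sketch of both approaches (the table name and file path below are hypothetical): an existing Hive table can be referenced without importing it, and a file in HDFS can be either mapped or loaded into Spark memory through the memory argument.

library(sparklyr)
library(dplyr)

# Reference an existing Hive table without importing it into R
flights_hive <- tbl(sc, "flights")

# Read a CSV file from HDFS; memory = FALSE maps the file
# instead of loading it eagerly into Spark memory
flights_csv <- spark_read_csv(
  sc,
  name = "flights_csv",
  path = "hdfs://path/to/flights.csv",  # hypothetical path
  memory = FALSE
)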

The exercise’s Spark session does not contain any data yet, so the next step is to prime the session with data, in this case mtcars. The copy_to() command from dplyr can be used for that; the mechanics of this operation are explained in the Getting Started chapter.

library(dplyr)
cars <- copy_to(sc, mtcars, "mtcars_remote")

Note: In an enterprise setting, copy_to() should only be used to transfer small tables from R, such as a lookup table.
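For instance, a minimal sketch of that pattern: copy a small lookup table built in R (the labels below are made up for illustration) and join it to the larger Spark table.

# A small lookup table created locally in R (values are illustrative)
cyl_labels <- data.frame(
  cyl = c(4, 6, 8),
  cyl_label = c("four cylinders", "six cylinders", "eight cylinders")
)

# Copy the lookup table to Spark and join it to the larger table
cyl_lookup <- copy_to(sc, cyl_labels, "cyl_lookup")

cars %>%
  left_join(cyl_lookup, by = "cyl")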

The data is now accessible to both Spark and R, and transformations can be applied to it. The next section covers how to wrangle data by running transformations inside Spark.

3.5 Wrangle

Wrangling data involves cleaning the data and then exploring it. The idea is to take the original data and apply transformations to it. A data transformation can be understood as any change we perform on the data in order to understand it. An example of a transformation is an aggregation: the result of aggregating a variable is structurally different from the original data set. The initial data set has 32 rows and 11 columns, while the result of the aggregation will be a single row and a single column.

The main goal is to write the data transformations using R syntax as much as possible. This saves us the cognitive cost of switching between multiple computer languages to accomplish a single task. In this case, it is better to take advantage of sparklyr’s dplyr back-end interface for data exploration, instead of writing Spark SQL statements.

In the R environment, cars can be treated as if it were a local data frame, so dplyr verbs can be used, including in a piped fashion.

cars %>%
  group_by(am) %>%
  summarise(mpg_mean = mean(mpg, na.rm = TRUE))
## # Source: spark<?> [?? x 2]
##      am mpg_mean
##   <dbl>    <dbl>
## 1     0     17.1
## 2     1     24.4

Instead of importing the mtcars_remote data set into R and then performing the aggregation, dplyr converts the verbs into SQL statements that are then sent to Spark. The show_query() command makes it possible to peer into the SQL statement that sparklyr and dplyr created and sent to Spark.

cars %>%
  group_by(am) %>%
  summarise(mpg_mean = mean(mpg, na.rm = TRUE)) %>%
  show_query()
## <SQL>
## SELECT `am`, AVG(`mpg`) AS `mpg_mean`
## FROM `mtcars_remote`
## GROUP BY `am`

Of course, it is not necessary to view the resulting query every time dplyr verbs are used. The focus can remain on obtaining insights from the data, rather than on figuring out how to express a given set of transformations in SQL.

cars %>%
  group_by(am) %>%
  summarise(
    wt_mean = mean(wt, na.rm = TRUE),
    mpg_mean = mean(mpg, na.rm = TRUE)
  )
## # Source: spark<?> [?? x 3]
##      am wt_mean mpg_mean
##   <dbl>   <dbl>    <dbl>
## 1     0    3.77     17.1
## 2     1    2.41     24.4

Most of the data transformations made available by dplyr for local data frames are also available for use with a Spark connection. This means that learning dplyr in general is a good way to gain proficiency with data exploration and preparation in Spark; the Data Transformation chapter of R for Data Science (Wickham and Grolemund 2016) should be a great help with this. If proficiency with dplyr is not an issue for you, please take some time to experiment with different dplyr functions against the cars table.
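As a starting point for that experimentation, here is a small sketch combining a few more dplyr verbs, all of which are translated to Spark SQL (the derived column name is made up):

# Filter, derive a new column, select, and sort -- all executed inside Spark
cars %>%
  filter(hp > 100) %>%
  mutate(power_to_weight = hp / wt) %>%
  select(mpg, cyl, power_to_weight) %>%
  arrange(desc(power_to_weight)) %>%
  head(5)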

3.5.1 Correlations

A very common exploration technique is to calculate and visualize correlations. The Spark API provides a built-in function, exposed in sparklyr as ml_corr(), that calculates correlations across the entire data set and returns the results to R as a data.frame object.

ml_corr(cars) %>%
  as_tibble()
## # A tibble: 11 x 11
##       mpg    cyl   disp     hp    drat     wt    qsec
##     <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>   <dbl>
##  1  1     -0.852 -0.848 -0.776  0.681  -0.868  0.419 
##  2 -0.852  1      0.902  0.832 -0.700   0.782 -0.591 
##  3 -0.848  0.902  1      0.791 -0.710   0.888 -0.434 
##  4 -0.776  0.832  0.791  1     -0.449   0.659 -0.708 
##  5  0.681 -0.700 -0.710 -0.449  1      -0.712  0.0912
##  6 -0.868  0.782  0.888  0.659 -0.712   1     -0.175 
##  7  0.419 -0.591 -0.434 -0.708  0.0912 -0.175  1     
##  8  0.664 -0.811 -0.710 -0.723  0.440  -0.555  0.745 
##  9  0.600 -0.523 -0.591 -0.243  0.713  -0.692 -0.230 
## 10  0.480 -0.493 -0.556 -0.126  0.700  -0.583 -0.213 
## 11 -0.551  0.527  0.395  0.750 -0.0908  0.428 -0.656 
## # ... with 4 more variables: vs <dbl>, am <dbl>,
## #   gear <dbl>, carb <dbl>

The corrr R package specializes in correlations. It contains friendly functions to prepare and visualize these results. The package includes a back-end for sparklyr table objects, so it does not return an error when a non-local table is passed to it. In the background, the correlate() function runs ml_corr(), so there is no need to collect any data into R prior to running the command.

library(corrr)

cars %>%
  correlate(use = "pairwise.complete.obs", method = "pearson") 
## # A tibble: 11 x 12
##    rowname     mpg     cyl    disp      hp     drat      wt
##    <chr>     <dbl>   <dbl>   <dbl>   <dbl>    <dbl>   <dbl>
##  1 mpg      NA      -0.852  -0.848  -0.776   0.681   -0.868
##  2 cyl      -0.852  NA       0.902   0.832  -0.700    0.782
##  3 disp     -0.848   0.902  NA       0.791  -0.710    0.888
##  4 hp       -0.776   0.832   0.791  NA      -0.449    0.659
##  5 drat      0.681  -0.700  -0.710  -0.449  NA       -0.712
##  6 wt       -0.868   0.782   0.888   0.659  -0.712   NA    
##  7 qsec      0.419  -0.591  -0.434  -0.708   0.0912  -0.175
##  8 vs        0.664  -0.811  -0.710  -0.723   0.440   -0.555
##  9 am        0.600  -0.523  -0.591  -0.243   0.713   -0.692
## 10 gear      0.480  -0.493  -0.556  -0.126   0.700   -0.583
## 11 carb     -0.551   0.527   0.395   0.750  -0.0908   0.428
## # ... with 5 more variables: qsec <dbl>, vs <dbl>,
## #   am <dbl>, gear <dbl>, carb <dbl>

The correlate() function returns a local R object that corrr recognizes, which makes it easy to apply further functions to the results. In this case, the shave() command turns all of the duplicated results into NAs.

cars %>%
  correlate(use = "pairwise.complete.obs", method = "pearson") %>%
  shave()
## # A tibble: 11 x 12
##    rowname     mpg     cyl    disp      hp     drat      wt
##    <chr>     <dbl>   <dbl>   <dbl>   <dbl>    <dbl>   <dbl>
##  1 mpg      NA      NA      NA      NA      NA       NA    
##  2 cyl      -0.852  NA      NA      NA      NA       NA    
##  3 disp     -0.848   0.902  NA      NA      NA       NA    
##  4 hp       -0.776   0.832   0.791  NA      NA       NA    
##  5 drat      0.681  -0.700  -0.710  -0.449  NA       NA    
##  6 wt       -0.868   0.782   0.888   0.659  -0.712   NA    
##  7 qsec      0.419  -0.591  -0.434  -0.708   0.0912  -0.175
##  8 vs        0.664  -0.811  -0.710  -0.723   0.440   -0.555
##  9 am        0.600  -0.523  -0.591  -0.243   0.713   -0.692
## 10 gear      0.480  -0.493  -0.556  -0.126   0.700   -0.583
## 11 carb     -0.551   0.527   0.395   0.750  -0.0908   0.428
## # ... with 5 more variables: qsec <dbl>, vs <dbl>,
## #   am <dbl>, gear <dbl>, carb <dbl>

Finally, the results can be easily visualized using rplot(). This function returns a ggplot object.

cars %>%
  correlate(use = "pairwise.complete.obs", method = "pearson") %>%
  shave() %>%
  rplot()

FIGURE 3.7: Using rplot() to visualize correlations

It is much easier to see which relationships are positive or negative, as well as which are significant. The power of visualizing data lies in how much easier it makes it for us to understand results. The next section expands on this step of the process.

3.6 Visualize

Visualizations are fundamentally a human task. They are a vital tool that helps us find patterns in data. For example, it is easier to identify outliers in a data set of 1,000 observations when they are plotted in a graph than when they are read from a list.

R is great at data visualization, and its capabilities for creating plots are extended by the many R packages that focus on this analysis step. Unfortunately, the vast majority of R functions that create plots depend on the data already being in local memory within R, so they fail when used against a remote table inside Spark.

It is still possible to create visualizations in R from data sources in Spark. To understand how, let’s first break down how computer programs build plots:

FIGURE 3.8: Stages of a plot

For example, to create a bar plot in R, we simply call a function:

library(ggplot2)

ggplot(aes(as.factor(cyl), mpg), data = mtcars) + geom_col()

FIGURE 3.9: Plotting inside R

In this case, the raw mtcars data was automatically transformed into three discrete aggregated numbers, each result was then mapped onto an x/y plane, and the plot was drawn; see Figure 3.8. As R users, all of the stages of building the plot are conveniently abstracted away for us.

FIGURE 3.10: R plotting function

3.6.2 Simple plots

There are a couple of key steps to codifying the “transform remotely, plot locally” approach. The first is to ensure that the transformation operations happen inside Spark: in the example below, group_by() and summarise() run as SQL inside the Spark session. The second is to bring the results back into R only after the data has been transformed. Make sure to transform and then collect, in that order; if collect() is run first, R will try to ingest the entire data set from Spark. Depending on the size of the data, collecting all of it will slow down, or may even bring down, your system.

car_group <- cars %>%
  group_by(cyl) %>%
  summarise(mpg = sum(mpg, na.rm = TRUE)) %>%
  collect()

car_group
## # A tibble: 3 x 2
##     cyl   mpg
##   <dbl> <dbl>
## 1     6  138.
## 2     4  293.
## 3     8  211.

In the example, now that the data has been pre-aggregated and collected into R, only three records are passed to the plotting function.

ggplot(aes(as.factor(cyl), mpg), data = car_group) + geom_col()

FIGURE 3.12: Plot from Spark

Thanks to the consistency among the tidyverse packages, the entire operation can be written in a single piped code segment:

cars %>%
  group_by(cyl) %>%
  summarise(mpg = sum(mpg, na.rm = TRUE)) %>%
  collect() %>%
  ggplot(aes(as.factor(cyl), mpg)) + 
  geom_col()

FIGURE 3.13: Plot from Spark

Using this approach, most visualizations can be easily produced.
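For example, a line chart of the average miles per gallon by number of gears follows the same transform-then-collect pattern; this is just a sketch of the approach described above:

cars %>%
  group_by(gear) %>%
  summarise(mpg_mean = mean(mpg, na.rm = TRUE)) %>%
  collect() %>%
  ggplot(aes(gear, mpg_mean)) +
  geom_line() +
  geom_point()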

3.6.3 Histograms

There are plots that are both useful and commonly used, but whose calculations are not easily pushed to a remote data source. For example, producing a histogram over an entire large data set has typically not been feasible, because the data first needed to be imported into R.

The code below breaks down the creation of bins 3 mpg wide into a combination of basic arithmetic and aggregation functions that can easily run inside a mutate() command. It assigns each value to a discrete bin of width 3 and uses the bin’s minimum value as its “label”.

mtcars %>%
  mutate(
    fx    = floor((mpg - min(mpg, na.rm = TRUE))/3),
    min_x = min(mpg, na.rm = TRUE),
    max_x = max(mpg, na.rm = TRUE),
    bins  = (3 * ifelse(fx == (max_x - min_x)/3, fx - 1, fx)) + min_x
  ) %>%
  select(mpg, bins) %>%
  head()
##    mpg bins
## 1 21.0 19.4
## 2 21.0 19.4
## 3 22.8 22.4
## 4 21.4 19.4
## 5 18.7 16.4
## 6 18.1 16.4

The same R code can run inside Spark; dplyr will translate it into a SQL statement. The results can then be collected into R for visualizing.

bins <- cars %>%
  mutate(
    fx    = floor((mpg - min(mpg, na.rm = TRUE))/3),
    min_x = min(mpg, na.rm = TRUE),
    max_x = max(mpg, na.rm = TRUE),
    bins  = (3 * ifelse(fx == (max_x - min_x)/3, fx - 1, fx)) + min_x
  ) %>%
  count(bins) %>%
  collect()

bins
## # A tibble: 7 x 2
##    bins     n
##   <dbl> <dbl>
## 1  19.4     6
## 2  22.4     3
## 3  13.4     8
## 4  10.4     3
## 5  16.4     6
## 6  28.4     4
## 7  25.4     2

Since the bins and counts have been pre-calculated, a simple column plot is used.

bins %>%
  ggplot() +
  geom_col(aes(bins, n))

FIGURE 3.14: Plot from Spark

3.6.3.1 Using dbplot

The dbplot package provides helper functions for plotting with remote data. The package uses dplyr to push the calculations to Spark and then collects the results.

The dbplot_histogram() function has Spark calculate the bins and the count per bin, and then outputs a ggplot object. It accepts a binwidth argument.

library(dbplot)

cars %>%
  dbplot_histogram(mpg, binwidth = 3)

FIGURE 3.15: Plot from Spark

Because it is a ggplot object, it can be further refined if needed.

cars %>%
  dbplot_histogram(mpg, binwidth = 3) +
  labs(title = "Histogram of Miles Per Gallon")

FIGURE 3.16: Plot from Spark

The package also provides a way to obtain the raw results via the db_compute_bins() function.

cars %>%
  db_compute_bins(mpg, binwidth = 3) 
## # A tibble: 7 x 2
##     mpg count
##   <dbl> <dbl>
## 1  19.4     6
## 2  22.4     3
## 3  13.4     8
## 4  10.4     3
## 5  16.4     6
## 6  28.4     4
## 7  25.4     2

3.6.4 Scatter plots

Scatter plots are incredibly useful because of their many applications. Very commonly, they are used to examine the relationship between two continuous variables. Two problems arise when trying to use this visualization with a large amount of data:

  • Performance problems - Too many single dots have to be calculated and drawn.

  • Perception problems - It becomes hard to see the true number of dots in a single area.

No amount of “pushing the computation” to Spark will help with this problem, because the data still has to be plotted as individual dots.

The best alternative is to find a plot type that represents the x/y relationship and the concentration of points in a way that is easy to perceive and to “physically” plot. A raster plot may be the best answer: it returns a grid of x/y positions, with the result of a given aggregation usually represented by the color of each square.

The dbplot package provides functions that help with this kind of plot. The first is dbplot_raster(), which is the quickest way to produce this kind of visualization.

cars %>%
  dbplot_raster(mpg, wt, resolution = 5)

FIGURE 3.17: Plot from Spark

As shown in Figure 3.17, the plot returns a grid no bigger than 5 x 5. This limits the number of records that need to be collected into R to 25.

If the user prefers to visualize the data in a different way, or simply to obtain the underlying results, the db_compute_raster() and db_compute_raster2() functions return the aggregated data. The db_compute_raster2() function also includes the upper and lower bounds of each square.

cars %>%
  db_compute_raster2(mpg, wt, resolution = 5)
## # A tibble: 9 x 5
##     mpg    wt `n()` mpg_2  wt_2
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  19.8  2.30     5  24.5  3.08
## 2  19.8  3.08     3  24.5  3.86
## 3  15.1  3.08    10  19.8  3.86
## 4  10.4  3.08     3  15.1  3.86
## 5  15.1  3.86     1  19.8  4.64
## 6  10.4  4.64     3  15.1  5.42
## 7  29.2  1.51     4  33.9  2.30
## 8  24.5  1.51     2  29.2  2.30
## 9  15.1  2.30     1  19.8  3.08

3.7 Model

An analysis project iterates through as many transformations and models as needed to find the answer, so the ideal data analysis framework enables the user to complete each iteration quickly and easily. The transition from an analysis project to a deployment project occurs after a model is selected and the findings are presented to the stakeholders. The Modeling chapter dives deeper into how to prepare and run models; the focus of this section is how to transition properly and easily from wrangling to modeling.

3.7.1 Models during analysis

There are several steps needed for Spark to run even the simplest models. The modeling functions in sparklyr already cover those steps, making them easier to use. Transitioning from data wrangling to model prototyping is as easy as piping the transformed data right into a sparklyr modeling function. Consider the following example:

cars %>% 
  mutate(cyl = paste0("cyl_", cyl)) %>%
  ml_linear_regression(wt ~ .) %>%
  summary()
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.36805 -0.18275 -0.01221  0.11616  0.56768 
## 
## Coefficients:
##  (Intercept)          mpg  cyl_cyl_8.0  cyl_cyl_4.0         disp 
## -0.241536145 -0.046030695  0.085662905  0.193862495  0.006621045 
##           hp         drat         qsec           vs           am 
## -0.004654348 -0.126435269  0.189135280  0.017628206  0.037101380 
##         gear         carb 
## -0.081473252  0.281287273 

The cyl field is converted into a character variable, and then the ml_linear_regression() function is applied to the resulting data set. This is a very straightforward thing to do in R, but not in Spark: firstly, Spark does not run models without a Spark PipelineModel; secondly, it does not create dummy variables by default. The ml_linear_regression() function already encapsulates all of the Scala code needed to complete those steps and run the pipeline model, which is then returned to R looking like a standard fitted model.

At this point it is very easy to experiment with different formulas, as shown in the code below:

cars %>%
  ml_linear_regression(wt ~ mpg) %>%
  summary()
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6516 -0.3490 -0.1381  0.3190  1.3684 
## 
## Coefficients:
## (Intercept)         mpg 
##    6.047255   -0.140862 
## 
## R-Squared: 0.7528
## Root Mean Squared Error: 0.4788

It is also very easy to try out other kinds of models:

cars %>%
  ml_generalized_linear_regression(am ~ .) %>%
  summary()
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46909 -0.16762 -0.00578  0.18601  0.35635 
## 
## Coefficients:
##   (Intercept)           mpg           cyl          disp            hp          drat
##  1.6345144901  0.0264792061 -0.1207285158 -0.0005715857  0.0010378347  0.0952244996
##            wt          qsec            vs          gear          carb 
##  0.0172804532 -0.1125888981 -0.2087047973  0.1952457664 -0.0180502642 
## 
## (Dispersion parameter for gaussian family taken to be 0.0737941)
## 
##    Null  deviance: 7.71875 on 31 degrees of freedom
## Residual deviance: 1.54968 on 21 degrees of freedom
## AIC: 17.926

3.7.2 Cache model data

The examples in this chapter are built using a very small data set. In real-life scenarios, with large amounts of data, running models on data that first needs to be transformed can take a heavy toll on the Spark session if multiple experiments are being run. That is why it is a good idea to save the results of the transformations as a new table in Spark memory before running models.

The compute() command takes the result of a piped set of dplyr commands and saves it to Spark memory.

cached_cars <- cars %>% 
  mutate(cyl = paste0("cyl_", cyl)) %>%
  compute("cached_cars")
cached_cars %>%
  ml_linear_regression(mpg ~ .) %>%
  summary()
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.47339 -1.37936 -0.06554  1.05105  4.39057 
## 
## Coefficients:
## (Intercept) cyl_cyl_8.0 cyl_cyl_4.0        disp          hp        drat
## 16.15953652  3.29774653  1.66030673  0.01391241 -0.04612835  0.02635025
##           wt        qsec          vs          am       gear        carb 
##  -3.80624757  0.64695710  1.74738689  2.61726546 0.76402917  0.50935118  
## 
## R-Squared: 0.8816
## Root Mean Squared Error: 2.041

It is common to go back to visualizing and wrangling after running one or several models, because more questions are raised as more insights are gained from the data. After multiple iterations, one or several transformations, visualizations, and models are chosen to be shared with the stakeholders. The best way of doing this is the subject of the next section.

3.8 Communicate

Clearly communicating the results of the analysis to the stakeholders is as important as the analysis work itself. If the stakeholders are not clear on what you found, there will be no clear call to action on their part. And if they ask how you reached your conclusions and the process is not clear, they will not trust what you found.

The communication artifacts covered here are those that R can create and that are common outputs of an analysis: reports and presentation decks. Beyond the type of artifact, another thing to consider is the purpose of the report. As alluded to at the outset of this section, the purposes can be divided into showing what we found and showing how we found it. The rmarkdown package plays a central role in this. The package is incredibly versatile and robust, which makes it worth spending the time to learn. This section covers how to use rmarkdown as the basis of effective and reproducible reports and presentations.

3.8.1 Analysis versus Production work

The main “deliverable” of a data analysis project is the answer to a question, encased in one or several data transformations, visualizations, or models. Putting those transformations, visualizations, or models into an automated process that delivers continuous insights should be considered a Production project, and analysis and Production should be treated as separate projects. Specific to R, a Production project might involve creating a shiny app that people in the organization depend on and use on a daily basis. Another example of a Production artifact is a scheduled R script that reads the data store and updates it with new scores.

From the Information Technology perspective, the nature of an analysis project is experimental: the resulting code does not need to be tested or promoted as-is to a Production system. In contrast, a Production project needs to pass the usual Software Development Lifecycle (SDLC) requirements before it is ready to be placed among the supported reports or applications within the organization.

FIGURE 3.18: Analysis vs Production projects

3.8.2 Using R Markdown documents

R Markdown documents allow you to weave narrative text and code together. The number of output formats provides a very compelling reason to learn and use R Markdown: the available outputs include HTML, PDF, PowerPoint, MS Word, presentation slides, websites, and books. Most of these outputs are available in the core R Markdown packages: knitr and rmarkdown. Other output formats are made available by companion R packages; for example, the xaringan package adds the ability to create great-looking presentation decks. This book was written in R Markdown thanks to an extension provided by the bookdown package. The best resource for delving deeper into R Markdown is the official book (Xie 2018).

For purposes of the book, and as an introduction to R Markdown, this section will review three output formats:

  • R Notebook

  • R Markdown document

  • xaringan presentation

The following figure shows where in the analysis process you should consider using each of them.

FIGURE 3.19: R Markdown output formats

3.8.3 Reporting results

Creating a new R Markdown document is easy. You need a YAML header, called Front Matter, at the top. After that, sections of code, called code chunks, can be interlaced with the narrative. The following example shows how easy it is to create a fully reproducible report; the narrative, the code, and, most importantly, the output of the code are all recorded inside the resulting HTML file. You can copy and paste the following code into a file, save the file with an .Rmd extension, and choose whatever name you would like.

---
title: "mtcars analysis"
output: html_document
---
```{r, setup, include = FALSE}
library(sparklyr)
conf <- spark_config()
conf$`sparklyr.cores.local` <- 4
conf$`sparklyr.shell.driver-memory` <- "1G"
conf$spark.memory.fraction <- 0.9
```

## Setup Spark environment
A local Spark environment is set up and the `mtcars` data set preloaded
```{r}
sc <- spark_connect(master = "local", config = conf)
cars <- copy_to(sc, mtcars, "mtcars_remote")
cars
```

## Visualize
A histogram was run over the `mpg` data
```{r}
library(dbplot)
cars %>%
  dbplot_histogram(mpg, binwidth = 3)
```

## Model
The selected model was a simple linear regression that uses miles per gallon as the predictor of vehicle weight
```{r}
cars %>%
  ml_linear_regression(wt ~ mpg) %>%
  summary()
```
```{r, include = FALSE}
spark_disconnect(sc)
```

If using the RStudio IDE, click on the Knit button at the top of the document to run it. The HTML output should look like this:


FIGURE 3.20: R Markdown HTML output

This report can now be sent to stakeholders. The stakeholders will not need Spark or even R to be able to read the report. All of the needed output from your Spark session was captured in the HTML document.

3.8.4 Presentation decks

Switching from HTML output to PowerPoint output is as easy as changing a single option: in the top front matter, change the output option to powerpoint_presentation.

output: powerpoint_presentation

The result will be a PowerPoint presentation containing all of the same information displayed in the HTML report. There may be a need to edit the PowerPoint template or the output of the code chunks, but this minimal example shows how easy it is to go from one format to another.
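For reference, a minimal version of the report’s front matter would then read:

---
title: "mtcars analysis"
output: powerpoint_presentation
---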

Another option is to use the xaringan package to create the presentation. It creates a self-contained HTML presentation, which may be a better option when not everyone in the organization has PowerPoint. Converting the original R Markdown report into a xaringan deck is easy: a triple dash needs to be added to indicate where a new slide begins. In the example, the triple dashes are inserted above each new major header.

---
title: "mtcars analysis"
output: xaringan::moon_reader
---
```{r, setup, include = FALSE}
library(sparklyr)
conf <- spark_config()
conf$`sparklyr.cores.local` <- 4
conf$`sparklyr.shell.driver-memory` <- "1G"
conf$spark.memory.fraction <- 0.9
```

## Setup Spark environment
A local Spark environment is set up and the `mtcars` data set preloaded
```{r}
sc <- spark_connect(master = "local", config = conf)
cars <- copy_to(sc, mtcars, "mtcars_remote")
cars
```
---
## Visualize
A histogram was run over the `mpg` data
```{r, fig.height = 6}
library(dbplot)
cars %>%
  dbplot_histogram(mpg, binwidth = 3)
```
```{r, include = FALSE}
spark_disconnect(sc)
```

Here is what the first full slide should look like:


FIGURE 3.21: R Markdown HTML output

3.9 Recap

R and Spark are a very powerful combination: a powerful computing platform, together with an incredibly robust ecosystem of packages, makes for an ideal analysis platform. Keep in mind to push computation to Spark and to focus on collecting only results into R. Those results can then be used for further data manipulation or plotting, and shared with stakeholders in a variety of output formats. As a learner of R, hopefully this chapter also encouraged you to learn more about the tidyverse as well as rmarkdown.

References

Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, Inc.

Xie, Yihui, J.J. Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. 1st ed. CRC Press.