Chapter 14 Appendix

14.1 Prerequisites

14.1.1 Installing R

From r-project.org, download and launch the R installer for your platform, Windows, Macs or Linux available.

The R Project for Statistical Computing

FIGURE 14.1: The R Project for Statistical Computing

14.1.2 Installing Java

From java.com/download, download and launch the installer for your platform, Windows, Macs or Linux are also available.

Java Download Page

FIGURE 14.2: Java Download Page

Starting with Spark 2.1, Java 8 is required; however, previous versions of Spark support Java 7. Regardless, we recommend installing Java 8. Java 9+ is currently unsupported so you will need to downgrade to Java 8 by uninstalling Java9+ first.

14.1.3 Installing RStudio

While installing RStudio is not strictly required to work with Spark with R, it will make you much more productive and therefore, I would recommend you take the time to install RStudio from rstudio.com/download, then download and launch the installer for your platform: Windows, Macs or Linux.

RStudio Downloads Page

FIGURE 14.3: RStudio Downloads Page

After launching RStudio, you can use RStudio’s console panel to execute the code provided in this chapter.

14.1.4 Using RStudio

If you are not familiar with RStudio, you should make note of the following panes:

  • Console: A standalone R console you can use to execute all the code presented in this book.
  • Packages: This pane allows you to install sparklyr with ease, check its version, navigate to the help contents, etc.
  • Connections: This pane allows you to connecto to Spark, manage your active connection and view the available datasets.
RStudio Overview

FIGURE 14.4: RStudio Overview

14.2 Diagrams

14.2.1 Worlds Store Capacity

library(tidyverse)
read_csv("data/01-worlds-capacity-to-store-information.csv", skip = 8) %>%
  gather(key = storage, value = capacity, analog, digital) %>%
  mutate(year = X1, terabytes = capacity / 1e+12) %>%
  ggplot(aes(x = year, y = terabytes, group = storage)) +
    geom_line(aes(linetype = storage)) +
    geom_point(aes(shape = storage)) +
    scale_y_log10(
      breaks = scales::trans_breaks("log10", function(x) 10^x),
      labels = scales::trans_format("log10", scales::math_format(10^x))
    ) +
    theme_light() +
    theme(legend.position = "bottom")

14.2.2 Daily downloads of CRAN packages

downloads_csv <- "data/01-intro-r-cran-downloads.csv"
if (!file.exists(downloads_csv)) {
  downloads <- cranlogs::cran_downloads(from = "2014-01-01", to = "2019-01-01")
  readr::write_csv(downloads, downloads_csv)
}

cran_downloads <- readr::read_csv(downloads_csv)

ggplot(cran_downloads, aes(date, count)) + 
  geom_point(colour="black", pch = 21, size = 1) +
  scale_x_date() +
  xlab("") +
  ylab("") +
  theme_light()

14.3 Formatting

The following ggplot2 theme was use to format plots in this book:

plot_style <- function() {
  font <- "Helvetica"

  ggplot2::theme_classic() +
  ggplot2::theme(
    plot.title = ggplot2::element_text(family = font,
                                       size=14,
                                       color = "#222222"),
    plot.subtitle = ggplot2::element_text(family=font,
                                          size=12,
                                          color = "#666666"),

    legend.position = "right",
    legend.background = ggplot2::element_blank(),
    legend.title = ggplot2::element_blank(),
    legend.key = ggplot2::element_blank(),
    legend.text = ggplot2::element_text(family=font,
                                        size=14,
                                        color="#222222"),

    axis.title.y = ggplot2::element_text(margin = ggplot2::margin(t = 0, r = 8, b = 0, l = 0),
                                size = 14,
                                color="#666666"),
    axis.title.x = ggplot2::element_text(margin = ggplot2::margin(t = -2, r = 0, b = 0, l = 0),
                                size = 14,
                                color = "#666666"),
    axis.text = ggplot2::element_text(family=font,
                                      size=14,
                                      color="#222222"),
    axis.text.x = ggplot2::element_text(margin = ggplot2::margin(5, b = 10)),
    axis.ticks = ggplot2::element_blank(),
    axis.line = ggplot2::element_blank(),

    panel.grid.minor = ggplot2::element_blank(),
    panel.grid.major.y = ggplot2::element_line(color = "#eeeeee"),
    panel.grid.major.x = ggplot2::element_line(color = "#ebebeb"),

    panel.background = ggplot2::element_blank(),

    strip.background = ggplot2::element_rect(fill = "white"),
    strip.text = ggplot2::element_text(size  = 20,  hjust = 0)
  )
}

Which you can then active with:

ggplot2::theme_set(plot_style())

14.4 List of ML Functions

The following table exhibits the ML algorithms supported in sparklyr:

14.4.1 Classification

Algorithm Function
Decision Trees ml_decision_tree_classifier()
Gradient-Boosted Trees ml_gbt_classifier()
Linear Support Vector Machines ml_linear_svc()
Logistic Regression ml_logistic_regression()
Multilayer Perceptron ml_multilayer_perceptron_classifier()
Naive-Bayes ml_naive_bayes()
One vs Rest ml_one_vs_rest()
Random Forests ml_random_forest_classifier()

14.4.2 Regression

Algorithm Function
Accelerated Failure Time Survival Regression ml_aft_survival_regression()
Decision Trees ml_decision_tree_regressor()
Generalized Linear Regression ml_generalized_linear_regression()
Gradient-Boosted Trees ml_gbt_regressor()
Isotonic Regression ml_isotonic_regression()
Linear Regression ml_linear_regression()

14.4.3 Clustering

Algorithm Function
Bisecting K-Means Clustering ml_bisecting_kmeans()
Gaussian Mixture Clustering ml_gaussian_mixture()
K-Means Clustering ml_kmeans()
Latent Dirichlet Allocation ml_lda()

14.4.4 Recommendation

Algorithm Function
Alternating Least Squares Factorization ml_als()

14.4.5 Frequent Pattern Mining

Algorithm Function
FPGrowth ml_fpgrowth()

14.4.6 Feature Transformers

Transformer Function
Binarizer ft_binarizer()
Bucketizer ft_bucketizer()
Chi-Squared Feature Selector ft_chisq_selector()
Vocabulary from Document Collections ft_count_vectorizer()
Discrete Cosine Transform ft_discrete_cosine_transform()
Transformation using dplyr ft_dplyr_transformer()
Hadamard Product ft_elementwise_product()
Feature Hasher ft_feature_hasher()
Term Frequencies using Hashing export(ft_hashing_tf)
Inverse Document Frequency ft_idf()
Imputation for Missing Values export(ft_imputer)
Index to String ft_index_to_string()
Feature Interaction Transform ft_interaction()
Rescale to [-1, 1] Range ft_max_abs_scaler()
Rescale to [min, max] Range ft_min_max_scaler()
Locality Sensitive Hashing ft_minhash_lsh()
Converts to n-grams ft_ngram()
Normalize using the given P-Norm ft_normalizer()
One-Hot Encoding ft_one_hot_encoder()
Feature Expansion in Polynomial Space ft_polynomial_expansion()
Maps to Binned Categorical Features ft_quantile_discretizer()
SQL Transformation ft_sql_transformer()
Standardizes Features using Corrected STD ft_standard_scaler()
Filters out Stop Words ft_stop_words_remover()
Map to Label Indices ft_string_indexer()
Splits by White Spaces ft_tokenizer()
Combine Vectors to Row Vector ft_vector_assembler()
Indexing Categorical Feature ft_vector_indexer()
Subarray of the Original Feature ft_vector_slicer()
Transform Word into Code ft_word2vec()

14.5 Kafka

This instructions were put together using information from the official Kafka site. Specifically from the Quickstart page, and at the time of writting this book. Newer versions of Kafka undoubtely will be available when reading this book. The idea is to “timestamp” the versions used in the example in the Streaming chapter.

  1. Download Kafka

    wget http://apache.claz.org/kafka/2.2.0/kafka_2.12-2.2.0.tgz
  2. Expand the tar file and enter the new folder

    tar -xzf kafka_2.12-2.2.0.tgz
    cd kafka_2.12-2.2.0
  3. Start the Zookeeper service that comes with Kafka

    bin/zookeeper-server-start.sh config/zookeeper.properties
  4. Start the Kafka service

    bin/kafka-server-start.sh config/server.properties

Make sure to always start Zookeper first, and then Kafka.