Chapter 4 Modeling

In this chapter, we discover how Spark can be used to scale up machine learning workflows to big data. We build off of the ideas presented in the Analysis chapter and introduce the machine learning (ML) aspects. Spark MLlib is the component of Spark that allows one to write high level code to perform machine learning tasks on distributed data. Sparklyr provides an interface to the ML algorithms that should be familiar to R users. For example, you can run a logistic regression as follows:

ml_logistic_regression(mtcars, am ~ .)
## Formula: am ~ .
## Coefficients:
##  (Intercept)          mpg          cyl         disp           hp         drat 
##  -0.68057477   1.73068529  -6.50306685  -0.11106774   0.01566047  33.02750111 
##           wt         qsec           vs         gear         carb 
## -20.68143251  -9.52647833  -6.81113196  29.16524289   3.33862282 

As can be seen in List of ML functions, Spark provides a wide range of algorithms and feature transformers, and we will touch on a representative portion of the functionality. A complete treatment of predictive modeling concepts is outside the scope of this book, so we recommend complementing with “R for Data Science” (Wickham and Grolemund 2016) and “Feature Engineering and Selection: A Practical Approach for Predictive Models” (Kuhn, Max and Johnson, Kjell 2019), from which we adopted for examples and visualizations in this chapter.

4.1 Overview

The tasks we focus on in this chapter involve machine learning, which is what Spark ML aims to enable, as opposed to statistical inference. This means that we are often more concerned about forecasting the future rather than inferring the process by which our data is generated.1 Machine learning can be categorized into supervised learning, or predictive modeling, and unsupervised learning. In supervised learning, we try to learn a function that will map from “X” to “Y”, from a dataset of “(x, y)” examples. In unsupervised learning, we just have “X” and not the “Y” labels, so instead we try to learn something about the structure of “X”. Some practical use cases for supervised learning include forecasting the weather tomorrow, determining whether a credit card transaction is fradulent, and coming up with a price for your car insurance policy. For unsupervised learning, examples include automated grouping of photos of individuals, segmenting customers based on their purchase history, and clustering of documents.

The ML interface in sparklyr has been designed to minimize the cognitive effort for moving from a local, in-memory, native R workflow to the cluster, and back. While the Spark ecosystem is very rich, there is still a tremendous number of packages from CRAN, with some implementing functionality that you may require for a project. Also, you may want to leverage your skills and experience working in R to maintain productivity. What we learned in the Analysis section also applis here — it is important to keep track of where you are performing computations, and move between the cluster and your R session as appropriate.

In the next section, we will introduce the dataset that will be used throughout the chapter. We then demonstrate a supervised learning workflow that includes exploratory data analysis, feature engineering, and model building. We then move on to a topic modeling example that looks at some unstructured text data. Keep in mind that our goal will be to show various techniques of executing data science tasks on large data, rather than conducting a rigorous and coherent analysis.

4.2 The Data

The examples in this chapter will utilize the OkCupid dataset, available at.2 The dataset consists of user profile data from an online dating site, and contains a diverse set of features, including biographical characteristics such as gender and profession, and free text fields related to personal interests. There are about 60,000 profiles in the dataset, which fits comfortably into memory on a modern laptop and wouldn’t be considered “big data”, so you can easily follow along running Spark local mode. In a later chapter, we’ll discuss specific considerations for dealing with distributed datasets on clusters.

To motivate the examples, we will consider the following problem:

Predict whether someone is actively working, i.e. not retired, a student, or unemployed.

Note: The examples in this chapter utilize small datasets so readers can easily follow along in local mode. In practice, if your dataset fits comfortably in memory on your local machine, you may be better off using an efficient non-distributed implementation of the ML algorithm. For example, you may want to use the ranger package instead of ml_random_forest_classifier().

Next up, we will look at the data.

4.3 Exploratory Data Analysis

Exploratory data analysis (EDA), in the context of predictive modeling, is the exercise of looking at excerpts and summaries of the data. The specific goals of the EDA stage is informed by the business problem, but here are some common objectives:

  • Checking for data quality — confirming meaning and prevalence of missing values and reconciling statistics against existing controls,
  • Understand univariate relationships between variables, and
  • Perform an initial assessment on what variables to include and what transformations need to be done on them.

We’ll first read in the data, assuming profiles.csv is in the data/ folder.

okc <- spark_read_csv(
  escape = "\"", 
  options = list(multiline = TRUE)
) %>%
    height = as.numeric(height),
    income = ifelse(income == "-1", NA, as.numeric(income))
  ) %>%
  mutate_if(is.character, list(~ ifelse(, "missing", .)))

We specify escape = "\"" and options = list(multiline = TRUE) here to accommodate for embedded quote characters and newlines in the essay fields. We also convert the height and income columns to numeric types, and recode missing values in the string columns. Note that it may very well take a few tries of specifying different parameters to get the initial data ingest correct, and sometimes you may have to revisit this step after you learn more about the data during modeling.

We can now take a quick look at our data with glimpse():

## Observations: ??
## Variables: 31
## Database: spark_connection
## $ age         <int> 22, 35, 38, 23, 29, 29, 32, 31, 24, 37, 35…
## $ body_type   <chr> "a little extra", "average", "thin", "thin…
## $ diet        <chr> "strictly anything", "mostly other", "anyt…
## $ drinks      <chr> "socially", "often", "socially", "socially…
## $ drugs       <chr> "never", "sometimes", "missing", "missing"…
## $ education   <chr> "working on college/university", "working …
## $ essay0      <chr> "about me:<br />\n<br />\ni would love to …
## $ essay1      <chr> "currently working as an international age…
## $ essay2      <chr> "making people laugh.<br />\nranting about…
## $ essay3      <chr> "the way i look. i am a six foot half asia…
## $ essay4      <chr> "books:<br />\nabsurdistan, the republic, …
## $ essay5      <chr> "food.<br />\nwater.<br />\ncell phone.<br…
## $ essay6      <chr> "duality and humorous things", "missing", …
## $ essay7      <chr> "trying to find someone to hang out with. …
## $ essay8      <chr> "i am new to california and looking for so…
## $ essay9      <chr> "you want to be swept off your feet!<br />…
## $ ethnicity   <chr> "asian, white", "white", "missing", "white…
## $ height      <dbl> 75, 70, 68, 71, 66, 67, 65, 65, 67, 65, 70…
## $ income      <dbl> NaN, 80000, NaN, 20000, NaN, NaN, NaN, NaN…
## $ job         <chr> "transportation", "hospitality / travel", …
## $ last_online <chr> "2012-06-28-20-30", "2012-06-29-21-41", "2…
## $ location    <chr> "south san francisco, california", "oaklan…
## $ offspring   <chr> "doesn&rsquo;t have kids, but might want t…
## $ orientation <chr> "straight", "straight", "straight", "strai…
## $ pets        <chr> "likes dogs and likes cats", "likes dogs a…
## $ religion    <chr> "agnosticism and very serious about it", "…
## $ sex         <chr> "m", "m", "m", "m", "m", "m", "f", "f", "f…
## $ sign        <chr> "gemini", "cancer", "pisces but it doesn&r…
## $ smokes      <chr> "sometimes", "no", "no", "no", "no", "no",…
## $ speaks      <chr> "english", "english (fluently), spanish (p…
## $ status      <chr> "single", "single", "available", "single",…

Now we will add our response variable as a column in the dataset and look at its distribution

okc <- okc %>%
    not_working = ifelse(job %in% c("student", "unemployed", "retired"), 1 , 0)

okc %>% 
  group_by(not_working) %>% 
## # Source: spark<?> [?? x 2]
##   not_working     n
##         <dbl> <dbl>
## 1           0 54541
## 2           1  5405

Before we proceed further, let us perform an initial split of our data into a training set and a testing set and put away the latter. In practice, this is a crucial step because we would like to have a holdout set that we set aside at the end of the modeling process to evaluate model performance. If we were to include the entire dataset during EDA, information from the testing set could “leak” into the visualizations and summary statistics, and bias our model building process even though the data is not used directly in a learning algorithm. This would undermine the credibility of our performance metrics. Splitting the data can be done easily by using the sdf_partition() function:

data_splits <- sdf_random_split(okc, training = 0.8, testing = 0.2, seed = 42)
okc_train <- data_splits$training
okc_test <- data_splits$testing

We can quickly look at the distribution of our response variable:

okc_train %>%
  group_by(not_working) %>%
  tally() %>%
  mutate(frac = n / sum(n, na.rm = TRUE))
## # Source: spark<?> [?? x 3]
##   not_working     n   frac
##         <dbl> <dbl>  <dbl>
## 1           0 43785 0.910 
## 2           1  4317 0.0897

Using the sdf_describe() function, we can obtain numerical summaries of specific columns:

sdf_describe(okc_train, cols = c("age", "income"))
## # Source: spark<?> [?? x 3]
##   summary age                income            
##   <chr>   <chr>              <chr>             
## 1 count   48102              9193              
## 2 mean    32.336534863415245 104968.99815076689
## 3 stddev  9.43908920033797   202235.2291773537 
## 4 min     18                 20000.0           
## 5 max     110                1000000.0   

Like we saw in the Analysis chapter, we can also utilize the dbplot package to plot distributions of these variables:

okc_train %>%
Distribution of age

FIGURE 4.1: Distribution of age

A common EDA exercise is to look at the relationship between the response and the individual predictors. For example, we can explore the religion variable:

prop_data <- okc_train %>%
  mutate(religion = regexp_extract(religion, "^\\\\w+", 0)) %>% 
  group_by(religion, not_working) %>%
  tally() %>%
  group_by(religion) %>%
    count = sum(n, na.rm = TRUE),
    prop = sum(not_working * n, na.rm = TRUE) / sum(n, na.rm = TRUE)
  ) %>%
  mutate(se = sqrt(prop * (1 - prop) / count)) %>%

## # A tibble: 10 x 4
##    religion     count   prop      se
##    <chr>        <dbl>  <dbl>   <dbl>
##  1 judaism       2520 0.0794 0.00539
##  2 atheism       5624 0.118  0.00436
##  3 christianity  4671 0.120  0.00480
##  4 hinduism       358 0.101  0.0159 
##  5 islam          115 0.191  0.0367 
##  6 agnosticism   7078 0.0958 0.00346
##  7 other         6240 0.0841 0.00346
##  8 missing      16152 0.0719 0.002  
##  9 buddhism      1575 0.0851 0.007  
## 10 catholicism   3769 0.0886 0.00458

Note that prop_data is a small data frame that has been collected into memory in our R session, we can take advantage of ggplot2 to create an informative visualization.

Proportion of individuals not working by religion

FIGURE 4.2: Proportion of individuals not working by religion

Next, we take a look at the relationship between a couple of predictors: alcohol use and drug use. We would expect there to be some correlation between them. You can compute a contingency table via sdf_crosstab():

contingency_tbl <- okc_train %>% 
  sdf_crosstab("drinks", "drugs") %>%

## # A tibble: 7 x 5
##   drinks_drugs missing never often sometimes
##   <chr>          <dbl> <dbl> <dbl>     <dbl>
## 1 very often        54   144    44       137
## 2 socially        8221 21066   126      4106
## 3 not at all       146  2371    15       109
## 4 desperately       72    89    23        74
## 5 often           1049  1718    69      1271
## 6 missing         1121  1226    10        59
## 7 rarely           613  3689    35       445

We can visualize this contingency table using a mosaic plot.

Mosaic plot of drug and alcohol use

FIGURE 4.3: Mosaic plot of drug and alcohol use

To further explore the relationship between this two variables, we can perform correspondence analysis using the FactoMineR package and visualize the results.

Correspondence analysis principal coordinates for drugs and alcohol use

FIGURE 4.4: Correspondence analysis principal coordinates for drugs and alcohol use

Correspondence analysis transforms the factors into variables called principal coordinates, which correspond to the axes in the plot and represent how much information in the contingency table they contain. We can, for example, interpret the proximity of “drinking often” and “using drugs very often” as indicating association.

This concludes our discussion on EDA, and we will now proceed to feature engineering.

4.4 Feature Engineering

The feature engineering exercise comprises transforming the data to increase the performance of the model. This can include things like centering and scaling numerical values and performing string manipulation to extract meaningful variables. It also often includes variable selection — the process of selecting which predictors are used in the model.

Let us start with scaling the age variable by removing the mean and scaling to unit variance. Some algorithms, especially neural networks, train faster if we normalize our inputs. We begin by calculating the mean and standard deviation of the variable:

scale_values <- okc_train %>%
    mean_age = mean(age, na.rm = TRUE),
    sd_age = sd(age)
  ) %>%

## # A tibble: 1 x 2
##   mean_age sd_age
##      <dbl>  <dbl>
## 1     32.3   9.44

We can then use these to transform the dataset:

okc_train <- okc_train %>%
  mutate(scaled_age = (age - !!scale_values$mean_age) / !!scale_values$sd_age)
okc_train %>%
Distribution of scaled age

FIGURE 4.5: Distribution of scaled age

Since some of the profile feature are multiple-select, we need to process them before we can build meaningful models. If we take a look at the ethnicity column, for example, we see that there are many different combinations:

okc_train %>%
  group_by(ethnicity) %>%
## # Source: spark<?> [?? x 2]
##    ethnicity                                     n
##    <chr>                                     <dbl>
##  1 hispanic / latin, white                    1051
##  2 black, pacific islander, hispanic / latin     2
##  3 asian, black, pacific islander                5
##  4 black, native american, white                91
##  5 middle eastern, white, other                 34
##  6 asian, other                                 78
##  7 asian, black, white                          12
##  8 asian, hispanic / latin, white, other         7
##  9 middle eastern, pacific islander              1
## 10 indian, hispanic / latin                      5
## # … with more rows

For our model, we create indicator variables for each race, as follows:

ethnicities <- c("asian", "middle eastern", "black", "native american", "indian", 
                 "pacific islander", "hispanic / latin", "white", "other")
ethnicity_vars <- ethnicities %>% 
  map(~ expr(ifelse(like(ethnicity, !!.x), 1, 0))) %>%
  set_names(paste0("ethnicity_", gsub("\\s|/", "", ethnicities)))
okc_train <- mutate(okc_train, !!!ethnicity_vars)
okc_train %>% 
  select(starts_with("ethnicity_")) %>%
## Observations: ??
## Variables: 9
## Database: spark_connection
## $ ethnicity_asian           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ethnicity_middleeastern   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ethnicity_black           <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ethnicity_nativeamerican  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ethnicity_indian          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ethnicity_pacificislander <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ethnicity_hispaniclatin   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ethnicity_white           <dbl> 1, 0, 1, 0, 1, 1, 1, 0, 1, 0…
## $ ethnicity_other           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

For the free text fields, a straightforward way to extract features is counting the total number of characters.

okc_train <- okc_train %>%
    essay_length = char_length(paste(!!!syms(paste0("essay", 0:9))))
okc_train %>%
  dbplot_histogram(essay_length, bins = 50)
Distribution of essay length

FIGURE 4.6: Distribution of essay length

Now that we have a few more features to work with, we can begin running some ML algorithms!

4.5 Model Building

Once we have a good grasp on our dataset, we can start building some models. Before we do so, however, we need to come up with a plan to tune and validate our candidate models. Since we are dealing with a binary classification problem, the metrics one can use include accuracy, precision, sensitivity, and area under the receiver operating characteristic (ROC) curve (AUC), among others. The metric you optimize depends on your specific business problem, but for this exercise, we will focus on the AUC.

It is important that we don’t peek at the testing holdout set until the very end. For tuning and validation, we will perform 10-fold cross validation. The scheme works as follows: We first divide our dataset into 10 approximately equal sized subsets. We take the 2nd to 10th sets together as the training set for an algorithm, and validate the resulting model on the 1st set. Next, we reserve the 2nd set as the validation set, and train the algorithm on the 1st and 3rd to 10th sets. In total, we train ten models and average the performance. If time and resources allow, you can also perform this procedure multiple times with different random partitions of the data. In our case, we will demonstrate how to perform the cross validation once.

Using the sdf_partition() function, we can create a list of subsets from our okc_train table:

vfolds <- sdf_partition(
  weights = set_names(rep(0.1, 10), paste0("fold", 1:10)),
  seed = 42

We then create our first training/validation split as follows:

train_set <-, vfolds[2:10])
validation_set <- vfolds[[1]]

One item we need to carefully treat here is the scaling of variables. We need to make sure we do not leak any information from the validation set to the training set, so we calculate the mean and standard deviation on the training set only, and apply the same transformation to both sets. Here is how we would handle this for the age variable:

make_scale_age <- function(training) {
  scale_values <- training %>%
      mean_age = mean(age, na.rm = TRUE),
      sd_age = sd(age)
    ) %>%

  function(data) {
    data %>%
      mutate(scaled_age = (age - !!scale_values$mean_age) / !!scale_values$sd_age)

scale_age <- make_scale_age(train_set)
train_set <- scale_age(train_set)
validation_set <- scale_age(validation_set)

Logistic regression is often a reasonable starting point for binary classification problems, so let us give it a try. Suppose also that our domain knowledge provides us with an initial set of predictors. We can then fit a model by using the formula interface:

lr <- ml_logistic_regression(
  train_set, not_working ~ scaled_age + sex + drinks + drugs + essay_length
## Formula: not_working ~ scaled_age + sex + drinks + drugs + essay_length
## Coefficients:
##       (Intercept)        scaled_age             sex_m   drinks_socially 
##     -2.823517e+00     -1.309498e+00     -1.918137e-01      2.235833e-01 
##     drinks_rarely      drinks_often drinks_not at all    drinks_missing 
##      6.732361e-01      7.572970e-02      8.214072e-01     -4.456326e-01 
## drinks_very often       drugs_never     drugs_missing   drugs_sometimes 
##      8.032052e-02     -1.712702e-01     -3.995422e-01     -7.483491e-02 
##      essay_length 
##      3.664964e-05 

To obtain a summary of performance metrics on the validation set, we can use the ml_evaluate() function.

validation_summary <- ml_evaluate(lr, validation_set)

You can print validation_summary to see the available metrics

## BinaryLogisticRegressionSummaryImpl 
##  Access the following via `$` or `ml_summary()`. 
##  - features_col() 
##  - label_col() 
##  - predictions() 
##  - probability_col() 
##  - area_under_roc() 
##  - f_measure_by_threshold() 
##  - pr() 
##  - precision_by_threshold() 
##  - recall_by_threshold() 
##  - roc() 
##  - prediction_col() 
##  - accuracy() 
##  - f_measure_by_label() 
##  - false_positive_rate_by_label() 
##  - labels() 
##  - precision_by_label() 
##  - recall_by_label() 
##  - true_positive_rate_by_label() 
##  - weighted_f_measure() 
##  - weighted_false_positive_rate() 
##  - weighted_precision() 
##  - weighted_recall() 
##  - weighted_true_positive_rate() 

We can plot the ROC curve by collecting the output of validation_summary$roc() and using ggplot2:

ROC curve for the logistic regression model

FIGURE 4.7: ROC curve for the logistic regression model

Recall that the ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) for varying values of the classification threshold. In practice, the business problem helps to determine where on the curve one sets the threshold for classification. The AUC is a summary measure for determining the quality of a model, and we can compute it by calling the area_under_roc() function.

## [1] 0.7872754

Note: Spark only provides evaluation methods for generalized linear models (including linear models and logistic regression.) For other algorithms, you can use the evaluator functions (e.g. ml_binary_classification_evaluator() on the prediction data frame) or compute your own metrics.

Now, we can easily repeat the logic we have above and apply it to each train/validation split:

cv_results <- map_df(1:10, function(v) {
  train_set <-, vfolds[setdiff(1:10, v)])
  validation_set <- vfolds[[v]]
  scale_age <- make_scale_age(train_set)
  train_set <- scale_age(train_set)
  validation_set <- scale_age(validation_set)
  model <- ml_logistic_regression(
    train_set, not_working ~ scaled_age + sex + drinks + drugs + essay_length
  s <- ml_evaluate(model, validation_set)
  roc_df <- s$roc() %>% 
  auc <- s$area_under_roc()
    Resample = paste0("Fold", str_pad(v, width = 2, pad = "0")),
    roc_df = list(roc_df),
    auc = auc

This gives us 10 ROC curves:

Cross-validated ROC curves for the logistic regression model

FIGURE 4.8: Cross-validated ROC curves for the logistic regression model

and we can obtain the average AUC metric:

## [1] 0.7715102

4.5.1 Logistic Regression as a Generalized Linear Regression

In Spark ML, you can also fit a logistic regression via the generalized linear regression interface by specifying family = "binomial". Because the result is a regression model, the ml_predict() method does not give class probabilities. However, this interface may be of interest to those who are looking for generalized linear model (GLM) diagnostics, including confidence intervals for coefficient estimates.

glr <- ml_generalized_linear_regression(
  not_working ~ scaled_age + sex + drinks + drugs, 
  family = "binomial"

tidy_glr <- tidy(glr)

We can extract the coefficient estimates into a tidy data frame, which we can then process further, for example, to create a coefficient plot.

Coefficient estimates with 95% confidence intervals

FIGURE 4.9: Coefficient estimates with 95% confidence intervals

Note: Both ml_logistic_regression() and ml_linear_regression() support elastic net regularization (Zou and Hastie 2005) through the reg_param and elastic_net_param parameters. reg_param corresponds to \(\lambda\) whereas elastic_net_param correspond to \(\alpha\). ml_generalized_linear_regression() supports only reg_param.

4.5.2 More Machine Learning Algorithms

Beyond linear models, Spark ML supports many of the standard ML algorithms with the mechanism demonstrated above, you can try different algorithms and hyperparameters for your problem. You can find a list of supported ML related functions in the Appendix. The interfaces to access these functionalities are largely identical, so it is easy to experiment with them. For example, to fit a neural network model we can run:

nn <- ml_multilayer_perceptron_classifier(
  not_working ~ scaled_age + sex + drinks + drugs + essay_length, 
  layers = c(12, 64, 64, 2)

This gives us a feedforward neural network model with two hidden layers of 64 nodes each. Note that you have to specify the correct values for the input and output layers in the layers argument. We can obtain predictions on a validation set using ml_predict()

predictions <- ml_predict(nn, validation_set)

then compute the AUC via ml_binary_classification_evaluator()

## [1] 0.7812709

Up until now, we have not look into the unstructured text in the essay fields apart from doing simple character counts. In the next section, we will explore the textual data in more depth.

4.6 Working with Textual Data

Along with speech, images, and videos, textual data is one of the components of the “big data” explosion. Prior to modern text mining techniques and the computational resources to support them, companies had little use for freeform text fields. Today, text is considered a rich source of insights, and can be found anywhere from physician’s notes to customer complaints. In this section, we show some basic text analysis capabilities of sparklyr. If you would like more background on text mining techniques, we recommend checking out “Text Mining with R: A Tidy Approach”. (???)

In this section, we show how to perform a basic topic modeling task on the essay data in the OKCupid dataset. Our plan is to concatenate the essay fields (of which there are 10) of each profile, and regard each profile as a document, then attempt to discover topics using Latent Dirichlet Allocation (LDA).

4.6.1 Data Prep

As always, before analyzing a dataset (or a subset of one), we want to take a quick look to orient ourselves.

essay_cols <- paste0("essay", 0:9)
essays <- okc %>%
essays %>% 
## Observations: ??
## Variables: 10
## Database: spark_connection
## $ essay0 <chr> "about me:<br />\n<br />\ni would love to think that…
## $ essay1 <chr> "currently working as an international agent for a f…
## $ essay2 <chr> "making people laugh.<br />\nranting about a good sa…
## $ essay3 <chr> "the way i look. i am a six foot half asian, half ca…
## $ essay4 <chr> "books:<br />\nabsurdistan, the republic, of mice an…
## $ essay5 <chr> "food.<br />\nwater.<br />\ncell phone.<br />\nshelt…
## $ essay6 <chr> "duality and humorous things", "missing", "missing",…
## $ essay7 <chr> "trying to find someone to hang out with. i am down …
## $ essay8 <chr> "i am new to california and looking for someone to w…
## $ essay9 <chr> "you want to be swept off your feet!<br />\nyou are …

Just from this output, we see that

  • The text contain HTML tags,
  • The text contains the newline \n character, and
  • There are missing values in the data.

As you analyze your own text data, you will quickly come across and become familiar with the peculiarities of the specific dataset. Preoprocessing text data, like with tabular numerical data, is an iterative process, and after a few tries we have the following transformations:

essays <- essays %>%
  # Replace `missing` with empty string.
  mutate_all(list(~ ifelse(. == "missing", "", .))) %>%
  # Concatenate the columns.
  mutate(essay = paste(!!!syms(essay_cols))) %>%
  # Replace URLs with the "URL" string
  mutate(words = regexp_replace(essay, !!re2, "URL")) %>%
  # Remove miscellaneous characters and HTML tags
  mutate(words = regexp_replace(words, "\\n|&nbsp;|<[^>]*>|[^A-Za-z|']", " "))

Note here we are using regex_replace(), which is a Spark SQL function.

4.6.2 Topic Modeling

LDA is a type of topic model for identifying abstract “topics” in a set of documents. It is an unsupervised algorithm in that we do not provide any labels, or topics, for the input documents. LDA posits that each document is a mixture of topics, and each topic is a mixture of words. During training, it attempts to estimate both of these simultaneously. A typical use case for topic models involves categorizing many documents, where the large number of documents renders manually approaches infeasible. The application domains range from GitHub issues to legal documents.

Once we have a reasonably clean dataset following the workflow in the previous section, we can fit an LDA model with ml_lda():

stop_words <- ml_default_stop_words(sc) %>%
    "like", "love", "good", "music", "friends", "people", "life",
    "time", "things", "food", "really", "also", "movies"

lda_model <-  ml_lda(essays, ~ words, k = 6, max_iter = 1, min_token_length = 4, stop_words = stop_words, min_df = 5)

We are also including a stop_words vector consisting of commonly used English words and common words in our dataset, which tells the algorithm to ignore them. After the model is fit, we can use the tidy() function to extract the associated betas, which are the per-topic-per-word probabilities, from the model.

betas <- tidy(lda_model)
## # A tibble: 256,992 x 3
##    topic term      beta
##    <int> <chr>    <dbl>
##  1     0 know      303.
##  2     0 work      250.
##  3     0 want      367.
##  4     0 books     211.
##  5     0 family    213.
##  6     0 think     291.
##  7     0 going     160.
##  8     0 anything  292.
##  9     0 enjoy     145.
## 10     0 much      272.
## # … with 256,982 more rows

We can then visualize this output by looking at word probabilities by topic. In Fig. N we show the results at 1 iteration and 100 iterations.

The terms that are most common within each topic, 1 iteration

FIGURE 4.10: The terms that are most common within each topic, 1 iteration

The terms that are most common within each topic, 100 iterations

FIGURE 4.11: The terms that are most common within each topic, 100 iterations

We can see that, at 100 iterations, we can see “topics” starting to emerge!

4.7 Conclusion

In this chapter, we cover the basics of building preditive models with sparklyr. You should now have the knowledge to begin applying machine learning to your own large datasets. In the next chapter on pipelines and deployment, we will look at ways to make your workflow more flexible for the advanced use cases.


Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, Inc.

Kuhn, Max and Johnson, Kjell. 2019. “Feature Engineering and Selection: A Practical Approach for Predictive Models.”

Zou, Hui, and Trevor Hastie. 2005. “Regularization and Variable Selection via the Elastic Net.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2): 301–20.