Chapter 10 Extensions

While this chapter has not been written yet, a few resources are available to help you explore these topics in the meantime.

10.1 RSparkling

rsparkling provides H2O support in Spark through sparklyr:

library(rsparkling)
library(sparklyr)
library(h2o)

# Assumes `sc` is an open Spark connection and `cars_tbl` is mtcars copied to Spark
cars_h2o <- as_h2o_frame(sc, cars_tbl, strict_version_check = FALSE)
h2o.glm(x = c("wt", "cyl"), y = "mpg", training_frame = cars_h2o, lambda_search = TRUE)

See spark.rstudio.com/guides/h2o.

10.1.1 Troubleshooting

Apache Ivy is a popular dependency manager focused on flexibility and simplicity, and Apache Spark uses it to install extensions. When a connection fails while using rsparkling, consider clearing your Ivy cache by running:

unlink("~/.ivy2/cache", recursive = TRUE)
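The same step can be wrapped in a small function that first checks whether the cache exists and reports whether it was removed. This is only a minimal sketch: `clear_ivy_cache()` is a hypothetical helper, not part of sparklyr or rsparkling, and the demo runs against a throwaway directory rather than the real cache.

```r
# Hypothetical helper: remove an Ivy cache directory if it exists.
# Returns TRUE (invisibly) when the directory was removed, FALSE otherwise.
clear_ivy_cache <- function(path = "~/.ivy2/cache") {
  path <- path.expand(path)
  if (!dir.exists(path)) {
    message("No Ivy cache found at ", path)
    return(invisible(FALSE))
  }
  status <- unlink(path, recursive = TRUE)  # unlink() returns 0 on success
  invisible(status == 0 && !dir.exists(path))
}

# Demonstrate on a temporary directory so the real cache is left alone.
demo_cache <- file.path(tempdir(), "ivy-demo", "cache")
dir.create(demo_cache, recursive = TRUE)
removed <- clear_ivy_cache(demo_cache)
```

Calling `clear_ivy_cache()` with no arguments targets the default location used in the command above. After clearing the cache, reconnect to Spark so that the dependencies are resolved again.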

10.2 GraphFrames

GraphFrames provides graph algorithms such as PageRank and ShortestPaths.
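GraphFrames expects a vertices table with an `id` column and an edges table with `src` and `dst` columns. The sketch below shows one way such tables might be derived from a plain edge list before copying them to Spark. `graph_tables()` and the `from`/`to` column names are assumptions made for illustration, not part of the graphframes package.

```r
# Hypothetical helper: derive GraphFrames-style tables from an edge list
# that has `from` and `to` columns.
graph_tables <- function(edge_list) {
  vertices <- data.frame(id = sort(unique(c(edge_list$from, edge_list$to))))
  edges <- data.frame(src = edge_list$from, dst = edge_list$to)
  list(vertices = vertices, edges = edges)
}

tables <- graph_tables(data.frame(from = c(1, 1, 2), to = c(2, 3, 3)))

# With an open Spark connection `sc` (assumed), the tables could then be
# copied to Spark and used to build a graph:
# vertices_tbl <- sparklyr::copy_to(sc, tables$vertices, "vertices")
# edges_tbl    <- sparklyr::copy_to(sc, tables$edges, "edges")
```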

library(graphframes)

gf_graphframe(vertices_tbl, edges_tbl) %>%
  gf_pagerank(reset_prob = 0.15, max_iter = 10L)
GraphFrame
Vertices:
  $ id       <dbl> 12, 12, 59, 59, 1, 20, 20, 45, 45, 8, 8, 9, 9, 26, 26, 37, 37, 47, 47, 16, 16, 71, 71, ...
  $ pagerank <dbl> 0.0058199702, 0.0058199702, 0.0000000000, 0.0000000000, 0.1500000000, 0.0344953402, 0.0...
Edges:
  $ src    <dbl> 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 58, 58, 58, 58, 58, 58, 5...
  $ dst    <dbl> 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 65, 65, 65, 65, 65, 65, 6...
  $ weight <dbl> 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0...

FIGURE 10.1: Highschool ggraph dataset with pagerank highlighted
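To make the `reset_prob` and `max_iter` arguments more concrete, here is a minimal base R sketch of the textbook PageRank iteration. It is not the GraphFrames implementation, which runs on Spark and may scale ranks differently. `pagerank_sketch()` is a hypothetical name, and it assumes an edge list with `src` and `dst` columns.

```r
# Textbook PageRank by power iteration (illustrative only).
pagerank_sketch <- function(edges, reset_prob = 0.15, max_iter = 10L) {
  ids <- sort(unique(c(edges$src, edges$dst)))
  n <- length(ids)
  src <- match(edges$src, ids)
  dst <- match(edges$dst, ids)
  out_degree <- tabulate(src, nbins = n)
  rank <- rep(1 / n, n)  # start from a uniform distribution

  for (i in seq_len(max_iter)) {
    # Each edge passes an equal share of its source's rank to its destination.
    share <- rank[src] / out_degree[src]
    received <- as.numeric(tapply(share, factor(dst, levels = seq_len(n)), sum))
    received[is.na(received)] <- 0  # vertices with no incoming edges
    # Rank held by vertices without outgoing edges is spread evenly.
    dangling <- sum(rank[out_degree == 0])
    # With probability reset_prob the walk jumps to a random vertex.
    rank <- reset_prob / n + (1 - reset_prob) * (received + dangling / n)
  }
  data.frame(id = ids, pagerank = rank)
}

# In a three-vertex cycle every vertex keeps a rank of 1/3.
cycle_ranks <- pagerank_sketch(data.frame(src = c(1, 2, 3), dst = c(2, 3, 1)))
```

A larger `reset_prob` pulls all ranks toward the uniform value, while `max_iter` caps how many times rank is propagated along the edges.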

See also spark.rstudio.com/graphframes.

10.3 MLeap

MLeap lets you export Spark pipelines so they can be run in production, outside Spark.

library(mleap)

# Create and fit the pipeline
pipeline_model <- ml_pipeline(sc) %>%
  ft_binarizer("hp", "big_hp", threshold = 100) %>%
  ft_vector_assembler(c("big_hp", "wt", "qsec"), "features") %>%
  ml_gbt_regressor(label_col = "mpg") %>%
  ml_fit(cars_tbl)

# Perform predictions; the result also serves as the sample input for the export
predictions_tbl <- ml_predict(pipeline_model, cars_tbl)

# Export model with mleap
ml_write_bundle(pipeline_model, predictions_tbl, "mtcars_model.zip")

The exported model can then be used outside Spark, in production systems. For instance, in Java:

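import java.io.File;
import ml.combust.mleap.runtime.MleapContext;
import ml.combust.mleap.runtime.javadsl.BundleBuilder;
import ml.combust.mleap.runtime.javadsl.ContextBuilder;
// Bundle, Transformer, and DefaultLeapFrame also need imports; their packages vary by MLeap version

// Initialize
BundleBuilder bundleBuilder = new BundleBuilder();
MleapContext context = (new ContextBuilder()).createMleapContext();
Bundle<Transformer> bundle = bundleBuilder.load(new File("mtcars_model.zip"), context);

// Read into MLeap DataFrame (placeholder: a real frame needs a schema and rows for hp, wt, and qsec)
DefaultLeapFrame inputLeapFrame = new DefaultLeapFrame();

// Perform MLeap transformation
DefaultLeapFrame transformedLeapFrame = bundle.root().transform(inputLeapFrame).get();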
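The bundle can also be scored from R without a Spark connection. The following is a sketch under stated assumptions: it relies on the mleap R package's `mleap_load_bundle()`, `mleap_model_schema()`, and `mleap_transform()`, assumes the MLeap runtime has already been installed (for example with `mleap::install_maven()` and `mleap::install_mleap()`), and uses made-up input rows. The scoring step is skipped when the package or the bundle file is not available.

```r
# Made-up rows with the columns the pipeline consumes.
newdata <- data.frame(
  qsec = c(16.2, 18.1),
  hp   = c(101, 99),
  wt   = c(2.68, 3.08)
)

bundle_path <- "mtcars_model.zip"

# Score only when the mleap package and the exported bundle are available.
if (requireNamespace("mleap", quietly = TRUE) && file.exists(bundle_path)) {
  model <- mleap::mleap_load_bundle(bundle_path)
  print(mleap::mleap_model_schema(model))      # inspect expected inputs and outputs
  scored <- mleap::mleap_transform(model, newdata)
  print(scored)
}
```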

See also spark.rstudio.com/guides/mleap.

10.4 Nested Data

library(sparklyr.nested)