Mastering Spark with R
Welcome
1 Introduction
1.1 Overview
1.2 Hadoop
1.3 Spark
1.4 R
1.5 sparklyr
1.6 Recap
2 Getting Started
2.1 Overview
2.2 Prerequisites
2.2.1 Installing sparklyr
2.2.2 Installing Spark
2.3 Connecting
2.4 Using Spark
2.4.1 Web Interface
2.4.2 Analysis
2.4.3 Modeling
2.4.4 Data
2.4.5 Extensions
2.4.6 Distributed R
2.4.7 Streaming
2.4.8 Logs
2.5 Disconnecting
2.6 Using RStudio
2.7 Resources
2.8 Recap
3 Analysis
3.1 Overview
3.2 Import
3.3 Wrangle
3.3.1 Built-in Functions
3.3.2 Correlations
3.4 Visualize
3.4.1 Using ggplot2
3.4.2 Using dbplot
3.5 Model
3.5.1 Caching
3.6 Communicate
3.7 Recap
4 Modeling
4.1 Overview
4.2 Exploratory Data Analysis
4.3 Feature Engineering
4.4 Supervised Learning
4.4.1 Generalized Linear Regression
4.4.2 Other Models
4.5 Unsupervised Learning
4.5.1 Data Preparation
4.5.2 Topic Modeling
4.6 Recap
5 Pipelines
5.1 Overview
5.2 Creation
5.3 Use Cases
5.3.1 Hyperparameter Tuning
5.4 Operating Modes
5.5 Interoperability
5.6 Deployment
5.6.1 Batch Scoring
5.6.2 Real-Time Scoring
5.7 Recap
6 Clusters
6.1 Overview
6.2 On-Premise
6.2.1 Managers
6.2.2 Distributions
6.3 Cloud
6.3.1 Amazon
6.3.2 Databricks
6.3.3 Google
6.3.4 IBM
6.3.5 Microsoft
6.3.6 Qubole
6.4 Kubernetes
6.5 Tools
6.5.1 RStudio
6.5.2 Jupyter
6.5.3 Livy
6.6 Recap
7 Connections
7.1 Overview
7.1.1 Edge Nodes
7.1.2 Spark Home
7.2 Local
7.3 Standalone
7.4 YARN
7.4.1 YARN Client
7.4.2 YARN Cluster
7.5 Livy
7.6 Mesos
7.7 Kubernetes
7.8 Cloud
7.9 Batches
7.10 Tools
7.11 Multiple
7.12 Troubleshooting
7.12.1 Logging
7.12.2 Spark Submit
7.12.3 Windows
7.13 Recap
8 Data
8.1 Overview
8.2 Reading Data
8.2.1 Paths
8.2.2 Schema
8.2.3 Memory
8.2.4 Columns
8.3 Writing Data
8.4 Copy
8.5 File Formats
8.5.1 CSV
8.5.2 JSON
8.5.3 Parquet
8.5.4 Others
8.6 File Systems
8.7 Storage Systems
8.7.1 Hive
8.7.2 Cassandra
8.7.3 JDBC
8.8 Recap
9 Tuning
9.1 Overview
9.1.1 Graph
9.1.2 Timeline
9.2 Configuring
9.2.1 Connect Settings
9.2.2 Submit Settings
9.2.3 Runtime Settings
9.2.4 sparklyr Settings
9.3 Partitioning
9.3.1 Implicit Partitions
9.3.2 Explicit Partitions
9.4 Caching
9.4.1 Checkpointing
9.4.2 Memory
9.5 Shuffling
9.6 Serialization
9.7 Configuration Files
9.8 Recap
10 Extensions
10.1 Overview
10.2 H2O
10.3 Graphs
10.4 XGBoost
10.5 Deep Learning
10.6 Genomics
10.7 Spatial
10.8 Troubleshooting
10.9 Recap
11 Distributed R
11.1 Overview
11.2 Use Cases
11.2.1 Custom Parsers
11.2.2 Partitioned Modeling
11.2.3 Grid Search
11.2.4 Web APIs
11.2.5 Simulations
11.3 Partitions
11.4 Grouping
11.5 Columns
11.6 Context
11.7 Functions
11.8 Packages
11.9 Cluster Requirements
11.9.1 Installing R
11.9.2 Apache Arrow
11.10 Troubleshooting
11.10.1 Worker Logs
11.10.2 Resolving Timeouts
11.10.3 Inspecting Partitions
11.10.4 Debugging Workers
11.11 Recap
12 Streaming
12.1 Overview
12.2 Transformations
12.2.1 Analysis
12.2.2 Modeling
12.2.3 Pipelines
12.2.4 Distributed R
12.3 Kafka
12.4 Shiny
12.5 Recap
13 Contributing
13.1 Overview
13.2 Spark API
13.3 Spark Extensions
13.4 Scala Code
13.5 Recap
14 Appendix
14.1 Preface
14.1.1 Formatting
14.2 Introduction
14.2.1 World's Storage Capacity
14.2.2 Daily Downloads of CRAN Packages
14.3 Getting Started
14.3.1 Prerequisites
14.4 Analysis
14.4.1 Hive Functions
14.5 Modeling
14.5.1 MLlib Functions
14.6 Clusters
14.6.1 Google Trends for Mainframes, Cloud Computing, and Kubernetes
14.7 Streaming
14.7.1 Stream Generator
14.7.2 Installing Kafka
15 References