Apache Spark Training Course

Public Classroom

Summary

Apache Spark

This intensive 3-day course introduces Apache Spark, one of the most active projects in the Big Data ecosystem. Students will learn where Spark fits in the Hadoop/Big Data landscape, work with the Spark shell, understand Spark internals, write code for Spark, and combine it with Hadoop. The course also covers Spark SQL, Spark Streaming, machine learning with MLlib, and GraphX.

Duration

3 days

Course Objectives

By the end of the Spark training course, students should be able to:

  • Understand Spark’s purpose, main components, and architecture
  • Be comfortable working with Spark SQL
  • Understand Spark SQL connectivity and compatibility with Hive
  • Understand the Spark ecosystem and be comfortable working with the Spark API
  • Be familiar with the Spark web UI and shell
  • Understand Spark streaming and its applications
  • Be comfortable working with GraphX, Spark’s API for graphs and graph-parallel computation
  • Understand MLlib, Apache Spark’s scalable machine learning library
  • Tune Spark performance and understand its memory management
  • Understand how Spark works in Cluster mode

Audience

This course is designed for Java/Python developers who want to get comfortable with Spark, Apache’s cutting-edge engine for large-scale data processing.

Pre-requisites

To get the most out of this training, you must be a developer with the following skills:

  • Be comfortable working with Java or Python (Scala is a plus; the Scala used in class will be explained)
  • Be able to navigate the Linux command line
  • Have basic knowledge of a Linux editor (vi/nano) for modifying code

Lab Environment

  • A working Spark environment will be provided for students; they will need only an SSH client and a browser to access the cluster.
  • Zero Install: There is no need to install software on students’ machines!

Course Outline

Spark Basics

  • Background and history
  • Spark concepts and architecture
  • Run modes: Local vs Distributed
  • RDDs

Spark Ecosystem

  • Spark Core
  • Spark SQL
  • MLlib
  • Spark Streaming
  • Spark API

First Look at Spark

  • Spark in Local mode
  • Spark Web UI
  • Spark Shell
  • Analyzing datasets – Part 1
  • Inspecting Spark Resilient Distributed Datasets (RDDs), as sketched below
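
For a taste of the shell work in this unit, a minimal PySpark sketch (the file name is a hypothetical placeholder):

    # In the PySpark shell, a SparkContext is pre-created as `sc`.
    nums = sc.parallelize([1, 2, 3, 4, 5])    # RDD from a Python list
    lines = sc.textFile("data/sample.txt")    # RDD of lines (hypothetical file)

    print(nums.count())               # action: triggers computation, returns 5
    print(lines.first())              # first line of the file
    print(lines.getNumPartitions())   # inspect how the RDD is partitioned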

Resilient Distributed Datasets (RDDs) In Depth

  • Partitions
  • RDD Operations/Transformations
  • RDD types
  • MapReduce on RDD
  • Caching and persistence (sketched below)
  • Sharing cached RDDs
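
As an illustration, a word count (the classic MapReduce pattern) over a cached RDD might look like this; the input path is a placeholder:

    lines = sc.textFile("data/sample.txt")             # hypothetical input file
    words = lines.flatMap(lambda line: line.split())   # transformation: lazy
    pairs = words.map(lambda w: (w, 1))                # (word, 1) pairs
    counts = pairs.reduceByKey(lambda a, b: a + b)     # MapReduce-style aggregation

    counts.cache()            # keep the result in memory for reuse
    print(counts.count())     # action: computes and caches the RDD
    print(counts.take(5))     # reuses the cached partitions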

Spark and Hadoop

  • Hadoop architecture
  • HDFS intro
  • YARN intro
  • Running Spark on Hadoop
  • Processing HDFS files using Spark (example below)
  • Data Locality
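
Reading from HDFS uses the same RDD API; a minimal sketch, with a hypothetical namenode address and paths:

    # Only the URI scheme changes; host and paths below are placeholders.
    logs = sc.textFile("hdfs://namenode:8020/data/access.log")

    errors = logs.filter(lambda line: "ERROR" in line)
    print(errors.count())

    # Results can be written back to HDFS as well.
    errors.saveAsTextFile("hdfs://namenode:8020/out/errors")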

Spark API Programming

  • Introduction to Spark API and RDD API
  • Submitting the first program to Spark (sketched below)
  • Debugging and logging
  • Configuration properties
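
A minimal standalone application of the kind submitted in this unit might look like the following (script name, input path, and master URL are assumptions):

    # wordcount.py -- submit with, for example:
    #   spark-submit --master local[2] wordcount.py
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("WordCount")
    sc = SparkContext(conf=conf)

    counts = (sc.textFile("data/sample.txt")           # hypothetical input
                .flatMap(lambda line: line.split())
                .map(lambda w: (w, 1))
                .reduceByKey(lambda a, b: a + b))

    for word, n in counts.take(10):
        print(word, n)

    sc.stop()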

Spark SQL

  • SQLContext
  • Defining tables and importing datasets
  • Querying (example below)
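
In the SQLContext style the outline refers to, defining a table and querying it might look like this (column names and sample data are made up):

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)

    # Build a DataFrame from an RDD of tuples (toy data).
    rows = sc.parallelize([("alice", 30), ("bob", 25)])
    people = sqlContext.createDataFrame(rows, ["name", "age"])

    # Register it as a temporary table and query it with SQL.
    people.registerTempTable("people")
    adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
    adults.show()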

Spark Streaming

  • Streaming overview
  • Streaming operations
  • Sliding window operations
  • Writing Spark Streaming applications (sketched below)
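
A sliding-window word count over a socket stream might be sketched as follows (host, port, and window sizes are arbitrary choices):

    from pyspark.streaming import StreamingContext

    ssc = StreamingContext(sc, batchDuration=1)        # 1-second micro-batches
    lines = ssc.socketTextStream("localhost", 9999)    # hypothetical source

    # Count words over a 30-second window, sliding every 10 seconds.
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda w: (w, 1))
                   .reduceByKeyAndWindow(lambda a, b: a + b, None, 30, 10))

    counts.pprint()
    ssc.start()
    ssc.awaitTermination()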

Spark MLlib

  • MLlib intro
  • MLlib algorithms
  • Writing MLlib applications (example below)
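
For a flavor of the RDD-based MLlib API, a tiny k-means sketch (the data points are made up):

    from pyspark.mllib.clustering import KMeans

    # Two obvious clusters in 2-D (toy data).
    points = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])

    model = KMeans.train(points, k=2, maxIterations=10)
    print(model.clusterCenters)         # learned centroids
    print(model.predict([0.5, 0.5]))    # cluster index for a new point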

Spark GraphX

  • GraphX intro
  • Graph operators and graph-parallel computation

Spark Performance and Tuning

  • Broadcast variables
  • Accumulators (both sketched below)
  • Memory management
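
Broadcast variables and accumulators as named above might be exercised like this (the lookup table and input data are illustrative):

    # Ship a read-only lookup table to every executor once.
    lookup = sc.broadcast({"a": 1, "b": 2, "c": 3})

    # Count unknown keys on the driver as the job runs.
    bad = sc.accumulator(0)

    def translate(key):
        if key not in lookup.value:
            bad.add(1)
            return 0
        return lookup.value[key]

    result = sc.parallelize(["a", "b", "x", "c"]).map(translate).collect()
    print(result)       # [1, 2, 0, 3]
    print(bad.value)    # 1 -- read accumulator values on the driver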

Bonus Lab: Spark in Cluster mode

  • Standalone Spark cluster
  • Inspecting master and workers in UIs
  • Configurations
  • Distributed processing of large data sets