Big Data Bootcamp Training Course

Public Classroom

Summary

Big Data Bootcamp

Big Data Bootcamp is a 5-day class, split evenly between lectures and hands-on labs, that starts with a complete introduction to Big Data and gradually moves into in-depth coverage of Hadoop, Spark, NoSQL, and related technologies. This class equips you with the tools needed to build systems that process large amounts of data for any purpose, from archival and batch workloads to interactive and real-time applications.

Duration

5 days

Course Objectives

By the end of the course, you will be familiar with:

  • Hadoop: HDFS, MapReduce, Pig, Hive
  • Spark: Spark Core, Spark SQL, Spark Java API, Spark Streaming
  • NoSQL: Cassandra/HBase architecture, Java API, drivers, data modeling

Audience

This bootcamp is designed for developers who want to gain a solid understanding of Big Data technologies. Previous experience with Java is recommended, since all labs are in Java.

Pre-requisites

This course assumes you are already acquainted with the Java programming language. If you are not, we recommend enrolling in our Introduction to Java Programming training course.

Familiarity with Linux is highly desired.

Outline

Hadoop and Spark

  1. Intro to Hadoop
  2. Hadoop history, concepts
  3. Ecosystem
  4. Hadoop distributions
  5. High-level architecture
  6. Hadoop myths and challenges
  7. Hardware/Software
  8. HDFS Overview
  9. Concepts (horizontal scaling, replication, data locality, rack awareness)
  10. HDFS Architecture (NameNode, Secondary NameNode, DataNode)
  11. Data integrity
  12. Future of HDFS: NameNode HA, Federation
  13. MapReduce Overview
  14. MapReduce concepts
  15. Phases: driver, mapper, shuffle/sort, reducer
  16. Thinking in MapReduce
  17. Future of MapReduce (YARN)
  18. Pig
  19. Pig vs Java vs MapReduce
  20. Pig Latin language
  21. User defined functions
  22. Understanding Pig job flow
  23. Basic data analysis with Pig
  24. Complex data analysis with Pig
  25. Working with multiple datasets in Pig
  26. Advanced concepts
  27. Hive
  28. Architecture
  29. Data types
  30. Hive data management
  31. Hive vs SQL
  32. Spark Basics
  33. Background and history
  34. Spark and Hadoop
  35. Spark concepts and architecture
  36. Spark ecosystem (Core, Spark SQL, MLlib, Streaming)
  37. First look at Spark
  38. Spark in local mode
  39. Spark web UI
  40. Spark shell
  41. Analyzing datasets
  42. Inspecting RDDs in-depth
  43. Partitions
  44. RDD Operations/Transformations
  45. RDD types
  46. MapReduce on RDD
  47. Caching and persistence
  48. Sharing cached RDDs
  49. Spark API programming
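The "Thinking in MapReduce" topic above can be sketched in plain Java, with no Hadoop cluster required. This is an illustrative single-JVM analogy, not the course's lab code; a real Hadoop job would use the org.apache.hadoop Mapper/Reducer APIs and run distributed.

```java
import java.util.*;
import java.util.stream.*;

// A minimal, single-JVM sketch of the MapReduce word-count pattern:
// a "map" phase emits (word, 1) pairs, then a "shuffle/sort + reduce"
// phase groups by key and sums the counts.
public class WordCountSketch {

    // "Map" phase: emit (word, 1) pairs for each line of input.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // "Shuffle/sort" + "reduce": group pairs by key and sum the values.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("big data is big", "data is everywhere");
        List<Map.Entry<String, Integer>> mapped = lines.stream()
                .flatMap(l -> map(l).stream())
                .collect(Collectors.toList());
        // Prints {big=2, data=2, everywhere=1, is=2}
        System.out.println(reduce(mapped));
    }
}
```

The same split between a per-record map function and a per-key reduce function is what Hadoop parallelizes across a cluster.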

Spark and Spark Streaming

  1. Introduction to Spark API/RDD API
  2. Submitting the first program to Spark
  3. Debugging/logging
  4. Configuration properties
  5. Streaming overview
  6. Streaming operations
  7. Sliding window operations
  8. Writing Spark Streaming applications
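The sliding-window idea from the streaming topics above can be modeled in plain Java. This is only an analogy under simplifying assumptions: Spark Streaming's DStream.window(windowLength, slideInterval) groups micro-batches by time, whereas here each micro-batch is reduced to a single count in a list.

```java
import java.util.*;

// A plain-Java analogy for Spark Streaming's sliding windows:
// model each micro-batch as an int count, then sum each window of
// `windowLen` consecutive batches, advancing by `slide` batches.
public class SlidingWindowSketch {

    static List<Integer> windowedSums(List<Integer> batches, int windowLen, int slide) {
        List<Integer> sums = new ArrayList<>();
        for (int start = 0; start + windowLen <= batches.size(); start += slide) {
            int sum = 0;
            for (int i = start; i < start + windowLen; i++) {
                sum += batches.get(i);
            }
            sums.add(sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        // Counts per 1-second micro-batch; window = 3 batches, slide = 1.
        List<Integer> batches = List.of(1, 2, 3, 4, 5);
        System.out.println(windowedSums(batches, 3, 1)); // [6, 9, 12]
    }
}
```

Note how consecutive windows overlap when the slide is shorter than the window length, which is exactly why Spark Streaming can compute them incrementally with reduceByWindow-style operations.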

NoSQL

  1. Introduction to Big Data/NoSQL
  2. NoSQL overview
  3. CAP theorem
  4. When is NoSQL appropriate
  5. NoSQL ecosystem
  6. Cassandra Basics
  7. Cassandra nodes, clusters, datacenters
  8. Keyspaces, tables, rows and columns
  9. Partitioning, replication, tokens
  10. Quorum and consistency levels
  11. Cassandra drivers
  12. Introduction to Java driver
  13. CRUD (Create/Read/Update/Delete) operations using Java client
  14. Asynchronous queries

Data Modeling

  1. Introduction to CQL
  2. CQL data types
  3. Creating keyspaces and tables
  4. Choosing columns and types
  5. Choosing primary keys
  6. Data layout for rows and columns
  7. Time to live (TTL), create, insert, update
  8. Querying with CQL
  9. CQL updates
  10. Creating and using secondary indexes
  11. Denormalization and join avoidance
  12. Composite keys (partition keys and clustering keys)
  13. Time series data
  14. Best practices for time series data
  15. Counters
  16. Lightweight transactions (LWT)
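Several of the topics above (composite keys, clustering order, TTL, time-series layout) can be tied together in one small CQL sketch. The keyspace, table, and column names below are hypothetical, chosen for illustration; they are not from the course materials.

```sql
-- Hypothetical time-series schema: names are illustrative only.
CREATE KEYSPACE IF NOT EXISTS metrics
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- Composite primary key: the partition key (sensor_id, day) bounds
-- partition size; the clustering key (ts) orders readings within it.
CREATE TABLE IF NOT EXISTS metrics.readings (
  sensor_id text,
  day       date,
  ts        timestamp,
  value     double,
  PRIMARY KEY ((sensor_id, day), ts)
) WITH CLUSTERING ORDER BY (ts DESC);

-- Insert with a 30-day time-to-live; the row expires automatically.
INSERT INTO metrics.readings (sensor_id, day, ts, value)
VALUES ('s-1', '2015-06-01', '2015-06-01 12:00:00', 21.5)
USING TTL 2592000;

-- Query a single partition, newest readings first (clustering order).
SELECT ts, value FROM metrics.readings
WHERE sensor_id = 's-1' AND day = '2015-06-01'
LIMIT 10;
```

Bucketing the partition key by day is one common best practice for time-series data: it keeps partitions from growing without bound while still allowing efficient in-partition range queries.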