Hadoop Developer Training with MapReduce

Public Classroom

Summary

Hadoop MapReduce

This 3-day hands-on Hadoop training course is designed for experienced developers and provides a fast track to building reliable, scalable applications on the open-source Hadoop platform. In this class we introduce the Hadoop frameworks and tools that are geared specifically toward processing large datasets. Practical case studies demonstrated in class show how Hadoop is used in the real world today to solve a variety of problems. MapReduce training is an essential component of this course.

Duration

3 days

Course Objectives

By the end of this Hadoop course, students should be able to:

  • Understand Hadoop's main components and architecture
  • Be comfortable working with the Hadoop Distributed File System (HDFS)
  • Understand the MapReduce abstraction and how it works
  • Program a MapReduce job
  • Master Hadoop input and output data formats
  • Be comfortable working with MapReduce counters and joins
  • Optimize a Hadoop cluster for best performance based on specific job requirements
  • Handle Hadoop component failures and recoveries
  • Become familiar with related Hadoop projects: HBase, Hive and Pig
  • Know best practices for using Hadoop in the enterprise

Audience

This course is designed for developers with some programming experience (preferably Java) who are seeking a solid foundation in Hadoop architecture. Prior knowledge of Hadoop is not required.

Pre-requisites

Hadoop frameworks and tools are Java-based, so a basic knowledge of the Java 2 SDK is assumed. To get the maximum return on your training investment, you will need a laptop that can compile and run the supplied material.

Outline

Introduction to Hadoop

  • The amount of data processed in today's world
  • What Hadoop is and why it is important
  • Hadoop comparison with traditional systems
  • Hadoop history
  • Hadoop main components and architecture

Hadoop Distributed File System (HDFS)

  • HDFS overview and design
  • HDFS architecture
  • HDFS file storage
  • Component failures and recoveries
  • Block placement
  • Balancing the Hadoop cluster

Hadoop Deployment

  • Different Hadoop deployment types
  • Hadoop distribution options
  • Hadoop competitors
  • Hadoop installation procedure
  • Distributed cluster architecture
  • Lab1: Hadoop Installation

Working with HDFS

  • Ways of accessing data in HDFS
  • Common HDFS operations and commands
  • The HDFS command-line interface
  • Internals of a file read in HDFS
  • Data copying with ‘distcp’
  • Lab2: Working with HDFS

Map-Reduce Abstraction

  • What MapReduce is and why it is popular
  • The big picture of MapReduce
  • MapReduce process and terminology
  • MapReduce component failures and recoveries
  • Working with MapReduce
  • Lab3: Working with MapReduce
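The phases introduced above (map, shuffle, reduce) can be illustrated with a small plain-Java simulation. This is a conceptual sketch only, not the Hadoop API: in a real job the framework performs the shuffle and invokes your code, and the "max value per key" aggregation and the sample records here are illustrative assumptions.

```java
import java.util.*;

// Plain-Java simulation of the MapReduce data flow: map emits
// (key, value) pairs, the shuffle groups them by key, and reduce
// aggregates each group. Here the reduce finds the max value per key.
public class MapReduceFlow {
    static Map<String, Integer> maxPerKey(List<String> input) {
        // Map phase: parse each "key,value" record into a (key, value) pair
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : input) {
            String[] parts = line.split(",");
            mapped.add(new AbstractMap.SimpleEntry<>(parts[0], Integer.parseInt(parts[1])));
        }
        // Shuffle phase: group values by key (Hadoop does this for you)
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapped)
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        // Reduce phase: one aggregation call per key
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
            result.put(e.getKey(), Collections.max(e.getValue()));
        return result;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("1950,22", "1950,31", "1951,28", "1951,19");
        System.out.println(maxPerKey(input)); // one (key, max) entry per key
    }
}
```

Because each reduce call sees only the values for a single key, the aggregation can be distributed across many machines, which is the essence of the abstraction.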

Programming MapReduce Jobs

  • Java MapReduce implementation
  • Map() and Reduce() methods
  • Java MapReduce calling code
  • Lab4: Programming Word Count
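In the Hadoop Java API, a job is written as a Mapper class with a map() method and a Reducer class with a reduce() method, plus driver code that configures and submits the job. As a dependency-free sketch of the word-count logic behind Lab4 (plain Java mirroring that structure, not the actual Hadoop classes):

```java
import java.util.*;

// Plain-Java sketch of word count, mirroring the map()/reduce()
// structure of a Hadoop job without any Hadoop dependencies.
public class WordCountSketch {
    // map(): tokenize one input line and emit (word, 1) per token
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+"))
            if (!word.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(word, 1));
        return out;
    }

    // reduce(): sum all counts emitted for one word
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    // Driver: run map over every line, shuffle by key, then reduce
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (String line : lines)
            for (Map.Entry<String, Integer> kv : map(line))
                shuffled.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffled.entrySet())
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("to be or not to be")));
        // {be=2, not=1, or=1, to=2}
    }
}
```

In the real API the same two methods appear on subclasses of Mapper and Reducer and emit pairs through a context object rather than returning collections.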

Input/Output Formats and Conversion Between Different Formats

  • Default Input and Output formats
  • Sequence File structure
  • Sequence File Input and Output formats
  • Sequence File access via the Java API and HDFS
  • MapFile
  • Lab5: Input Format
  • Lab6: Format Conversion

MapReduce Features

  • Joining Data Sets in MapReduce Jobs
  • How to write a Map-Side Join
  • How to write a Reduce-Side Join
  • MapReduce Counters
  • Built-in and user-defined counters
  • Retrieving MapReduce counters
  • Lab7: Map-Side Join
  • Lab8: Reduce-Side Join
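The reduce-side join idea can also be sketched in the same plain-Java style (a simulation, not the Hadoop API): the map phase tags each record with its source dataset, the shuffle brings all records sharing a join key to one reduce call, and the reducer pairs the two sides. The "users"/"orders" datasets and record layout here are illustrative assumptions.

```java
import java.util.*;

// Plain-Java sketch of a reduce-side join: map tags each record with
// its source ("U:" or "O:"), shuffle groups by the join key, and
// reduce crosses the two sides for each key.
public class ReduceSideJoinSketch {
    static List<String> join(List<String> users, List<String> orders) {
        // Map phase: emit (joinKey, taggedRecord) for both inputs;
        // records are "id,value" strings
        Map<String, List<String>> shuffled = new TreeMap<>();
        for (String u : users) {
            String[] p = u.split(",", 2);
            shuffled.computeIfAbsent(p[0], k -> new ArrayList<>()).add("U:" + p[1]);
        }
        for (String o : orders) {
            String[] p = o.split(",", 2);
            shuffled.computeIfAbsent(p[0], k -> new ArrayList<>()).add("O:" + p[1]);
        }
        // Reduce phase: for each key, separate the tags and pair them up
        List<String> joined = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : shuffled.entrySet()) {
            List<String> us = new ArrayList<>(), os = new ArrayList<>();
            for (String v : e.getValue())
                (v.startsWith("U:") ? us : os).add(v.substring(2));
            for (String u : us)
                for (String o : os)
                    joined.add(e.getKey() + "," + u + "," + o);
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String> users = Arrays.asList("1,alice", "2,bob");
        List<String> orders = Arrays.asList("1,book", "1,pen", "2,lamp");
        System.out.println(join(users, orders));
    }
}
```

A map-side join avoids this shuffle entirely by loading the smaller dataset into memory on each mapper, which is faster but only works when one side fits in memory.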

Hadoop Cluster Configuration

  • Hadoop configuration overview and important configuration files
  • Configuration parameters and values
  • HDFS parameters
  • MapReduce parameters
  • Hadoop environment setup
  • ‘Include’ and ‘Exclude’ configuration files
  • Lab9: MapReduce Performance Tuning
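The parameters tuned in exercises like Lab9 live in the cluster's *-site.xml files. A minimal illustrative fragment of mapred-site.xml is shown below (property names as in Hadoop 2.x; the values are examples for discussion, not recommendations):

```xml
<!-- mapred-site.xml: example MapReduce tuning parameters -->
<configuration>
  <!-- Memory buffer used when sorting map output, in MB -->
  <property>
    <name>mapreduce.task.io.sort.mb</name>
    <value>256</value>
  </property>
  <!-- Default number of reduce tasks per job -->
  <property>
    <name>mapreduce.job.reduces</name>
    <value>4</value>
  </property>
</configuration>
```

HDFS-level parameters such as block size and replication factor live in hdfs-site.xml in the same property/name/value format.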

Introduction to Hive, HBase and Pig

  • Hive as a data warehouse infrastructure
  • HBase as the “Hadoop Database”
  • Using Pig as a scripting language for Hadoop

Hadoop Case Studies

  • How different organizations use Hadoop clusters in their infrastructure