Hadoop Administration Training Course

Public Classroom

Summary

Hadoop Administration

This 3-day hands-on Hadoop for System Administrators class is designed for technical operations personnel who install and maintain production Hadoop clusters in the real world. We will cover Hadoop architecture and its components, the installation process, and the monitoring and troubleshooting of complex Hadoop issues. The class includes practical hands-on exercises and encourages open discussion of how enterprises dealing with large data sets are using Hadoop.

Duration

3 days

Course Objectives

By the end of this Hadoop class, students should be able to:

  • Understand Hadoop's main components and architecture
  • Be comfortable working with the Hadoop Distributed File System (HDFS)
  • Understand the MapReduce abstraction and how it works
  • Plan a Hadoop cluster
  • Deploy and administer a Hadoop cluster
  • Optimize a Hadoop cluster for the best performance based on specific job requirements
  • Monitor a Hadoop cluster and execute routine administration procedures
  • Deal with Hadoop component failures and recoveries
  • Get familiar with related Hadoop projects: HBase, Hive and Pig
  • Know best practices for using Hadoop in the enterprise world

Audience

This course is designed for system administrators and support engineers who will maintain and troubleshoot Hadoop clusters in production or development environments.

Pre-requisites

This course is designed for people with at least a basic level of Linux system administration experience. Prior knowledge of Hadoop is not required.

Outline

Introduction to Hadoop

  • The amount of data being processed today
  • What Hadoop is and why it is important
  • How Hadoop compares with traditional systems
  • Hadoop history
  • Hadoop's main components and architecture

Hadoop Distributed File System (HDFS)

  • HDFS overview and design
  • HDFS architecture
  • HDFS file storage
  • Component failures and recoveries
  • Block placement
  • Balancing the Hadoop cluster

Planning your Hadoop cluster

  • Planning a Hadoop cluster and its capacity
  • Hadoop software and hardware configuration
  • HDFS block replication and rack awareness
  • Network topology for a Hadoop cluster

Hadoop Deployment

  • Different Hadoop deployment types
  • Hadoop distribution options
  • Hadoop competitors
  • Hadoop installation procedure
  • Distributed cluster architecture
  • Lab: Hadoop Installation

Working with HDFS

  • Ways of accessing data in HDFS
  • Common HDFS operations and commands
  • Different HDFS commands
  • Internals of a file read in HDFS
  • Data copying with ‘distcp’
  • Lab: Working with HDFS

MapReduce Abstraction

  • What MapReduce is and why it is popular
  • The big picture of MapReduce
  • MapReduce process and terminology
  • MapReduce component failures and recoveries
  • Working with MapReduce

Hadoop Cluster Configuration

  • Hadoop configuration overview and important configuration files
  • Configuration parameters and values
  • HDFS parameters
  • MapReduce parameters
  • Hadoop environment setup
  • ‘Include’ and ‘Exclude’ configuration files
  • Lab: MapReduce Performance Tuning

Hadoop Administration and Maintenance

  • NameNode/DataNode directory structures and files
  • Filesystem image and edit log
  • The checkpoint procedure
  • NameNode failure and recovery procedure
  • Safe Mode
  • Metadata and data backup
  • Potential problems and solutions / What to look for…
  • Adding and removing nodes
  • Lab: MapReduce Filesystem Recovery

Hadoop Monitoring and Troubleshooting

  • Best practices of monitoring a Hadoop cluster
  • Using logs and stack traces for monitoring and troubleshooting
  • Using open-source tools to monitor a Hadoop cluster

Job Scheduling

  • How to schedule multiple Hadoop jobs on the same cluster
  • The default Hadoop FIFO scheduler
  • Fair Scheduler and its configuration

Introduction to Hive, HBase and Pig

  • Hive as a data warehouse infrastructure
  • HBase as the “Hadoop Database”
  • Using Pig as a scripting language for Hadoop

Hadoop Case Studies

  • How different organizations use Hadoop clusters in their infrastructure