Hive Training Course

Public Classroom

Summary

Hive is a system for querying and managing structured data, built on top of Hadoop. It uses Map-Reduce for execution, HDFS for storage, and rich data types (structs, lists and maps) to represent structured data. Hive lets you query data directly in different formats (text or binary) and file formats (flat or sequence files), using SQL as a familiar tool for standard analytics. Hive provides extensibility through embedded scripts for nonstandard applications, and it supports rich metadata for data discovery and optimization. This comprehensive one-day Hive training class gives you the skills you need to start using Hive in your project.

Duration

1 day

Course Objectives

By the end of this course, participants should be able to:

  • Understand the main concepts of using Hive
  • Create native and external Hive tables
  • Write SQL queries and learn some optimization tricks
  • Debug and resolve issues
  • Write pluggable Map-Reduce scripts
  • Learn important settings and some administration tasks

Audience

This course is designed for software engineers, Hadoop/Hive administrators, and data analysts.

Pre-requisites

Working knowledge of SQL, some knowledge of scripting languages, and a basic understanding of the Linux operating system.

Outline

Why Hive vs. regular Map-Reduce?

  • History
  • Definitions and terminology

Hive’s architecture and functionality

  • Services and interoperability with Hadoop
  • Query processor

Hive’s metadata

  • Creating new tables
  • Partitioned tables
  • Dynamic partitions
  • Tables with different serialization and encoding formats

Writing complex Hive queries

  • Different kinds of joins
  • Embedding custom scripts

Administration of running Hive queries

  • Hadoop permissions and groups
  • Enabling job scheduling and prioritization strategies
  • Setting controls on shared resources
  • Hive’s production-quality metadata storage and its backup
  • Overview of tools for job control flow

Advanced Hive functionality

  • Writing embedded Map/Reduce scripts
  • Map vs. Reduce and RAM vs. disk-write considerations
  • Writing embedded Java UDFs and UDAFs

Case studies and best practices

Instructor

Vladislav Tcheprasov

Vlad is a software engineer-scientist with over 10 years of experience in software architecture and object-oriented development, focusing on sophisticated, performance-sensitive data analysis problems. He is experienced at working with large amounts of data using Big Data solutions such as Hadoop and Hive, and at applying ML techniques. His specialties are data analysis and processing, applied machine learning, distributed computing, high-performance algorithms, and predictive modeling. At his current job, which focuses on online advertising using behavior analysis, Vlad has designed and developed an analytics-reporting platform based on Hadoop/MapReduce and Hive. Vlad holds an MS in Computer Science from Michigan State University and regularly attends Big Data conferences throughout the country.