Apache HBase Tutorial: Introduction to HBase

This Apache HBase tutorial gives you a clear picture of what HBase is and how it works. It covers both basic and advanced concepts and is designed for beginners as well as professionals.

The tutorial explains the basic concepts of HBase along with related topics such as what HBase is, its history, architecture, and components, and why to use HBase.

Prerequisites

There are no special requirements for learning the concepts of HBase. A basic understanding of using a terminal and applications is enough. The topics in this Apache HBase Tutorial are organized to help you learn HBase concepts from scratch.

Audience

Tutorials on TutorialsMate are designed to help beginners and professionals. Our HBase Tutorial will help beginners master HBase.

Problem

Our tutorial is designed by professionals, and we assure you that you will not find any kind of problem. In case there is any mistake, we request you to submit the problem using the contact form.

What You Will Learn

• What is Apache HBase?

• HBase History

• HBase Architecture

• HBase Components

• Why use HBase?

• HBase working

• Advantages of HBase

• Disadvantages of HBase

• Hadoop vs HBase

• HBase Installation

• Summary

What is Apache HBase?

Apache HBase is a distributed, scalable, non-relational (NoSQL) big data store that runs on top of HDFS. It is an open-source database that provides real-time read/write access to Hadoop data. It is column-oriented and horizontally scalable. HBase can host very large tables with billions of rows and millions of columns. It can combine data sources that use a wide variety of structures and schemas. HBase can store massive amounts of data, from terabytes to petabytes.

HBase's data model is similar to that of Google's Bigtable, on which HBase was modeled. It supports quick random access to huge amounts of structured data. HBase is written primarily in Java.
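
One way to picture the column-oriented, sparse layout described above is as a nested map. The following Python sketch is a toy in-memory model, not the HBase API: the class `ToyTable`, its methods, and column names such as `info:name` are illustrative assumptions. It shows that each row stores only the columns it actually has, and that a cell can keep several timestamped versions.

```python
from collections import defaultdict


class ToyTable:
    """Toy in-memory model of a Bigtable-style table (not the HBase API)."""

    def __init__(self):
        # row key -> column name -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, column, value, timestamp):
        """Store one version of a cell."""
        self._rows[row_key][column][timestamp] = value

    def get(self, row_key, column):
        """Return the newest value of a cell, or None if the cell is absent."""
        versions = self._rows.get(row_key, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def columns(self, row_key):
        """Return the sorted column names that a row actually stores."""
        return sorted(self._rows.get(row_key, {}))


table = ToyTable()
table.put("user1", "info:name", "Ada", timestamp=1)
table.put("user1", "info:name", "Ada L.", timestamp=2)   # newer version
table.put("user2", "info:email", "bob@example.com", timestamp=1)

latest_name = table.get("user1", "info:name")    # newest version wins
missing_cell = table.get("user2", "info:name")   # user2 has no such column
```

Because absent cells take up no space in this model, rows with very different sets of columns can share one table.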

History of HBase

The HBase story began in 2006, when the San Francisco-based startup Powerset started working on a natural language search engine for the Web. Then, in early 2007, Mike Cafarella dropped a tarball of thirty-odd Java files into the Hadoop issue tracker and added: “I’ve written some code for HBase, a BigTable-like file store. It’s not perfect, but it’s ready for other people to play with and examine.”

Jim Kellerman took Mike’s code dump and started filling in the gaps. He added many tests to get it into shape so that it could be committed as part of Hadoop. The first successful commit of the HBase code was made by Doug Cutting on April 3, 2007, under the contrib subdirectory. Later, the first “working” HBase release was bundled as part of Hadoop 0.15.0 in October 2007. HBase became a top-level Apache project in 2010.

Apache HBase Architecture

Apache HBase provides the features described in the original Google Bigtable paper, such as in-memory operation, Bloom filters, and compression. HBase tables can serve as both the input and the output of MapReduce jobs in the Hadoop ecosystem. The data can be accessed through the Java API, the REST API, or the Thrift and Avro gateways.

HBase is a column-oriented key-value data store that works well with the data that Hadoop processes. Read and write operations are comparatively fast, and performance holds up even when the datasets are very large. It is therefore widely used for its high performance and low input/output latency. HBase is not a replacement for a SQL database, but a SQL layer can be placed on top of HBase so that it can be integrated with various business intelligence and analytics tools.

Components of HBase

HMaster, HRegionServer, and Region are the main components of HBase.

HMaster

HMaster is the master server that monitors all the region servers in a cluster. It assigns regions (partitions of a table) to the region servers and handles load balancing across them.

HRegionServer

HRegionServer is a worker server that is responsible for serving and managing regions. Each region server serves a set of regions.

Region

A Region stores a subset of a table's data. If a table becomes too big, it is partitioned into multiple Regions.
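
The partitioning of a large table into Regions can be sketched in a few lines of Python. This is a simplified illustration, not HBase's actual split policy: the function name `partition_into_regions` and the row-count threshold are assumptions (HBase decides on splits using its own configurable policies). The sketch keeps halving any region that holds too many rows.

```python
def partition_into_regions(sorted_rows, max_rows_per_region):
    """Split a list of rows sorted by row key into contiguous regions.

    Any region holding more than max_rows_per_region rows is cut at its
    middle row until every region is small enough.
    """
    if max_rows_per_region < 1:
        raise ValueError("max_rows_per_region must be at least 1")

    regions = [sorted_rows]
    while any(len(region) > max_rows_per_region for region in regions):
        next_regions = []
        for region in regions:
            if len(region) > max_rows_per_region:
                mid = len(region) // 2
                next_regions.extend([region[:mid], region[mid:]])
            else:
                next_regions.append(region)
        regions = next_regions
    return regions


rows = [(f"row{i:03d}", i) for i in range(10)]
regions = partition_into_regions(rows, max_rows_per_region=3)
region_sizes = [len(region) for region in regions]
```

Each resulting region covers a contiguous range of row keys, and joining the regions in order gives back the original table.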

Why use HBase?

HBase supports large amounts of data by running on clusters. It was designed for storing data and accessing it at the same time. The data is distributed across the cluster automatically: sharding divides the data across multiple servers, and each server acts as the source for a subset of the data. Because the servers work on their subsets in parallel, HBase can scale by adding servers.

HBase can host very large tables for interactive and batch analytics. It is a great choice to store multi-structured or sparse data. Apache HBase can be used when there is a need for random, real-time read/write access for big data. It is natively integrated with Hadoop and can work seamlessly with other data access engines such as Apache Spark, Apache Hive, and MapR Database.

A table in a popular web application may contain billions of rows. When a particular row has to be found in such a huge amount of data, HBase is a good choice. Many online analytics applications use HBase. It addresses the performance requirements of very large databases that many traditional data models could not meet.

How HBase Works

HBase scales linearly by requiring every table to have a primary key (the row key). The key space is divided into sequential blocks, each of which is assigned to a region. RegionServers host one or more regions, so the total load is spread across the cluster. When a region grows too large, HBase can divide it further by splitting it automatically, so manual data sharding is not necessary.
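
Because each region covers a sequential block of the key space, finding the region for a row key comes down to a search over sorted start keys. The Python sketch below illustrates the idea only. The class `RegionMap`, the server names `rs1` and `rs2`, and the start keys are hypothetical, and a real HBase client obtains this information from the cluster rather than from a hard-coded list.

```python
import bisect


class RegionMap:
    """Maps a row key to the server hosting the region that contains it."""

    def __init__(self, regions):
        # regions: list of (start_key, server_name); the first region must
        # start at the empty string so that every key falls into a region.
        self._regions = sorted(regions)
        self._start_keys = [start for start, _ in self._regions]
        if not self._start_keys or self._start_keys[0] != "":
            raise ValueError('the first region must have start key ""')

    def locate(self, row_key):
        """Return the server for the region whose key range holds row_key."""
        # Find the last region whose start key is <= row_key.
        index = bisect.bisect_right(self._start_keys, row_key) - 1
        return self._regions[index][1]


region_map = RegionMap([("", "rs1"), ("g", "rs2"), ("p", "rs1")])
server_for_apple = region_map.locate("apple")   # falls in ["", "g")
server_for_melon = region_map.locate("melon")   # falls in ["g", "p")
server_for_zebra = region_map.locate("zebra")   # falls in ["p", end)
```

Splitting a region only adds one more start key to the sorted list, so lookups keep working unchanged.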

HMaster and ZooKeeper servers make information about the cluster topology available to clients. Clients connect to them and download a list of RegionServers. Each RegionServer keeps a memstore that buffers recent writes in memory before they are flushed to disk, and a block cache that keeps frequently read data in memory.
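
In HBase, the memstore holds recent writes in memory and is flushed to disk once it grows past a threshold. The Python sketch below is a toy model of that behavior under stated assumptions: the class `ToyMemStore` is hypothetical, the threshold counts entries rather than bytes, and the "files" are plain in-memory lists rather than files on HDFS.

```python
class ToyMemStore:
    """Toy write buffer that flushes to immutable sorted 'files' when full."""

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold
        self.buffer = {}          # recent writes, held in memory
        self.flushed_files = []   # each flush produces one sorted list

    def put(self, key, value):
        """Buffer a write; flush when the buffer reaches the threshold."""
        self.buffer[key] = value
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the buffer out as a sorted, immutable list and clear it."""
        if self.buffer:
            self.flushed_files.append(sorted(self.buffer.items()))
            self.buffer = {}

    def get(self, key):
        """Check the in-memory buffer first, then the newest files first."""
        if key in self.buffer:
            return self.buffer[key]
        for flushed in reversed(self.flushed_files):
            for stored_key, value in flushed:
                if stored_key == key:
                    return value
        return None


store = ToyMemStore(flush_threshold=2)
store.put("b", 1)
store.put("a", 2)   # the buffer reaches the threshold and is flushed
store.put("b", 3)   # a newer value for "b" stays in memory
```

Reads consult the buffer before the flushed files, so the newest value for a key is always the one returned.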

Advantages of HBase

• It is linearly and modularly scalable across various nodes. It provides seamless and quick scaling to meet additional requirements.
• It has a fully distributed architecture and works on extremely large-scale data.
• It is highly secure and provides easy management of data.
• It provides high write throughput.
• It can be used for both structured and semi-structured data types.
• HBase provides consistent read/write operations.
• It is good to use when you don’t need full RDBMS capabilities.
• It provides atomic read and write operations at the row level, so a reader never sees a partially applied update to a row.
• Table sharding is easy to configure and automate.
• Client access is seamless with the Java API.
• It provides Thrift and REST API support for non-Java front ends, with encoding options such as XML, Protobuf, and binary data encoding.
• It provides a block cache and Bloom filters for real-time queries and high-volume query optimization.
• HBase provides automatic failover support between RegionServers.
• It supports exporting metrics to files through the Hadoop metrics subsystem.
• It doesn’t enforce a relationship within your data.
• It supports storing and retrieving data with random access.
• MapReduce jobs can be backed by HBase tables.
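
The Bloom filters mentioned in the list above let a store skip files that cannot contain a requested key. The following Python sketch shows the general technique, not HBase's own implementation: the class `BloomFilter`, its default sizes, and the use of SHA-256 to derive bit positions are assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = True

    def might_contain(self, key):
        """False means definitely absent; True means possibly present."""
        return all(self.bits[position] for position in self._positions(key))


row_filter = BloomFilter()
row_filter.add("row-42")
found = row_filter.might_contain("row-42")   # always True after add()
```

A negative answer is definitive, so a read can skip that file entirely. A positive answer still requires checking the file itself.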

Disadvantages of HBase

• HBase does not support partial keys completely.
• It is difficult to store large binary files in HBase.
• HBase storage offers limited support for real-time queries and sorting.
• It allows only one default sort per table.
• Security features for controlling how different users access data in Apache HBase have been slow to improve.
• Table contents can only be searched by key (key lookups and range lookups), which limits the queries that can be performed in real time.
• HBase has no default indexing. Programmers have to write several lines of code or scripts to provide indexing functionality.
• It is expensive in terms of hardware requirements and memory allocation.
• Migrating data from external RDBMS (Relational Database Management System) sources to HBase servers requires a new design, which takes a lot of time.

Difference between Hadoop/HDFS and HBase

HDFS	HBase
It is a distributed file system that is well suited to storing large data files.	It is a database built on top of HDFS that provides fast record lookups (and updates) for large tables.
It does not support fast individual record lookups.	HBase provides fast lookups for larger tables.
It supports high-latency batch processing.	HBase supports low-latency random access to single rows among billions of records.
It provides only sequential access to data.	HBase provides random access and stores the data in sorted, indexed files on HDFS for faster lookups.
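
The difference between sequential access and an indexed lookup can be illustrated with a small Python sketch. It is a simplified model and not HBase's file format: the function names, the block size, and the row keys are hypothetical. Rows are stored sorted in fixed-size blocks, and a small index of each block's first key lets a lookup jump straight to the one block that can hold the key.

```python
import bisect


def build_block_index(sorted_rows, block_size):
    """Cut sorted (key, value) rows into blocks and index each first key."""
    blocks = [sorted_rows[i:i + block_size]
              for i in range(0, len(sorted_rows), block_size)]
    first_keys = [block[0][0] for block in blocks]
    return blocks, first_keys


def indexed_lookup(blocks, first_keys, row_key):
    """Use the index to pick one block, then scan only that block."""
    block_number = bisect.bisect_right(first_keys, row_key) - 1
    if block_number < 0:
        return None   # row_key sorts before every stored key
    for key, value in blocks[block_number]:
        if key == row_key:
            return value
    return None


def sequential_lookup(sorted_rows, row_key):
    """Scan from the start, as a plain sequential read would."""
    for key, value in sorted_rows:
        if key == row_key:
            return value
    return None


rows = [(f"row{i:04d}", i * 10) for i in range(100)]
blocks, first_keys = build_block_index(rows, block_size=10)
via_index = indexed_lookup(blocks, first_keys, "row0042")
via_scan = sequential_lookup(rows, "row0042")
```

Both lookups return the same value, but the indexed version reads at most one block of ten rows instead of scanning up to all one hundred.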

Apache HBase Installation

HBase can be installed on Ubuntu when Hadoop and Java are already installed.

The step-by-step guide for installing HBase in standalone mode is given below:

Download HBase

Step 1
Open the Apache HBase download page and click on a mirror site to download HBase.

Step 2
Select the version you want to download. Prefer the latest stable version.

Step 3
Click on hbase-x.x.x-bin.tar.gz to start downloading the tar file. Copy the tar file into an installation location.

Installation Process

• Place hbase-x.x.x-bin.tar.gz in /home/hduser

• Extract it by executing the command:


$ tar -xvf hbase-x.x.x-bin.tar.gz

This command extracts the archive and creates an hbase-x.x.x folder in /home/hduser.

• Open the file hbase-env.sh (in the conf directory of the HBase folder) and set the JAVA_HOME path in it.

• Replace the existing JAVA_HOME value with the path of your Java installation, as shown below:


export JAVA_HOME=/usr/lib/jvm/java-x.x.x

• Open the file ~/.bashrc in the same way and set the HBASE_HOME path as:


export HBASE_HOME=/home/hduser/hbase-x.x.x
export PATH=$PATH:$HBASE_HOME/bin

• Open the hbase-site.xml file and update the following properties within the configuration:


<property>
<name>hbase.rootdir</name>
<value>file:///home/hduser/HBASE/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/hduser/HBASE/zookeeper</value>
</property>

In this way, we have added two properties:

1. The HBase root directory

2. The data directory for ZooKeeper

All HBase and ZooKeeper activities directly refer to this hbase-site.xml file.
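
As a quick sanity check, the two properties can be read back with a short Python script that uses only the standard library. This is a sketch under assumptions: the helper name `read_properties` is hypothetical, and the sample text assumes the properties sit inside the usual `<configuration>` root element of hbase-site.xml. To check a real file, pass its contents to the function instead of the sample string.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Return a {name: value} dict of the <property> entries in the text."""
    root = ET.fromstring(xml_text)
    properties = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            properties[name.strip()] = (value or "").strip()
    return properties


# Sample content mirroring the two properties configured in this tutorial.
SAMPLE_HBASE_SITE = """<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/hduser/HBASE/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/hduser/HBASE/zookeeper</value>
  </property>
</configuration>"""

settings = read_properties(SAMPLE_HBASE_SITE)
for name in ("hbase.rootdir", "hbase.zookeeper.property.dataDir"):
    print(name, "=", settings.get(name, "<missing>"))
```

A value reported as missing usually points to a typo in the property name or to a property placed outside the root element.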

• Now, run the start-hbase.sh script from the hbase-x.x.x/bin directory to start HBase. To check whether HMaster is running, use the jps command.

• The HBase shell can be started with the “hbase shell” command. It opens an interactive shell in which HBase commands can be run.

Summary

Hadoop deployments keep growing, and HBase is the database platform that works on top of HDFS. After learning HBase, you can perform various operations, use the load utility to load a file, integrate HBase with Hive, and work with the HBase API and the HBase shell.

Hence, in this Apache HBase Tutorial, we discussed a brief introduction of HBase. Moreover, we saw HBase architecture, components, advantages & disadvantages, and the need for HBase.