Apache Helix Tutorial

What is Apache Helix?

Apache Helix is an open-source, generic cluster management framework developed at LinkedIn. It automates the management of replicated, partitioned, and distributed resources hosted on a cluster of nodes. Helix automatically reassigns resources in the face of node failure and recovery, cluster expansion, and reconfiguration.

Features of Apache Helix

Helix provides the following key features:
Assignment of resources/partitions to nodes automatically
Detection and recovery of node failure
Addition of resources dynamically
Addition of nodes to the cluster dynamically
Availability of pluggable distributed state machines to manage the state of a resource via state transitions

Apache Helix Architecture

Helix separates distributed system components into 4 logical roles, as shown in the figure below:

[Figure: Apache Helix Architecture]
These are logical components that can be physically co-located within the same process. Each logical component has an embedded agent that interacts with the other components via ZooKeeper.

Components Highlights

Controller

The controller is the brain of the system. It hosts the state machine engine, runs the execution algorithm, and issues transitions against the distributed system.

Participant

A participant is responsible for executing state transitions. Whenever the controller initiates a state transition, the participant triggers the callbacks implemented by the distributed system. For example, a search system implements callbacks such as offlineToBootstrap and BootstrapToOnline, which are invoked when the controller fires the respective transitions. In the above figure, p1 and p2 indicate the partitions of the Resource (Database, Index, and Task). Different colors indicate the state (e.g., master/slave) of the partition.
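To make the callback idea concrete, here is a minimal sketch in plain Python. It does not use the Helix API; the class name, method names, and log messages are assumptions chosen for illustration. It only shows how a participant might map a controller-issued transition onto the matching callback.

```python
class SearchPartitionStateModel:
    """Hypothetical participant-side state model for one index partition."""

    def __init__(self, partition):
        self.partition = partition
        self.state = "OFFLINE"
        self.log = []

    # Callbacks named after the transitions mentioned in the text.
    def offline_to_bootstrap(self):
        self.log.append(f"{self.partition}: loading index snapshot")

    def bootstrap_to_online(self):
        self.log.append(f"{self.partition}: serving queries")

    def handle_transition(self, from_state, to_state):
        """Dispatch a controller-issued transition to the matching callback."""
        if from_state != self.state:
            raise ValueError(f"expected state {self.state}, got {from_state}")
        name = f"{from_state.lower()}_to_{to_state.lower()}"
        callback = getattr(self, name, None)
        if callback is None:
            raise ValueError(f"no callback for {from_state} -> {to_state}")
        callback()
        self.state = to_state


# Simulate the controller firing two transitions for partition p1.
model = SearchPartitionStateModel("p1")
model.handle_transition("OFFLINE", "BOOTSTRAP")
model.handle_transition("BOOTSTRAP", "ONLINE")
print(model.state, model.log)
```

The application only supplies the callbacks; deciding when to fire each transition stays with the controller.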

Spectator

A spectator represents an external entity that needs to observe the state of the system. It is notified whenever the state of the system changes, so that it can interact with the appropriate participant. For example, a proxy/routing component (a spectator) can be notified when the routing table has changed, such as when a cluster is expanded or a node fails.
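The routing example can be sketched as a small observer in plain Python. This is not Helix code: the class, the `on_state_change` hook, and the shape of `cluster_view` (partition to node to state) are assumptions made for illustration.

```python
class RoutingSpectator:
    """Hypothetical spectator that routes each partition to its master."""

    def __init__(self):
        self.routing_table = {}

    def on_state_change(self, cluster_view):
        # Rebuild the table from scratch on every notification.
        self.routing_table = {
            partition: node
            for partition, replicas in cluster_view.items()
            for node, state in replicas.items()
            if state == "MASTER"
        }

    def route(self, partition):
        return self.routing_table[partition]


spectator = RoutingSpectator()
spectator.on_state_change({
    "p1": {"node_1": "MASTER", "node_2": "SLAVE"},
    "p2": {"node_1": "SLAVE", "node_2": "MASTER"},
})
first_route = spectator.route("p1")

# node_1 fails; the notification carries the updated cluster state.
spectator.on_state_change({
    "p1": {"node_2": "MASTER"},
    "p2": {"node_2": "MASTER"},
})
second_route = spectator.route("p1")
print(first_route, second_route)
```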

ZooKeeper

Helix relies on ZooKeeper to meet the following requirements:

1. Cluster state/metadata store.
2. Automatic notifications whenever cluster state changes.
3. A communication channel between the components.

Because all the metadata and cluster state is stored in ZooKeeper, the controller itself is stateless and easy to replace if it fails. ZooKeeper also provides the notification mechanism that fires whenever nodes start or stop. Finally, ZooKeeper is used to construct a reliable communication channel between the controller and the participants. The channel is modeled as a queue in ZooKeeper, with the controller and participants acting as producers and consumers on this queue.
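The producer/consumer shape of this channel can be illustrated with an in-memory stand-in. The sketch below uses no ZooKeeper at all; the `MessageChannel` class and the message format are assumptions, and the real channel additionally survives process restarts because it lives in ZooKeeper.

```python
from collections import deque


class MessageChannel:
    """In-memory stand-in for the per-participant queues kept in ZooKeeper."""

    def __init__(self):
        self._queues = {}

    def send(self, participant, message):
        # Producer side: append to the participant's queue.
        self._queues.setdefault(participant, deque()).append(message)

    def receive(self, participant):
        # Consumer side: pop the oldest message, or None if the queue is empty.
        pending = self._queues.get(participant)
        return pending.popleft() if pending else None


channel = MessageChannel()

# The controller produces two transition messages for one participant.
channel.send("node_1", {"partition": "p1", "from": "OFFLINE", "to": "SLAVE"})
channel.send("node_1", {"partition": "p1", "from": "SLAVE", "to": "MASTER"})

# The participant consumes them in the order they were sent.
first = channel.receive("node_1")
second = channel.receive("node_1")
print(first["to"], second["to"])
```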

Why use Helix?

For a long time, a single-node system was sufficient for most use cases. With the advent of big data, this is no longer enough: most applications need to run in a distributed setting. Distributed computing brings many challenges, such as scalability, partitioning, and fault tolerance. The evolution of such a system can be described as:
[Figure: Why use Apache Helix?]
Adding these capabilities is non-trivial, error-prone, and time-consuming. Each feature adds to the complexity of the system, but these capabilities are critical for production-ready systems expected to operate at a large scale. Most systems attempt this evolution, but very few reach the right end of the spectrum.

LinkedIn built many such distributed systems; the first one was Espresso, a NoSQL storage system. While working on these systems, the engineers observed a recurring need to support common patterns: tolerance to partition, hardware, and software failures, and operational tasks such as load balancing, bootstrapping, and scaling. These needs were the motivation for building a generic framework for developing distributed systems, which was called Helix. Developing a distributed system as a state machine with constraints on states and transitions has the following benefits:

It separates cluster management from the core functionality of the system.
It allows a quick transformation from a single node system to an operable, distributed system.
It increases simplicity. System components are not required to manage the global cluster.

How does Helix work?

To understand Helix, we must first learn about cluster management. A distributed system usually runs on multiple nodes for the following reasons:

Scalability
Fault tolerance
Load balancing

Each node can perform one or more of the primary functions of the cluster. Those functions may include storing and serving data, producing and consuming data streams, and so on. Once these functions are configured, Helix works as the global brain of the system. It makes the decisions that should not be made in isolation. Examples of decisions that require global knowledge and coordination include:

It can schedule maintenance tasks such as backups, file consolidation, garbage collection, and index rebuilds.
It can repartition data or resources across the cluster.
It throttles system tasks and changes.
It informs dependent systems of changes so that they can react appropriately to cluster changes.

These functions can be built into the distributed system itself, but doing so complicates the code. Helix instead abstracts the common cluster management tasks, enabling the system builder to model the expected behavior with a declarative state model and let Helix manage the coordination. The result is less new code to write and a robust, highly operable system.
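As an illustration of what "declarative" means here, the sketch below describes a master/slave state model as plain data and derives behavior from it. It is not the Helix state model API; the dictionary layout, the state names, and both helper functions are assumptions made for this example.

```python
from collections import Counter, deque

# Hypothetical declarative description: states, legal transitions, constraints.
MASTER_SLAVE_MODEL = {
    "states": ["OFFLINE", "SLAVE", "MASTER"],
    "transitions": {
        ("OFFLINE", "SLAVE"),
        ("SLAVE", "MASTER"),
        ("MASTER", "SLAVE"),
        ("SLAVE", "OFFLINE"),
    },
    "max_per_partition": {"MASTER": 1},
}


def transition_path(model, src, dst):
    """Return the shortest legal sequence of states from src to dst, or None."""
    frontier = deque([[src]])
    seen = {src}
    while frontier:
        path = frontier.popleft()
        if path[-1] == dst:
            return path
        for a, b in sorted(model["transitions"]):
            if a == path[-1] and b not in seen:
                seen.add(b)
                frontier.append(path + [b])
    return None


def satisfies_constraints(model, replica_states):
    """Check the per-partition upper bounds on each state."""
    counts = Counter(replica_states)
    return all(counts[s] <= limit
               for s, limit in model["max_per_partition"].items())


print(transition_path(MASTER_SLAVE_MODEL, "OFFLINE", "MASTER"))
print(satisfies_constraints(MASTER_SLAVE_MODEL, ["MASTER", "SLAVE", "SLAVE"]))
```

The application states only what is allowed; working out which transitions to issue, and in what order, is left to the coordinator.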

Usage of Helix in the LinkedIn Ecosystem

Helix has been under development at LinkedIn since April 2011. It is used in production in several systems:

Espresso

Espresso is a distributed, scalable, timeline-consistent document store that supports local secondary indexing and local transactions. It runs on a number of storage node servers that store and index data and answer queries. Espresso databases are horizontally partitioned across different nodes, and each partition has a specified number of replicas. Espresso designates one replica of each partition as master and the rest as slaves. There can be only one master for each partition at any time. Helix handles partition management, cluster-wide monitoring, and mastership transitions during planned upgrades and unplanned failures. If the master fails, a slave replica is promoted to be the new master of the partition.
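The failover rule for a single partition can be sketched as follows. This is an illustrative simplification, not Espresso or Helix code: the function name is hypothetical, and picking the first slave in sorted order is an assumption (a real system would choose based on replica freshness or load).

```python
def fail_over(replicas, failed_node):
    """Drop the failed node and promote a slave if the master was lost.

    replicas maps node name -> "MASTER" or "SLAVE" for one partition.
    """
    survivors = {n: s for n, s in replicas.items() if n != failed_node}
    if "MASTER" not in survivors.values():
        slaves = sorted(n for n, s in survivors.items() if s == "SLAVE")
        if slaves:
            # Assumption: any slave is eligible; take the first by name.
            survivors[slaves[0]] = "MASTER"
    return survivors


partition = {"node_1": "MASTER", "node_2": "SLAVE", "node_3": "SLAVE"}
after_master_loss = fail_over(partition, "node_1")
after_slave_loss = fail_over(partition, "node_3")
print(after_master_loss)
print(after_slave_loss)
```

In both cases the partition ends up with exactly one master, which is the invariant described above.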

Databus

Databus is a change data capture (CDC) system. It provides a common pipeline for transporting events from LinkedIn's primary databases to caches, indexes, and other applications. Databus deploys a cluster of relays that receive the changelogs from multiple databases and let consumers subscribe to the changelog stream. Databus maintains a buffer on a per-partition basis. It continuously tracks topology changes in the primary databases and automatically creates new buffers.

Databus Consumer

Each Databus partition is assigned to a consumer in such a way that the load stays balanced and every partition is handled by exactly one consumer at a time. The set of consumers may change over time, and a consumer may leave the group due to an outage. In such a case, its partitions must be reassigned while maintaining balance and the single-consumer-per-partition invariant. Applications such as data replicators and search indexer nodes follow this pattern.
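A minimal sketch of such an assignment, assuming a simple round-robin over sorted consumer names. This is not how Helix or Databus actually computes assignments; the function and the names are hypothetical, and a full recompute like this may move more partitions than strictly necessary.

```python
from collections import Counter


def assign_partitions(partitions, consumers):
    """Map every partition to exactly one consumer, spreading load evenly."""
    if not consumers:
        raise ValueError("at least one consumer is required")
    ordered = sorted(consumers)
    return {p: ordered[i % len(ordered)]
            for i, p in enumerate(sorted(partitions))}


partitions = [f"p{i}" for i in range(8)]
before = assign_partitions(partitions, ["c1", "c2", "c3"])

# c2 leaves the group; every partition is reassigned among the survivors.
after = assign_partitions(partitions, ["c1", "c3"])

print(Counter(before.values()))
print(Counter(after.values()))
```

Because the result is a mapping keyed by partition, the single-consumer-per-partition invariant holds by construction; the round-robin keeps the per-consumer load within one partition of each other.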

SEAS (Search As A Service)

SEAS allows other applications to define custom indexes on a selected dataset and then make those indexes searchable via a service API. The index is divided into partitions, and each partition has a configured number of replicas. Each new indexing service is assigned to a random set of servers, and the partition replicas must be distributed evenly across those servers. Whenever indexes are bootstrapped, the search service uses snapshots of the data source to build the new index partitions. Helix also limits the number of concurrent bootstraps in the system.
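The bootstrap limit can be pictured as a simple admission check. The sketch below is an assumption-laden illustration, not Helix's throttling mechanism: the function name, the partition names, and the limit of two are all hypothetical.

```python
def next_bootstraps(pending, running, max_concurrent):
    """Return the pending partitions that may start bootstrapping now."""
    free_slots = max(0, max_concurrent - len(running))
    return pending[:free_slots]


pending = ["idx_p0", "idx_p1", "idx_p2", "idx_p3", "idx_p4"]
waves = []
while pending:
    # Assume each wave finishes completely before the next one is admitted.
    wave = next_bootstraps(pending, running=[], max_concurrent=2)
    waves.append(wave)
    pending = pending[len(wave):]

print(waves)
```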

Pinot

Pinot is an online analytics engine that supports arbitrary roll-ups and drill-downs.
[Figure: Apache Helix Component Highlights]

Installation

Here is a short version of the Helix installation, followed by a demo that shows how it works:

Step 1

Download the Apache Helix release package using the link below:
https://helix.apache.org/0.6.2-incubating-docs/download.html

Step 2

Extract the package, then change to the tools directory:

cd helix/helix/helix-core/target/helix-core-pkg/bin

Step 3

From the bin directory entered in Step 2, run the demo using the command:

./quickstart.sh

This command shows the components working together. The demo does the following:

• Create a cluster
• Add two nodes/participants to the cluster
• Set up a resource with six partitions and two replicas: one master and one slave per partition
• Show the cluster state after Helix balances the partitions
• Add a third node
• Show the cluster state. Note that the third node has taken mastership of two partitions
• Kill the third node (Helix takes care of failover)
• Show the cluster state again. The two surviving nodes take over mastership of the partitions from the failed node.
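The mastership changes in the steps above can be imitated with a toy model. This sketch does not run Helix and tracks only which node masters each partition, not the slave replicas. The round-robin rule, the node names, and the helper function are assumptions; Helix's real rebalancer tries to minimize partition movement, which this full recompute does not.

```python
from collections import Counter


def assign_masters(num_partitions, nodes):
    """Toy rebalancer: spread mastership round-robin over the live nodes."""
    ordered = sorted(nodes)
    return {f"p{i}": ordered[i % len(ordered)] for i in range(num_partitions)}


# Two nodes, six partitions.
two_nodes = Counter(assign_masters(6, ["node_1", "node_2"]).values())

# A third node joins and takes mastership of two partitions.
three_nodes = Counter(
    assign_masters(6, ["node_1", "node_2", "node_3"]).values())

# The third node is killed; the survivors take its partitions back.
after_failure = Counter(assign_masters(6, ["node_1", "node_2"]).values())

print(two_nodes, three_nodes, after_failure)
```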

Step 4

In the initial setup, two nodes are set up, and partitions are rebalanced. The cluster state is described below:
[Figure: cluster state with two nodes]

Note: Here, we have only one master and one slave per partition.

Step 5

A third node is added, and the cluster is rebalanced. Now, the cluster state looks like:
[Figure: cluster state with three nodes]

Note: Now, there is one master and two slaves per partition because there are three nodes.

Step 6

Finally, a node is killed to simulate a failure. Helix makes sure that each partition has a master. The cluster state changes to:
[Figure: cluster state after the node failure]

This quick guide should give you the basic idea of how Helix works.

Summary

Building distributed systems from scratch is non-trivial, error-prone, and time-consuming. It is important to create the proper set of building blocks that can be leveraged across different systems. At LinkedIn, Helix started with Espresso and went on to support a variety of distributed systems, helping make them reliable and operable.


