Howto: Setting up a multi-node Storm cluster under Ubuntu

This tutorial explains how to set up a Storm cluster running on several Ubuntu machines. The Storm framework allows unbounded data streams to be processed in a distributed manner in real time.

The basic idea behind Storm's distributed real-time processing is to split the overall task into several smaller tasks which can be executed quickly. Each of these tasks is executed by a so-called bolt, one type of vertex in a directed processing graph named a topology. Bolts consume one or more input data streams and produce an output data stream. Besides bolts, a topology contains a second type of vertex named spouts, which transform an input stream into a data stream that can be processed by bolts. The directed edges in the topology represent the data flow from spouts to bolts or from bolts to bolts.[1]

In order to execute a topology, it is sent to the master node of the Storm cluster, which runs a daemon called "Nimbus". This daemon distributes the individual tasks across the cluster. Every machine in the cluster runs a daemon called "Supervisor", which receives the tasks assigned to it and executes them in one or more worker processes. The coordination between the Nimbus and the Supervisors is done through a ZooKeeper cluster.[1]

Figure 1: Overview of the cluster

Figure 1 gives an overview of the cluster which will be set up during this tutorial. It consists of four Ubuntu machines. Each of them is running a Supervisor daemon. The Nimbus daemon is running on the first machine and the ZooKeeper server on the second.

This tutorial was tested with the following software versions:

  • Ubuntu Linux 12.04.3
  • Java version 1.8.0-ea
  • ZooKeeper 3.4.5
  • ZeroMQ 2.1.7
  • Storm 0.8.2

Prerequisites

The following required software has to be installed on all machines.

Java 8

In order to install Java, two additional packages have to be installed first. Therefore, execute:

sudo apt-get install software-properties-common
sudo apt-get install python-software-properties

Then, the repository containing a build of Oracle Java 8 has to be registered and the local list of its software packages has to be updated:

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update

Finally, Java 8 can be installed:

sudo apt-get install oracle-java8-installer

SSH Server

SSH is not required for running Storm, but it simplifies installing and operating a Storm cluster because it provides remote access. Therefore, an SSH server should run on all machines.

First check if SSH is already running by executing

ssh address.of.machine

for each machine which should be involved in the cluster, where address.of.machine is the address of the respective machine. If the login on one of the machines fails, you have to install an SSH server on that machine by following the instructions on http://www.cyberciti.biz/faq/ubuntu-linux-openssh-server-installation-and-configuration/
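Instead of logging in to each machine by hand, the check can be scripted. The following is only a sketch, not part of the original instructions: the helper takes the probe command as its first argument (normally ssh) and the host names, which are placeholders here, as the remaining arguments.

```shell
check_ssh() {
    # $1 = command used to probe (normally "ssh"); remaining args = hosts
    probe=$1; shift
    failed=0
    for host in "$@"; do
        # BatchMode avoids hanging on a password prompt; a short timeout
        # keeps unreachable machines from blocking the loop
        if "$probe" -o BatchMode=yes -o ConnectTimeout=5 "$host" exit 2>/dev/null; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
            failed=1
        fi
    done
    return $failed
}

# Usage (on the machine you administer the cluster from):
#   check_ssh ssh address.of.machine01 address.of.machine02 \
#             address.of.machine03 address.of.machine04
```

Every machine reported as FAILED needs an SSH server installed as described above (or a key-based login set up, since BatchMode disables password prompts).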

Install ZooKeeper

In the current setting the ZooKeeper server should run on Machine02. Therefore, we first connect to this machine by executing:

ssh userName@address.of.machine02

userName is the name of the user used to log in on Machine02. If the user name is the same as on the machine from which you connect, the following command is enough:

ssh address.of.machine02

First, a folder has to be created which will be used by ZooKeeper to store data and create log files:

mkdir -p /path/to/zookeeper/data

In this tutorial, this path is referred to by the placeholder /path/to/zookeeper/data.

Now, ZooKeeper can be downloaded

wget http://mirror.synyx.de/apache/zookeeper/stable/zookeeper-3.4.5.tar.gz

and extracted

tar -xzf zookeeper-3.4.5.tar.gz

In order to configure the ZooKeeper server we first create a configuration file by copying a sample configuration.

cp zookeeper-3.4.5/conf/zoo_sample.cfg zookeeper-3.4.5/conf/zoo.cfg

This newly created configuration file zoo.cfg is edited by changing the value of dataDir to the new path for the ZooKeeper data.

dataDir=/path/to/zookeeper/data
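The same change can be made non-interactively with sed instead of a text editor. The first two lines only make this sketch self-contained; after the copy step above, zoo.cfg already exists and the sed line alone is needed.

```shell
# Guard lines so the sketch also runs standalone (normally not needed)
mkdir -p zookeeper-3.4.5/conf
[ -f zookeeper-3.4.5/conf/zoo.cfg ] || echo 'dataDir=/tmp/zookeeper' > zookeeper-3.4.5/conf/zoo.cfg

# Point dataDir at the folder created above (placeholder path from this tutorial)
sed -i 's|^dataDir=.*|dataDir=/path/to/zookeeper/data|' zookeeper-3.4.5/conf/zoo.cfg
```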

Now, the installation of ZooKeeper is complete. The ZooKeeper server can be started by executing

./zookeeper-3.4.5/bin/zkServer.sh start

In order to check whether the server is running correctly, the following commands can be executed; a healthy server answers the first one with imok:

echo ruok | nc addressOfServer 2181
echo stat | nc addressOfServer 2181

To shut down the server again simply type

./zookeeper-3.4.5/bin/zkServer.sh stop

Install ZeroMQ

The communication between the spouts and bolts of Storm is done with ZeroMQ. ZeroMQ is a socket library that transports messages between processes on the same or on different machines. First, the native C library of ZeroMQ has to be built and installed. Thereafter, the Java binding JZMQ, which calls the native ZeroMQ library, has to be built and installed. These steps have to be performed on all machines.

Build and install native ZeroMQ

In order to build ZeroMQ, the uuid-dev development package has to be installed.

sudo apt-get install uuid-dev

Now, the source code of ZeroMQ can be downloaded

wget http://download.zeromq.org/zeromq-2.1.7.tar.gz

and extracted

tar -xzf zeromq-2.1.7.tar.gz

After extracting ZeroMQ, it can be built and installed. Therefore, the previously extracted folder has to be entered

cd zeromq-2.1.7

the build environment has to be configured

./configure

the library has to be built

make

and finally installed

sudo make install

Now, the current folder can be left.

cd ..

Build and install JZMQ

JZMQ requires several further programs:

  • a program to download the source code from a remote repository
sudo apt-get install git
  • a program which provides a common interface to query installed libraries
sudo apt-get install pkg-config
  • a program for creating portable compiled libraries
sudo apt-get install libtool
  • and a program which creates portable makefiles: automake

The current version of automake could not be used, because it produces several errors when building JZMQ. In the setting described in this tutorial, version 1.11.1 works well. This version has to be downloaded

wget http://launchpadlibrarian.net/60345882/automake_1.11.1-1ubuntu1_all.deb

and installed

sudo dpkg --install automake_1.11.1-1ubuntu1_all.deb

The latter command produces an error caused by missing dependencies. These missing libraries can be installed by executing

sudo apt-get -f install

In order to prevent automake from being updated automatically, put the package on hold:

echo "automake hold" | sudo dpkg --set-selections

Now, JZMQ can be built and installed. Therefore, a remote repository has to be cloned which contains a version of JZMQ that is compatible with Storm:

git clone https://github.com/nathanmarz/jzmq.git

The newly created folder is entered

cd jzmq

Thereafter, JZMQ can be built

./autogen.sh
./configure
make

and installed

sudo make install

Even if there is no error, the installation is incomplete. In order to make JZMQ work properly with Storm, a missing jar has to be created

cd src
jar -cvf zmq.jar ./org/
sudo cp zmq.jar /usr/share/java/zmq.jar
cd ../..

Furthermore, some libraries have been stored at a location where Ubuntu does not find them. These libraries have to be copied to the correct location now.

sudo cp jzmq/perf/zmq-perf.jar /usr/share/java/zmq-perf.jar
sudo cp -R /usr/local/lib/. /usr/lib

Finally, Ubuntu has to refresh its library cache.

sudo ldconfig

Install Storm

After finishing the installation of ZeroMQ, Storm can be installed on all machines.

Storm is distributed as a zip archive. In order to extract it, unzip has to be installed.

sudo apt-get install unzip

Similar to the installation of ZooKeeper, a folder has to be created which will be used by Storm to store data:

mkdir -p /path/to/storm/data

In this tutorial, this path is referred to by the placeholder /path/to/storm/data.

Now, Storm can be downloaded

wget https://dl.dropboxusercontent.com/s/fl4kr7w0oc8ihdw/storm-0.8.2.zip

and extracted

unzip -q storm-0.8.2.zip

The extracted folder storm-0.8.2 contains the folder logs, in which the log files will be written, the folder bin, which contains the scripts to run Storm, and the folder conf, which contains the configuration files of Storm. In order to set up Storm correctly, the file conf/storm.yaml has to be modified:

storm.zookeeper.servers:
     - "address.of.machine02"
nimbus.host: "address.of.machine01"
storm.local.dir: "/path/to/storm/data"

The first two lines define the address of the ZooKeeper server, which runs on Machine02 in this tutorial. The third line sets the address of the Nimbus. Finally, the last line defines where Storm should store its data. Now, the Storm cluster is ready to start.
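The entries above can also be appended to storm.yaml non-interactively, which is convenient when setting up several machines. This is only a sketch: the mkdir line merely makes it self-contained (after unzipping Storm, the conf folder already exists), and the addresses and the data path are the placeholders used throughout this tutorial.

```shell
# Guard line so the sketch also runs standalone (normally not needed)
mkdir -p storm-0.8.2/conf

# Append the cluster settings; the quoted EOF keeps the text verbatim
cat >> storm-0.8.2/conf/storm.yaml <<'EOF'
storm.zookeeper.servers:
     - "address.of.machine02"
nimbus.host: "address.of.machine01"
storm.local.dir: "/path/to/storm/data"
EOF
```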

Run Storm

Before the Storm cluster can be started, the ZooKeeper server has to be run. Therefore, the following command has to be executed on Machine02:

./zookeeper-3.4.5/bin/zkServer.sh start

The following commands, which start the parts of the Storm cluster, do not run in the background. This means that each of them requires its own terminal. The programs can be stopped by pressing Ctrl + C.

The Storm cluster is started by running the Nimbus on Machine01 first:

./storm-0.8.2/bin/storm nimbus

Thereafter, a web server can be started on the same machine as the Nimbus. It provides a user interface for the Storm cluster, which can be accessed with a web browser under the URL http://address.of.machine01:8080.

./storm-0.8.2/bin/storm ui

Finally, the Supervisors have to be started on all machines.

./storm-0.8.2/bin/storm supervisor
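If keeping one terminal open per daemon is inconvenient, the daemons can instead be detached with nohup. This is only a convenience sketch, not part of the official instructions: run the first two commands on Machine01 only and the third on every machine, with the paths as in this tutorial.

```shell
# Start each daemon detached from the terminal; output goes to the log files
nohup ./storm-0.8.2/bin/storm nimbus > nimbus.log 2>&1 &
nohup ./storm-0.8.2/bin/storm ui > ui.log 2>&1 &
nohup ./storm-0.8.2/bin/storm supervisor > supervisor.log 2>&1 &
```

Since the daemons no longer react to Ctrl + C, they have to be stopped by killing their processes, e.g. with kill and the process IDs printed by jobs -l.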

Execute a topology

When a Storm topology has been created (see [1]), it has to be packed into a jar. All Java classes the topology depends on have to be included in this jar. This means that other jars have to be unjarred and the extracted classes integrated into the jar. Classes which are already part of Storm must not be included in the jar.[2]

The previously created topology can be executed by performing the following command on the same machine as the Nimbus:

./storm-0.8.2/bin/storm jar path/to/topology.jar qualified.name.of.MainClass aUniqueName

The topology can be stopped by executing

./storm-0.8.2/bin/storm kill aUniqueName

or using the user interface.

Further links

http://storm-project.net/

https://github.com/nathanmarz/storm/wiki

http://www.michael-noll.com/tutorials/running-multi-node-storm-cluster/

I. Bedini, S. Sakr, B. Theeten, A. Sala and P. Cogan, "Modeling Performance of a Parallel Streaming Engine: Bridging Theory and Costs", Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering (ICPE '13), pages 173-184, ACM, New York, NY, USA, 2013.

References


  1. Marz, N. (2011). Tutorial. Retrieved Nov 22, 2013, from https://github.com/nathanmarz/storm/wiki/Tutorial
  2. Marz, N. (2013). Running topologies on a production cluster. Retrieved Nov 22, 2013, from https://github.com/nathanmarz/storm/wiki/Running-topologies-on-a-production-cluster