Big Data/Cassandra


Apache Cassandra is a distributed, scalable NoSQL database management system with a wide-column data model. By 2015, it had become one of the world's most popular DBMSs^[1].

Installation

The Java sources are available at https://github.com/apache/cassandra, and a tarball can be downloaded from http://cassandra.apache.org/download/.

On macOS: brew install cassandra && brew services start cassandra

See also http://cassandra.apache.org/doc/latest/getting_started/installing.html for more information.

To launch the server:

On Linux: /cassandra/bin/cassandra
On Windows: \cassandra\bin\cassandra.bat

Graphical user interface

There are several GUIs for managing Cassandra. One example is Helenos: its Java sources are available at https://github.com/tomekkup/helenos, and a compiled version at http://sourceforge.net/projects/helenos-gui/.

It includes an Apache + Tomcat server, which is started with \helenos\bin\startup.bat. The web interface should then be available at http://localhost:8080 (login: admin / password: admin).

NB: Helenos can create column families, but it cannot display those that were created with CQL.

Data manipulation

Cassandra introduced the Cassandra Query Language (CQL) in 2011^[2]^[3]. You can interact with it through the cqlsh client, which lets you create keyspaces and tables, insert rows, and query tables, among other operations. The CQL 3.0 syntax looks like this^[4]:

CREATE KEYSPACE MyBase1 WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
USE MyBase1;

CREATE TABLE MyTable1 (
    id text,
    FirstName text,
    LastName text,
    PRIMARY KEY (id)
);

INSERT INTO MyTable1 (id, LastName) VALUES ('1', 'Test');

SELECT * FROM MyTable1;

DROP TABLE MyTable1;

Additional notes:

There is no autoincrement option.
Field names are not case-sensitive (unless they are written in double quotes).
Inserting a new record with an existing primary key overwrites the old one without any warning.
When inserting more than 1,000 records, cqlsh may ignore the rest. For bulk loads, the sstableloader ETL tool is recommended instead.
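
The missing autoincrement and the silent overwrite can both be handled on the client side. The following Python sketch builds a parameterised INSERT whose key is a client-generated UUID, and optionally appends IF NOT EXISTS, which asks Cassandra not to apply the insert when the key already exists. The helper name build_insert and the %s placeholder style (the one used by the DataStax Python driver for non-prepared statements) are assumptions made for illustration; the code only builds the statement and does not contact a server.

```python
import uuid


def build_insert(table, columns, values, if_not_exists=False):
    """Build a parameterised CQL INSERT keyed by a client-generated UUID.

    Hypothetical helper, for illustration only. Returns (statement, parameters).
    Table and column names are interpolated directly, so they must come from
    trusted code, never from user input.
    """
    if len(columns) != len(values):
        raise ValueError("columns and values must have the same length")
    row_id = str(uuid.uuid4())  # stands in for the missing autoincrement
    names = ", ".join(["id", *columns])
    marks = ", ".join(["%s"] * (len(columns) + 1))
    statement = f"INSERT INTO {table} ({names}) VALUES ({marks})"
    if if_not_exists:
        # Without this clause an existing row with the same key is overwritten.
        statement += " IF NOT EXISTS"
    return statement, (row_id, *values)


statement, params = build_insert("MyTable1", ["LastName"], ["Test"], if_not_exists=True)
print(statement)
print(params)
```

The returned pair can be passed to whatever client library executes the query. The id column is assumed to be of type text, as in the table above, which is why the UUID is converted to a string.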

Cassandra port usage

7000: cluster communication^[5]
7001: cluster communication when SSL is enabled^[6]
7199: JMX (8080 before Cassandra 0.8.xx)^[7]
9042: CQL native clients
9160: Thrift client API^[8]

How to use several nodes

For one server to communicate with another, the following ports must be open^[9]: 7000, 7001 (SSL), 7199, 9042 and 9160.
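
Before debugging a cluster, it can help to verify from one machine that these ports are reachable on another. The following Python sketch attempts a TCP connection to each port from the "Cassandra port usage" list. The function names, the 2-second timeout and the 127.0.0.1 address are assumptions for illustration; replace the address with the node to test.

```python
import socket

# Ports taken from the "Cassandra port usage" list.
CASSANDRA_PORTS = {
    7000: "cluster communication",
    7001: "cluster communication (SSL)",
    7199: "JMX",
    9042: "CQL native clients",
    9160: "Thrift client API",
}


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_node(host, ports=CASSANDRA_PORTS, timeout=2.0):
    """Return a {port: reachable} mapping for the given host."""
    return {port: is_port_open(host, port, timeout) for port in ports}


if __name__ == "__main__":
    for port, reachable in check_node("127.0.0.1").items():
        state = "open" if reachable else "closed"
        print(f"{port} ({CASSANDRA_PORTS[port]}): {state}")
```

A closed port does not necessarily indicate a firewall problem: 7001 is only used when SSL is enabled, and 9160 only when the Thrift API is active.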

There is no master node, so fail-over is automatic. To join the distributed architecture, each node must list at least one "seed node" in its configuration. Seed nodes are listed under seed_provider in \cassandra\conf\cassandra.yaml, while \cassandra\conf\cassandra-rackdc.properties describes the node's datacenter and rack.

To let the nodes communicate, set the endpoint_snitch parameter in cassandra.yaml to RackInferringSnitch (the default is SimpleSnitch).

The list of nodes can then be displayed with:

On Linux: /cassandra/bin/nodetool status
On Windows: \cassandra\bin\nodetool.bat status

NB: when a keyspace is created with a replication_factor greater than one, each row is stored on several nodes, so the nodes become redundant (mirroring).
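
To make the effect of replication_factor concrete, the following Python sketch imitates how SimpleStrategy chooses replicas: the partition key is hashed to a position on a ring, the first node at or after that position owns the row, and the remaining copies go to the next nodes clockwise. This is a simplified illustration, not Cassandra's implementation: the MD5-based token function, the derivation of node tokens from node names, and the names token and replicas_for are all assumptions.

```python
import bisect
import hashlib


def token(value):
    """Map a string to a ring position (MD5-based; an assumption for illustration)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def replicas_for(key, nodes, replication_factor):
    """Return the nodes that would hold a copy of `key`.

    The first node whose token is >= the key's token owns the row; further
    copies go to the next nodes clockwise, wrapping around the ring.
    """
    if not nodes:
        raise ValueError("at least one node is required")
    ring = sorted((token(name), name) for name in nodes)
    tokens = [t for t, _ in ring]
    # A keyspace cannot have more distinct replicas than there are nodes.
    copies = min(replication_factor, len(ring))
    start = bisect.bisect_left(tokens, token(key)) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(copies)]


nodes = ["node1", "node2", "node3"]  # hypothetical node names
for rf in (1, 2, 3):
    print(rf, replicas_for("1", nodes, rf))
```

With a replication factor of 1, losing the owning node makes the row unavailable; with 2 or more, another node in the returned list still holds a copy.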

Related Technologies

Amazon Dynamo^[10] - uses similar concepts, such as data distribution and fault tolerance
BigTable - uses a similar data model (column families)
Redis - in-memory key-value database^[11]
MongoDB

References

Apache Cassandra - home page
A. Lakshman and P. Malik "Cassandra: a decentralized structured storage system" ACM SIGOPS Operating Systems Review, Volume 44 Issue 2, April 2010, Pages 35-40, ACM New York, NY, USA

[1] http://db-engines.com/en/ranking
[2] https://grokbase.com/t/cassandra/user/1162fkpwx2/release-0-8-0
[3] https://docs.datastax.com/en/cql/3.3/cql/cqlIntro.html
[4] https://cassandra.apache.org/doc/cql3/CQL.html
[5] http://cassandra.apache.org/doc/latest/faq/index.html#what-ports
[6] http://cassandra.apache.org/doc/latest/faq/index.html#what-ports
[7] https://stackoverflow.com/questions/2359159/cassandra-port-usage-how-are-the-ports-used
[8] https://stackoverflow.com/questions/2359159/cassandra-port-usage-how-are-the-ports-used
[9] http://docs.datastax.com/en/cassandra/2.0/cassandra/initialize/initializeSingleDS.html
[10] https://en.wikipedia.org/wiki/Amazon_DynamoDB
[11] https://en.wikipedia.org/wiki/Redis
