Db - Cassandra

About

Cassandra is a NoSQL wide-column (column-oriented) database.

The NoSQL approach to data modeling is query-centric: specific queries define the structure. Data is arranged as one table per query, and the same data is repeated across many tables, a process known as denormalization.

A table is designed to satisfy a query that should support a process (user registration, user login, …)

Joins are not supported. Either they are performed client side, or all required fields (columns) are made available in a single table.
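
The query-per-table idea can be sketched in plain Python. This is only an in-memory illustration, not Cassandra code: the names users_by_email, users_by_id and register_user are hypothetical, and each dict stands in for one table keyed by the column its query filters on.

```python
# Hypothetical in-memory stand-ins for two Cassandra tables.
# Each "table" is keyed by the column that its query filters on.
users_by_email = {}  # serves the login query (lookup by email)
users_by_id = {}     # serves the profile query (lookup by id)


def register_user(user_id, email, name):
    """Write the same user to every table that needs it (denormalization)."""
    row = {"user_id": user_id, "email": email, "name": name}
    users_by_email[email] = dict(row)
    users_by_id[user_id] = dict(row)


register_user(1, "ada@example.com", "Ada")

# Each query reads exactly one table, so no join is needed.
login_row = users_by_email["ada@example.com"]
profile_row = users_by_id[1]
print(login_row["name"], profile_row["email"])
```

The cost of this design is that every write must update all copies, which is the trade-off denormalization accepts in exchange for join-free reads.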

There is no concept of foreign keys or relational integrity.

In contrast, a relational database’s approach to data modeling is table-centric.

Data Structure

  • Keyspace: defines how a dataset is replicated, for example in which datacenters and how many copies. Keyspaces contain tables.
  • Table: defines the typed schema for a collection of partitions. New columns can be added to a Cassandra table with zero downtime. Tables contain partitions, which contain rows, which contain columns.
  • Partition: defines the mandatory part of the primary key all rows in Cassandra must have. All performant queries supply the partition key in the query.
  • Row: contains a collection of columns identified by a unique primary key made up of the partition key and optionally additional clustering keys.
  • Column: a single datum with a type, which belongs to a row.
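
As a rough mental model, the hierarchy above can be pictured as nested dictionaries. The sketch below is an assumption-laden illustration, not how Cassandra stores data: the read_column helper is hypothetical, and the values are borrowed from the table t created in the Installation section.

```python
# keyspace -> table -> partition -> row -> column, as nested dicts.
keyspace = {
    # The keyspace carries the replication settings.
    "replication": {"class": "SimpleStrategy", "replication_factor": 1},
    "tables": {
        "t": {
            # partition key value -> rows of that partition
            0: {
                # clustering key value -> columns of that row
                0: {"v": "val0"},
                1: {"v": "val1"},
            },
        },
    },
}


def read_column(ks, table, partition_key, clustering_key, column):
    """Walk the hierarchy from the table down to a single column value."""
    return ks["tables"][table][partition_key][clustering_key][column]


print(read_column(keyspace, "t", 0, 1, "v"))  # val1
```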

Primary key

PRIMARY KEY (
   (partitionCol1, ..., partitionColN),
   clusteringCol1, ..., clusteringColN
)
  • The combination of these columns defines the uniqueness of a record in the database.
  • partitionCol1, …, partitionColN form the partition key, which determines where the data lives inside the cluster (known as data locality). When data is inserted into the cluster, the first step is to apply a hash function to the partition key, which produces a token. This token is used to determine which node (and replicas) will get the data. The goal is to distribute the data evenly.
  • clusteringCol1, …, clusteringColN are clustering columns that specify the default order inside a partition (i.e. the default order of the rows returned by a query). The CLUSTERING ORDER BY clause of a CREATE TABLE statement can set the direction (ASC, DESC) of each clustering column. Clustering order is a pre-sorting feature: rows are stored already sorted.
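
The two roles of the primary key can be sketched in Python. This is a simplified model under stated assumptions: MD5 stands in for Cassandra's actual partitioner hash, placement uses a modulo over a hypothetical node list rather than a token ring, and the names NODES, token, node_for and insert are illustrative only.

```python
import hashlib
from bisect import bisect_left

NODES = ["node-a", "node-b", "node-c"]  # hypothetical three-node cluster


def token(partition_key):
    """Hash the partition key to a token (MD5 is only a stand-in hash)."""
    digest = hashlib.md5(repr(partition_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def node_for(partition_key):
    """Simplified placement: modulo over the node list, not a token ring."""
    return NODES[token(partition_key) % len(NODES)]


partitions = {}  # partition key -> list of (clustering key, value), kept sorted


def insert(partition_key, clustering_key, value):
    """Store a row at its sorted position; the same primary key overwrites."""
    rows = partitions.setdefault(partition_key, [])
    keys = [ck for ck, _ in rows]
    i = bisect_left(keys, clustering_key)
    if i < len(rows) and rows[i][0] == clustering_key:
        rows[i] = (clustering_key, value)    # same primary key: upsert
    else:
        rows.insert(i, (clustering_key, value))  # pre-sorted on write


insert(("alice",), 3, "c")
insert(("alice",), 1, "a")
insert(("alice",), 2, "b")
insert(("alice",), 2, "b2")  # overwrites the row with clustering key 2

print(node_for(("alice",)))
print(partitions[("alice",)])  # [(1, 'a'), (2, 'b2'), (3, 'c')]
```

Reads within a partition then need no sorting step, which is the point of the pre-sorting remark above.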

Time Series

Cassandra supports modelling time series data, wherein each row can have a dynamic number of columns (in CQL terms, each partition can hold a variable number of clustered rows).

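CLUSTERING ORDER BY is particularly useful in time series data models. As an example:

  • with the following primary key and CLUSTERING ORDER BY clause (end of the CREATE TABLE statement, newest rows first)
PRIMARY KEY (userid, added_date, videoid)
) WITH CLUSTERING ORDER BY (added_date DESC, videoid ASC);
  • we can quickly access the N most recently added items, because each partition is already sorted by added_date in descending order.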
SELECT * FROM user_videos WHERE userid = 522b1fe2-2e36-4cef-a667-cd4237d08b89 LIMIT 10;
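
Why LIMIT returns the newest items can be sketched with a small in-memory model. It assumes the partition is kept sorted by added_date descending (then videoid ascending); the add_video and latest helpers, the user id and the video ids are all hypothetical.

```python
from datetime import datetime, timedelta

user_videos = {}  # userid -> list of (added_date, videoid), newest first


def add_video(userid, added_date, videoid):
    """Insert a row and keep the partition ordered added_date DESC, videoid ASC."""
    rows = user_videos.setdefault(userid, [])
    rows.append((added_date, videoid))
    # Two stable sorts: secondary key first, then the primary key descending.
    rows.sort(key=lambda r: r[1])
    rows.sort(key=lambda r: r[0], reverse=True)


def latest(userid, limit):
    """Rough equivalent of: SELECT ... WHERE userid = ? LIMIT limit."""
    return user_videos.get(userid, [])[:limit]


start = datetime(2024, 1, 1)
for day in range(15):
    add_video("user-1", start + timedelta(days=day), f"video-{day}")

newest = latest("user-1", 10)
print(newest[0][1], newest[-1][1])  # video-14 video-5
```

Because the ordering is fixed when the data is written, the read is a simple prefix slice of the partition.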

Installation

docker run ^
  --name cassandra ^
  -d ^
  cassandra:3.11.5 
  • then open a shell in the container and start cqlsh
docker exec -it cassandra bash
cqlsh localhost
  • Query
SELECT cluster_name, listen_address FROM system.local;
 cluster_name | listen_address
--------------+----------------
 Test Cluster |     172.17.0.3
  • Create a keyspace (a keyspace is the Cassandra name for a SQL schema). Important: default and schema are built-in words that cannot be used as a keyspace name, otherwise you get: SyntaxException: line 1:16 no viable alternative at input 'schema' (create keyspace [schema]…)
create keyspace mySchema with replication = {'class':'SimpleStrategy','replication_factor':1};
use mySchema;
CREATE TABLE t (
    pk int,
    t int,
    v text,
    s text static,
    PRIMARY KEY (pk, t)
);

INSERT INTO t (pk, t, v, s) VALUES (0, 0, 'val0', 'static0');
INSERT INTO t (pk, t, v, s) VALUES (0, 1, 'val1', 'static1');

SELECT * FROM t;
 pk | t | s       | v
----+---+---------+------
  0 | 0 | static1 | val0
  0 | 1 | static1 | val1
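
Both rows show static1 because a static column holds one value per partition, and the second INSERT overwrote it. A minimal Python sketch of that behaviour follows; the storage layout and the insert and select_all helpers are assumptions made for illustration, not Cassandra internals.

```python
partitions = {}  # pk -> {"static": {...}, "rows": {t: {...}}}


def insert(pk, t, v, s):
    """Mimic INSERT INTO t (pk, t, v, s): s is shared by the whole partition."""
    part = partitions.setdefault(pk, {"static": {}, "rows": {}})
    part["static"]["s"] = s     # one value per partition, last write wins
    part["rows"][t] = {"v": v}  # one value per row


def select_all():
    """Mimic SELECT * FROM t, returning (pk, t, s, v) tuples."""
    result = []
    # Simplification: Cassandra orders partitions by token, not by key value.
    for pk in sorted(partitions):
        part = partitions[pk]
        for t in sorted(part["rows"]):  # rows follow the clustering order
            result.append((pk, t, part["static"]["s"], part["rows"][t]["v"]))
    return result


insert(0, 0, "val0", "static0")
insert(0, 1, "val1", "static1")

for row in select_all():
    print(row)
# (0, 0, 'static1', 'val0')
# (0, 1, 'static1', 'val1')
```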

Client

