Cassandra NoSql Database

About

Cassandra is a NoSql database for transactional workloads that require high scale and maximum availability.

Cassandra is suited for transactional workloads at high volume and shouldn’t be considered as a data warehouse

Model

Query First Design

You don’t start with the data model; you start with the query model.

It is designed to work on a denormalized data model.

Data is arranged as one query per table, and data is repeated amongst many tables, a process known as denormalization.

A table is designed to satisfy a query that should support a process (user registration, user login, …)

Joins are not supported,

they are performed client side
(preferred) or a second table was created to represent with the join results

Performing joins on the client should be a very rare case; you really want to duplicate (denormalize) the data instead.

There is no concept of foreign keys or relational integrity.

In contrario, a relational database’s approach to data modeling is table-centric.

Materialized views are available (since 3.0) allows to create multiple denormalized views of data and keeps it in sync with the table

A key goal is to minimize the number of partitions that must be scanned in order to satisfy a query. Because the partition is a unit of storage that does not get divided across nodes, a query that searches a single partition will typically yield the best performance

Sorting

The sort order available on queries is fixed, and is determined entirely by the selection of clustering columns you supply in the CREATE TABLE command.

The CQL SELECT statement does support ORDER BY semantics, but only in the order specified by the clustering columns (ascending or descending).

Procedure

The user interface design for the application is often a great artifact to use to begin identifying queries

You create a logical model containing a table for each query, capturing entities and relationships from the conceptual model

To name each table, identify the primary entity type for which you are querying, and use that to start the entity name. If you are querying by attributes of other related entities, you append those to the table name, separated with _by_; for example, hotels_by_poi.

Next, identify the primary key for the table with:

the partition key columns based on the required query attributes,
and the clustering columns in order to guarantee uniqueness and support desired sort ordering

If any of the other additional attributes are the same for every instance of the partition key, mark the column as static. For instance, if a name is in the partition key, you could had a description attribute as static column.

Artem Chebotko Diagram helps visualize the relationships between queries and tables:

Each table is shown with its title and a list of columns.
Primary key columns are identified via symbols such as K for partition key columns and C↑ or C↓ to represent clustering columns.
Lines are shown entering tables or between tables to indicate the queries that each table is designed to support

¹⁾

If you’re searching by attribute, this attribute should be a part of the primary key.

KKV store

It's not a KV store (key value) but a KKV store ²⁾

First Key is the partition key (the partition identifier mades up of the node and the partition file) - this key should be almost used in all query.
Second Key is the clustering key (the row identifier inside the partition - sorting key) - this key should be chronological (ie a surrogate key that can be sorted by time).
V: value

Example:

CREATE TABLE messages (
   channel_id bigint,
   bucket int,
   message_id bigint,
   author_id bigint,
   content text,
   PRIMARY KEY ((channel_id, bucket), message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

where:

channel_id, bucket is the partition key
message_id is the clustering key

Data Structure

Keyspace: defines how a dataset is replicated, for example in which datacenters and how many copies. Keyspaces contain tables.
Table: defines the typed schema for a collection of partitions. Cassandra tables have flexible addition of new columns to tables with zero downtime. Tables contain partitions, which contain partitions, which contain columns.
Partition: defines the mandatory part of the primary key all rows in Cassandra must have. All performant queries supply the partition key in the query.
Row: contains a collection of columns identified by a unique primary key made up of the partition key and optionally additional clustering keys.
Column: A single datum with a type which belong to a row.

Primary key

PRIMARY KEY (
   (partitionCol1, ..., partitionColN),
   clusteringCol1, ..., clusteringColN
)

the columns combination define the uniqueness of the record in the database.
partitionCol1, …, partitionColN defines the partition that define the data localization inside the cluster (known as data locality). When data is inserted into the cluster, the first step is to apply a hash function that generate the partition key. This key is used to determine what node (and replicas) will get the data. The goal is to distribute the data evenly.
clusteringCol1, …, clusteringColN are clustering columns that specifies the default order inside a partition (ie the default order returned by a query). The CLUSTERING ORDER BY clause can be used to specify it with directionality (ASC, DESC) in a CREATE table statement. Clustering order is a pre-sorting feature.

Time Serie

Cassandra has support for modelling time series data wherein each row can have dynamic number of columns.

When CLUSTERING ORDER BY is used in time series data models, As an example:

with the following CLUSTERING ORDER BY

PRIMARY KEY (userid, added_date, videoid)

we can quickly access the last N items inserted.

SELECT * FROM user_videos WHERE userid = 522b1fe2-2e36-4cef-a667-cd4237d08b89 LIMIT 10;

Installation

Database

Docker documentation - (code)

docker run ^
  --name cassandra ^
  -d ^
  cassandra:3.11.5

then

docker exec -it cassandra bash

and Cqlsh

cqlsh localhost

Query

SELECT cluster_name, listen_address FROM system.local;

cluster_name | listen_address
--------------+----------------
 Test Cluster |     172.17.0.3

Create a keyspace (A keyspace is the cassandra name for a SQL schema) - default, schema are built-in words that cannot be used otherwise you get: SyntaxException: line 1:16 no viable alternative at input 'schema' (create keyspace [schema]…)

create keyspace mySchema with replication = {'class':'SimpleStrategy','replication_factor':1};
use mySchema;

Create a table

CREATE TABLE t (
    pk int,
    t int,
    v text,
    s text static,
    PRIMARY KEY (pk, t)
);

INSERT INTO t (pk, t, v, s) VALUES (0, 0, 'val0', 'static0');
INSERT INTO t (pk, t, v, s) VALUES (0, 1, 'val1', 'static1');

SELECT * FROM t;

pk | t | s       | v
----+---+---------+------
  0 | 0 | static1 | val0
  0 | 1 | static1 | val1

Db Proxy

https://stargate.io/

¹⁾

Reference Artem Chebotko Diagram

²⁾

KKV store