Distributed - Database / Application

Data System Architecture


Distributed systems is the opposite of a single node (ie computer).


Scale across the system bus before you scale across the network with distributed systems. It is all the same design principles. Think of design as fractal.


In a distributed application, different pieces of the app are called “services.”

A service for:

  • storing application data in a database,
  • image transcoding in the background (after a user uploads something)
  • for the front-end



  • How do we divide the work across machines (network latency), data locality (moving data may be very expensive),


dealing with failures. If a server fails on average every three years, with 10,000 nodes in our cluster we'll see 10 faults per day.

The simplest solution is to just launch another task, either on that machine if it's recovered, or on another machine.


dealing with stragglers (much more common than failure). (Nodes|Task) that have not failed, but are just running very slowly.

The simplest solution is to launch another task (on a different machine if needed) and then kill the original task.

Distributed system

Two-Phase Commit

The Two-Phase Commit is fairly standard for synchronous processing in order to avoid inconsistent state in a distributed environment.

Eventually consistent

Distributed Database - Eventual consistency (Weak)

CAP theorem

Distributed Database - CAP Theorem (Consistency, Availability, Partition Tolerance)

With the CAP theorem in minde, distributed system has two priority strategy:

  • prioritized availability - NoSQL
  • prioritized consistency - newSQL


The holy grail of distributed data processing

  • Submit (commit) locally: queue and manage your jobs task locally, leverage your local resources
  • Run Globally: Acquire any resource that is capable and willing to run your job/task


Documentation / Reference

Discover More
Data System Architecture
Column Family Store

s are NoSql store that clusters the data by a set of key columns. The data is then partitioned / distributed across multiple machines according to the key columns. Storage is sparse since only columns...
Consistent Hashing
Cryptography - Hash

A hash function is an encryption crypto algorithm that takes as data as input (possibly large and of variable-sized) and produces a short fixed-length integer value (generally printed as an hexadecimal...
Card Puncher Data Processing
Data Integration - Synchronization

duplicate of of ? Ensure that all instances of a repository (database, file system, ...) contain the same data. Its not a trivial task when the data is volatile. Replication is the process of copying...
Card Puncher Data Processing
Data Processing - Replication

Replication: Having a copy of the same data on multiple machines (nodes) in order to increase : Feature Example Performance serve reads in parallel, distributing application workloads across multiple...
Data System Architecture
Data Properties and Transactions

This section is : transactions are the mechanism that guarantee data ACID (atomicity, consistency, isolation, durability) properties during concurrent data changes (ie multiple user or program...
Data System Architecture
Data Property - ACID (atomicity, consistency, isolation, durability)

In computer science, ACID (atomicity, consistency, isolation, durability) is a set of properties that guarantee transactions are processed reliably. They defines a concurrency model. Traditional database...
Data System Architecture
Distributed - Tracing

tracing in a distributed world. Instrumenting asynchronous application for distributed tracing is quite challenging because most tracing libraries rely on Thread-local_storagethread local storage...
Data System Architecture
Distributed -Two-Phase Commit

The Two-Phase Commit is fairly standard for synchronous processing in order to avoid inconsistent state in a distributed environment. 1) User wants to Update; 2) Prepare; 3) Write to Log; 4) Ready;...
Cap Theorem Database Type
Distributed Database - CAP Theorem (Consistency, Availability, Partition Tolerance)

This theorem from Eric Brewer in 2000, followed up later by Lynch in 2002 state that a distributed database can't get all these three notions at the same time: consistency - data is the same for every...
Card Puncher Data Processing
Docker - Service

Services in the context of docker are really just containers. To define, run just write a docker-compose.yml file and run it as an app List running services associated with an app List tasks...

Share this page:
Follow us:
Task Runner