A distributed system is the opposite of a single node (i.e. a single computer).
Scale up across the system bus, within one machine, before you scale out across the network with a distributed system. The same design principles apply at both levels. Think of the design as fractal.
In a distributed application, different pieces of the app are called “services.”
A service for:
dealing with failures. If a server fails on average once every three years, then with 10,000 nodes in our cluster we should expect 10,000 / (3 × 365) ≈ 9, so roughly 10, faults per day.
The simplest solution is to just launch another task, either on that machine if it's recovered, or on another machine.
dealing with stragglers (much more common than failures): nodes or tasks that have not failed, but are just running very slowly.
The simplest solution is to launch another task (on a different machine if needed) and then kill the original task.
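The "launch another task, then kill the original" idea can be sketched as follows. This is only a minimal illustration, assuming tasks run as threads on one machine and cooperate with a cancel signal (Python threads cannot be killed forcibly). The names run_with_backup, demo_task and straggler_timeout are hypothetical and not taken from any particular framework.

```python
import concurrent.futures as cf
import threading


def run_with_backup(task, straggler_timeout):
    """Run task(attempt, cancel); start a backup attempt if the first is slow.

    Returns the result of whichever attempt finishes first. An exception in
    the winning attempt is simply re-raised; a real scheduler would retry.
    """
    cancel = threading.Event()  # shared signal that tells the loser to stop
    pool = cf.ThreadPoolExecutor(max_workers=2)
    try:
        attempts = [pool.submit(task, 0, cancel)]
        # Give the original attempt a grace period before calling it a straggler.
        done, _ = cf.wait(attempts, timeout=straggler_timeout)
        if not done:
            # Original is too slow: launch a second copy and race the two.
            attempts.append(pool.submit(task, 1, cancel))
            done, _ = cf.wait(attempts, return_when=cf.FIRST_COMPLETED)
        winner = next(iter(done))
        return winner.result()
    finally:
        cancel.set()              # "kill" whichever attempt is still running
        pool.shutdown(wait=True)  # wait for the cancelled attempt to exit


def demo_task(attempt, cancel):
    """Hypothetical task: attempt 0 simulates a straggler, attempt 1 is fast."""
    delay = 5.0 if attempt == 0 else 0.05
    if cancel.wait(timeout=delay):  # returns True if cancelled while "working"
        return None
    return f"result from attempt {attempt}"


if __name__ == "__main__":
    print(run_with_backup(demo_task, straggler_timeout=0.1))
```

In a real cluster the backup would be placed on a different machine, and the same relaunch path also covers the plain failure case: a task that never reports back looks like a very slow one.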
Two-Phase Commit (2PC) is a fairly standard protocol for synchronous processing, used to avoid inconsistent state in a distributed environment.
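A rough in-memory sketch of the two phases is shown below. It assumes a single coordinator and participants that never crash or time out; a real implementation also needs durable logs and timeout handling. The Participant class and two_phase_commit function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """Hypothetical resource manager taking part in one transaction."""
    name: str
    can_commit: bool = True  # whether this participant will vote "yes"
    state: str = "init"

    def prepare(self) -> bool:
        # Phase 1: vote. A "yes" vote is a promise to commit if asked to.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants) -> bool:
    """Coordinator logic: commit only if every participant votes yes."""
    # Phase 1 (voting): ask every participant to prepare.
    votes = [p.prepare() for p in participants]
    # Phase 2 (completion): a unanimous yes commits, anything else aborts.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False


if __name__ == "__main__":
    nodes = [Participant("db"), Participant("queue", can_commit=False)]
    print(two_phase_commit(nodes), [p.state for p in nodes])
```

Because every participant ends up either committed or aborted together, no node is left holding a partial update, which is the inconsistency the protocol is meant to prevent.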
Distributed Database - CAP Theorem (Consistency, Availability, Partition Tolerance)
With the CAP theorem in mind, a distributed system has two priority strategies:
The holy grail of distributed data processing