Modelling a system
We use upper bounds to describe the time complexity of algorithms: an upper bound describes the worst case for a given way of solving the problem, so it acts as a performance guarantee. Lower bounds are harder to prove, but they are a powerful tool for understanding fundamental limitations and model assumptions.
If the lower bound for a given problem is X, then X is the minimum complexity of all possible solutions. The point of establishing a lower bound for a problem is to know whether a better solution may still exist.
It seems that when we want to use a system to solve a problem, we need to clarify:
- The goal of the system
- The assumptions that presuppose its success
- The limitations of the system
- What is possible and what is impossible; we need proofs for this.
- Once we know the limitations, we can define what a successfully operating system looks like.
- Lower bounds on problems help us reason about the minimum complexity of any solution; they prove what can and cannot be achieved.
An algorithm tells the system what to do.
A protocol is a set of rules that determines how a system functions.
Dealing with failure
The FLP proof from 1985 shows that in an asynchronous model it is impossible for a process to tell whether another process has failed or is merely slow. As a consequence, with even one potentially faulty process we cannot guarantee consensus, since we may never be able to communicate with that process.
Two army problem
- Two armies, A and B, want to attack army C.
- A sends the message: “let's attack at dawn.”
- B replies with “OK.” The reply arrives at A, but A realizes that B does not know whether the “OK” reached A.
- B is not confident it should attack, since it doesn't know whether its message was received by A.
- Even if A sends the message “received B's confirmation,” A cannot be sure it will arrive.
This problem demonstrates that even with non-faulty processes, agreement is not possible over unreliable communication channels.
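A toy simulation of this regress, with a made-up loss probability: every message is dropped independently, and whoever sent the most recent message is always left without an acknowledgement for it, so certainty is never reached.

import random

def run_ack_chain(max_rounds, p_loss=0.2):
    # Alternate messages A -> B, B -> A, ... Return how many arrived
    # before the first loss; that message's sender is left unsure.
    for sent in range(max_rounds):
        if random.random() < p_loss:
            return sent
    # Even if every message arrived, the final one is unacknowledged,
    # so its sender still cannot be certain.
    return max_rounds

print(run_ack_chain(10))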
This is why we place upper bounds on communication time: a process is considered faulty if it does not respond within that bounded time.
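A minimal sketch of such timeout-based failure detection, where the bound and the heartbeat bookkeeping are hypothetical, not from any particular system:

import time

TIMEOUT = 2.0  # assumed upper bound (in seconds) on a reply

def is_faulty(last_heartbeat, now=None):
    # Declare a process faulty if it hasn't replied within the bound.
    now = time.monotonic() if now is None else now
    return now - last_heartbeat > TIMEOUT

# Record time.monotonic() whenever a heartbeat arrives, then poll:
last_heartbeat = time.monotonic()
print(is_faulty(last_heartbeat))  # False right after a heartbeat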
A fail-stop failure is one where the failed process stops communicating. A Byzantine fault is one where the faulty process continues to communicate but may produce faulty information.
To make a consensus algorithm work under fail-stops, we need an algorithm that lets all processes decide the same value even if each process receives at most (n - t) answers, where n is the number of processes and t of them may be faulty.
A fail-stop resilient algorithm (a toy code sketch follows this list):
- A process broadcasts its preferred value and the number of processes it has seen with that value (this count is called the cardinality and is initialised to 1).
- The process receives (n - t) answers, each containing a preferred value and a cardinality.
- The process may change its preferred value to the one preferred by other processes; it updates the corresponding cardinality to the number of responses it received with that value, plus itself.
- This continues until a process receives t messages of a single value with a cardinality of n/2, meaning half the system has agreed on the same value.
- As the number of phases approaches infinity, the probability that consensus has not been reached approaches 0.
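A single-machine toy simulation of this phase structure, assuming n = 7 processes of which up to t = 2 may be silent; the message sampling and the simplified stopping rule (loop until every process holds the same value) are assumptions made for illustration, not the exact decision rule above.

import random
from collections import Counter

N, T = 7, 2  # n processes, at most t of them faulty

def run_phase(values):
    # Every process broadcasts its value; each process then "receives"
    # only n - t of those messages (simulating t silent peers) and
    # adopts the most common value it saw (its count plays the role
    # of the cardinality above).
    msgs = list(values.values())
    new_values = {}
    for pid in values:
        sample = random.sample(msgs, N - T)
        value, _cardinality = Counter(sample).most_common(1)[0]
        new_values[pid] = value
    return new_values

values = {pid: random.choice([0, 1]) for pid in range(N)}
phase = 0
while len(set(values.values())) > 1:  # terminates with probability 1
    values = run_phase(values)
    phase += 1
print(f"all processes hold {values[0]} after {phase} phases")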
We can make it easier to develop algorithms by relaxing the definition of asynchrony and allowing partial synchrony. Different types of asynchrony exist in a system:
- Process asynchrony: a process may sleep or be suspended for an arbitrary amount of time.
- Communication asynchrony: no upper bound on the time a message takes to reach its destination.
- Message order asynchrony: messages may be delivered in a different order than sent.
It has been shown that making processes synchronous is not by itself sufficient, but that any of the following combinations is sufficient to make a consensus protocol possible:
- an upper bound on process sleep time and on message transmission time (process + communication synchrony)
- an upper bound on process sleep time, plus ordered message delivery (process + message order synchrony)
- message order synchrony plus a broadcast capability
- communication synchrony, a broadcast capability, and send/receive atomicity
Why distributed consensus
We want to build systems that expect faults and are resilient to malicious and faulty behaviour.
With distributed systems, we don't have shared memory, only messages passed over an unreliable network.
The goal of distributed consensus is to allow a set of computers to agree on a single value that one of the nodes in the system proposed, even if some of the messages from those nodes fail, get lost, or get corrupted.
The consensus algorithm cannot invent some arbitrary value for all processes to agree on; the value must have been submitted by a process.
In distributed transactions, we need unanimous agreement on whether or not to commit. With distributed consensus, we need to get agreement on a single value.
Any algorithm relying on multiple processes maintaining a common state relies on solving the consensus problem. Where consensus is useful:
- syncing replicated state machines, and ensuring all replicas have a consistent view of state.
- electing a leader (mutual exclusion)
- distributed, fault tolerant logging with globally consistent sequencing.
- managing group membership.
- deciding to commit or abort.
Consensus is easy if we assume all processes are functioning and available to communicate. But the real world has faults (process failures or communication failures), which can cause indefinite delays and make the system unsafe to use.
SMR (State machine replication) 101
State machine. A state machine, at any point in time, stores a state of the system. It receives a set of inputs (also referred to as commands) and applies them in sequential order using a transition function, generating an output and an updated state. A succinct description of the state machine is as follows:
state = init
log = []
on receiving cmd from a client:
    log.append(cmd)
    state, output = apply(cmd, state)
    send output to the client
For example, the state in the Bitcoin ledger is the set of all public keys and their UTXOs; a cmd is a transaction, which becomes an input to the state machine. apply() verifies the tx against the state and executes the state transition.
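As a concrete toy, here is a minimal key-value state machine in the shape sketched above; the command format and all names are assumptions made up for this example.

# A deterministic transition function: (cmd, state) -> (new_state, output).
def apply(cmd, state):
    op, key, *rest = cmd
    if op == "set":
        return {**state, key: rest[0]}, "ok"
    if op == "get":
        return state, state.get(key)
    return state, "unknown command"

state = {}  # state = init
log = []    # the log of applied commands
for cmd in [("set", "x", 1), ("get", "x")]:
    log.append(cmd)
    state, output = apply(cmd, state)
    print(output)  # prints "ok", then 1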
Fault tolerant SMR
The server replicas all start with the same initial state. However, when they receive concurrent requests from clients, honest replicas first need to agree on the sequence of client commands they received. This problem is called log replication, and it is a multi-shot version of consensus.
After the sequence is agreed upon, the replicas apply the commands in the log, one by one, using the apply transition function. Assuming the transition function is deterministic, all honest server replicas maintain an identical state at all times.
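A small sketch of why determinism matters, with a trivial stand-in apply(): every replica replays the same agreed log in the same order, so the final states are guaranteed to match.

def apply(cmd, state):
    # Trivial deterministic stand-in transition function.
    op, key, val = cmd
    return {**state, key: val} if op == "set" else state

agreed_log = [("set", "x", 1), ("set", "y", 2), ("set", "x", 3)]

replicas = [{} for _ in range(3)]
for i, state in enumerate(replicas):
    for cmd in agreed_log:  # replay the agreed-upon sequence, in order
        state = apply(cmd, state)
    replicas[i] = state

assert all(r == replicas[0] for r in replicas)  # identical state everywhere
print(replicas[0])  # prints {'x': 3, 'y': 2}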
Before we understand consensus, which is multiple nodes agreeing on something, we need to understand consistency.
Consistency in data systems is at least eventual: if you stop writing and wait, reads will eventually return the same data from all nodes. Another word for this is convergence.
Inconsistency example: we submit a write, try to read what we just wrote, and it's not there.
Another word for linearizability is atomic consistency.
- If databases provide the illusion that there is only one replica, then replication lag wouldn't be an issue.
- Replication lag = when I query two different replicas, I get two different results (a toy sketch of this follows the list).
- If there were only one replica, all operations on it would be atomic. This makes things simpler for apps, which don't have to worry about replicas.
- Linearizability is a recency guarantee.
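A toy sketch of replication lag breaking the single-copy illusion; the leader/follower names and in-memory dicts are illustrative assumptions:

leader = {"x": 0}
follower = {"x": 0}  # replicated asynchronously, so it can lag behind

def write(key, val):
    leader[key] = val  # the follower only catches up later

write("x", 1)
print(leader["x"], follower["x"])  # prints "1 0": two reads disagree
# A linearizable system must behave as if there were a single copy:
# once the write completes, every subsequent read returns 1.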
Network interruptions force a choice between availability and linearizability:
- If both data centres stay up, the data won't be consistent.
- To stay consistent, one of them has to go down, because they can't connect to each other.
- If it's a single-leader database, one of the data centres contains the leader, and writes processed in data centres without it will fail since they can't reach the leader node.
- If it's a multi-leader database, both can stay up: data is copied asynchronously, so writes are queued and exchanged once the network connection is restored.
If you must have linearizability, the tradeoff is availability: during network problems some replicas have to stop serving requests. If the system keeps handling requests despite network problems, its behaviour is not linearizable, since the replicas process requests independently and no longer provide the illusion of one copy of the data with atomic operations.
- The CAP theorem is this insight: Consistency, Availability, Partition tolerance. When the network is partitioned, a system can be consistent or available, but not both.
Every CPU core has its own memory cache and store buffer, so even memory access within a single machine is not linearizable. Here linearizability is dropped for performance, not fault tolerance. In networks with highly variable delays, linearizable reads and writes can have high response times; weaker consistency models are much faster, so the trade-off matters for latency-sensitive systems.
In other words:
If latency and/or availability is the priority, choose weaker consistency.
If consistency is the priority, performance and availability may suffer.