Notes on Aphyr blog posts

These are my notes on reading Kyle’s very informative series of blog posts on Jepsen to test various distributed databases. Errors are my own.


281. Jepsen: On the perils of network partitions

What is linearizability? From Linearizability versus Serializability,

“Linearizability is a guarantee about single operations on single objects. In plain English, under linearizability, writes should appear to be instantaneous. Imprecisely, once a write completes, all later reads (where “later” is defined by wall-clock start time) should return the value of that write or the value of a later write. Once a read returns a particular value, all later reads should return that value or the value of a later write.”

Uses a cluster for 5 machines on a LXC based virtualization to simulate the network partitions using Jepsen.

salticid, a fabric(?) like remote execution tool is used to run the tests on the cluster.

How to create a partition? cause nodes n1 and n2 to drop traffic from n3, n4, and n5.

Other errors that are used to simulate network problems:

  1. partition
  2. slow
  3. flaky (probabilistic dropping of messages)

The test itself is simple. Use 5 clients each to generate a sequence of non-overlapping list of numbers. Write these numbers to the database. See how many of these were actually saved to the database. Get the difference as the “loss”.

282. Jepsen: Postgres

PG accepts connections on a single primary node, which means, if you can’t talk to that primary node, the system is unavailable. This means PG is Consistent and has Partition tolerance (CP).

“Even though the Postgres server is always consistent, the distributed system composed of the server and client together may not be consistent. It’s possible for the client and server to disagree about whether or not a transaction took place.” The 2phase commit (2PC) protocol says that we must wait for the acknowledgement message to arrive in order to decide the outcome.. If the client does not get the ack from the primary node, the system has a deadlock.

Strategies for dealing with 2PC(on the db, in this case - PG), (when we suspect network partitioning..):

a. Accept false negatives – say the write failed even if there is a small chance of success b. use data structures to allow idempotent operations. This way, when you encounter network errors, you can retry the write. A queue that has “atleast once delivery” semantics is a good place to put repeatable writes that can retried later. c. write the transaction ID to the DB or a log on a disk or atleast-once-queue. Once the network partition is resolved, retry or cancel (if the transaction had succeeded in the earlier attempt).

“A network partition–and indeed, most network errors–doesn’t mean a failure. It means the absence of information.”

283. Jepsen: Redis

Redis as often used a message queue, lock service, session store, or even as database. Redis in a single-server setup is a CP system. Redis ofers asynchronous primar -> secondary replication, which means that the primary returns the response to the writing client even before the replication to the secondary is complete.

A group of nodes is also called a component.