L8Again

Musings about distributed data infrastructure, cloud, high performance computing, and a bit of philosophy.

I am interested in high performance computing in general, and more specifically in distributed systems and cloud services (admittedly biased towards AWS). I work on the data infrastructure team of a social media company that handles user generated content (UGC) for major retailer sites.

Immutable Updates - Part 1

Immutability is the cornerstone of our internal data store, which is backed by Cassandra. I would highly recommend going over the immutability slides of Marz's presentation to get quick context on immutable updates and their immediate advantages.

In our data store, a row is essentially a sequence of immutable updates, or deltas. A delta can be an update to an existing property (or column), an addition of a new property, or even a delete of a property or of the entire row. For example, a document is stored in Cassandra in the following way, where each delta is a column of the row.

Δ1  { "rating": 4,
      "text": "I like it." }
Δ2  { .., "status": "APPROVED" }
Δ3  { .., "status": if "PENDING" then "REJECTED" end }

Now, when a document is requested, the reader (an internal reader process, not the end user) resolves the deltas above (Δ1, Δ2, Δ3) in that order and produces the following document:

{ "rating": 4,
"text": "I like it.",
"status": "APPROVED" }

Notice that Δ3 is a conditional delta that says "only change the status field if its existing value is PENDING". In our case this results in a no-op, since the status has already been changed to APPROVED. We will go into conditional deltas and their uses in Part 2 of this blog. The point to note here is that even though Δ3 has no effect on the resulting document (in other words, Δ1 + Δ2 = Δ1 + Δ2 + Δ3), it is still stored in the row and serves an important purpose in conflict resolution, as we will see soon. It is also easy to see that storing a row in this format makes it easy to watch for events and trigger applications, so they can fetch the latest state of the row and behave accordingly.
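
To make the resolution step concrete, here is a minimal sketch (Python) of how a reader might fold deltas into a document. The delta representation, the resolve_deltas name, and the use of a callable for the conditional delta are all hypothetical and for illustration only, not our actual implementation.

import uuid

def resolve_deltas(deltas):
    # Fold (timeUUID, payload) deltas into one document, oldest first.
    # Sorting by the timestamp embedded in each timeUUID makes the fold
    # independent of the order in which deltas arrived.
    document = {}
    for _, payload in sorted(deltas, key=lambda d: d[0].time):
        if callable(payload):
            # Conditional delta: inspects the current document and returns
            # the properties to change (possibly nothing).
            document.update(payload(document))
        else:
            document.update(payload)
    return document

# The three deltas from the example above; uuid1() stands in for the
# timeUUIDs assigned at write time.
d1 = (uuid.uuid1(), {"rating": 4, "text": "I like it."})
d2 = (uuid.uuid1(), {"status": "APPROVED"})
d3 = (uuid.uuid1(), lambda doc: {"status": "REJECTED"} if doc.get("status") == "PENDING" else {})

print(resolve_deltas([d1, d2, d3]))
# {'rating': 4, 'text': 'I like it.', 'status': 'APPROVED'}  -- Δ3 is a no-op

The fact that the fold sorts by timeUUID rather than by arrival order is exactly the property the next section relies on.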

Designing such a system to scale across multiple data centers in different global regions comes with its fair share of challenges.

Immutability, eventual consistency, and conflict resolution

Immutability and eventual consistency go hand in hand: even if deltas arrive out of order, as long as they are resolved in a consistent sequence, we can rest assured that the document will eventually reach a consistent state.

Let's take an example of a document life cycle that depicts an out-of-order delta scenario. Check out the topology below. The right-hand side is our data center in the EU, and on the left-hand side we have the US data center.

[Figure: Consistent State]

There are applications in each region that listen on the databus for changes to documents. The "Display" application in each region displays the current state of a document. Assume that we moderate reviews for a client based in Europe (EU), while our moderators operate in the US. The "Client Portal" is used by clients, who can also moderate their own reviews, and the "CMS" is the application our US moderators use.

Consider the time sequence below:

T1: Both regions have a consistent state of the review, where status is "PENDING"
T2: The client in the EU writes a new delta (Δ2) that sets the status to "REJECTED_CLIENT"
T3: A moderator in the US updates the document with a conditional delta (Δ3) that sets status = "APPROVED" IFF status = "PENDING"

Let's say T2 and T3 are really close, and the US doesn't see Δ2 until after it writes Δ3. This is depicted in the figure below:

[Figure: Inconsistent State]

Notice that in the absence of Δ2, the review gets approved and published in the US (not good, but it's OK, as we will see). However, when Δ2 arrives, the document reaches a consistent state and the review is placed in a rejected state. Since all deltas are ordered by timeUUID, we know that the conditional delta Δ3 fails in the presence of Δ2. So, for a brief period our document was inconsistent, but it achieved full consistency eventually.
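
To see why arrival order does not matter, here is the same kind of hypothetical resolver (same caveats as the earlier sketch) applied to this timeline. Reading the US replica before Δ2 arrives briefly shows the approved state, but once Δ2 lands, both regions resolve to the same rejected document.

import uuid

def resolve(deltas):
    # Fold deltas oldest-first by the timestamp inside each timeUUID.
    doc = {}
    for _, payload in sorted(deltas, key=lambda d: d[0].time):
        doc.update(payload(doc) if callable(payload) else payload)
    return doc

d1 = (uuid.uuid1(), {"text": "Great product", "status": "PENDING"})  # the original review
d2 = (uuid.uuid1(), {"status": "REJECTED_CLIENT"})                   # T2: client in the EU
d3 = (uuid.uuid1(),                                                  # T3: US moderator, IFF still PENDING
      lambda doc: {"status": "APPROVED"} if doc.get("status") == "PENDING" else {})

print(resolve([d1, d3]))      # US before Δ2 arrives -> APPROVED (briefly published)
print(resolve([d1, d3, d2]))  # US after Δ2 arrives  -> REJECTED_CLIENT
print(resolve([d1, d2, d3]))  # EU all along         -> REJECTED_CLIENT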

[Figure: Eventually Consistent State]

The point to note is that eventual consistency is truly achieved here with the help of immutable updates. Also, note that we never had to "modify" any of our deltas. A scenario like the one above would typically call for conflict resolution, but by virtue of immutable updates, no cross-data-center synchronization was needed, nor was any conflict resolution required. Sure, the review in question may have been published briefly, but it was then taken off the website automatically.

Compaction

What if the rows get updated too often? Even if they don't, over time they accrue a lot of updates, causing the row to become really wide. Writes will still be fine, but read latency becomes really high as the system tries to consolidate all those deltas into one document. A row can become prohibitively wide, to the point where it cannot be resolved at read time. This is where compaction helps. As the name suggests, compaction resolves several deltas and replaces them with one "compacted" delta. The next time a read happens, it only sees the compaction record, and the read latency issue is resolved.
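
As a rough sketch (hypothetical code, not our implementation), a compaction pass resolves the existing deltas once and writes the result back as a single compacted record, so subsequent reads fold one column instead of a thousand.

import uuid

def resolve(deltas):
    doc = {}
    for _, payload in sorted(deltas, key=lambda d: d[0].time):
        doc.update(payload)
    return doc

def compact(deltas):
    # The compacted record keeps the newest timeUUID it replaces, so any delta
    # written after the compaction still sorts after it.
    newest = max(deltas, key=lambda d: d[0].time)[0]
    return [(newest, resolve(deltas))]

row = [(uuid.uuid1(), {"views": i}) for i in range(1000)]  # a very wide row
assert resolve(compact(row)) == resolve(row)               # same document, one column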

Great. But there is a major challenge that comes with compaction in a multi-datacenter cluster. When is the best time to compact rows on a local node in a data center? More importantly, what if an older delta arrives after we are done compacting? If we arbitrarily decide to compact rows every five minutes, then we run the risk of losing deltas that may be in flight from a different data center.

The solution is to figure out which deltas are fully consistent on all nodes and compact only those; in other words, "figure out a time (t) in the past before which all deltas are available on all nodes". This t, or full consistency timestamp, can be calculated as described in my previous blog post. The full consistency timestamp assures us that no delta will ever arrive with a timeUUID before this timestamp, so everything before it can be compacted without any fear of data loss. Thus, even in this eventually consistent world, we can deterministically handle high update rates by employing continuous partial compactions on our rows, so they never get so wide that read latency suffers.
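
Putting the two ideas together, here is a small self-contained sketch (again with hypothetical helpers, not our actual code) of compacting only the deltas that precede the full consistency timestamp t. A straggler delta, which by definition must be newer than t, still resolves correctly afterwards.

import uuid

def resolve(deltas):
    doc = {}
    for _, payload in sorted(deltas, key=lambda d: d[0].time):
        doc.update(payload)
    return doc

def compact_before(deltas, t):
    # Compact only the prefix that is already fully consistent (older than t);
    # anything newer is left alone, because peers may still be sending deltas
    # in that window.
    deltas = sorted(deltas, key=lambda d: d[0].time)
    old = [d for d in deltas if d[0].time < t]
    new = [d for d in deltas if d[0].time >= t]
    return ([(old[-1][0], resolve(old))] if old else []) + new

d1 = (uuid.uuid1(), {"rating": 4})
d2 = (uuid.uuid1(), {"status": "APPROVED"})
t = uuid.uuid1().time                        # stand-in for the full consistency timestamp
d3 = (uuid.uuid1(), {"status": "REJECTED"})  # a delta written after t

# Compacting only what precedes t never changes the resolved document:
assert resolve(compact_before([d1, d2, d3], t)) == resolve([d1, d2, d3])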