A system is said to be fully consistent, when all nodes in a distributed system have the same state of the data. So, if record A is in State S on one node, then we know that it is in the same state in all its replicas.

Now that we are on the same page, let's answer the following questions.

“When is the last time my system was fully consistent?”

The correct answer is - maybe never. Maybe your throughput is so massive, there may be no point in time when you can say a system is fully consistent. But, the correct question to ask is maybe this:

“When is the last time before which I can be assured that all updates were fully consistent on all nodes?”

That is a slightly different question. So, while my system may not be fully consistent at this very moment, how far back do I have to go in time when all the updates are consistent? In other words, what is the full consistency lag?

To me, this is a vital metric for an AP system. In the practical world there is always a tolerance level for how long a particular row can remain inconsistent. Facebook, for example, can tolerate eventually consistent comments on a given story for only so long, until its users start questioning the wisdom of it all. So, while your buddy may not see the comment you posted instantly, he may see it 2 seconds later than you see it - which is fine. However, 10 minutes? Maybe, that's ok too, but now some people need to be alerted. So, the question arises what are your SLAs, and how can you make sure you keep your promises to your customer. Promises, such as, your system will be fully consistent within, say 5 minutes.

I was actually surprised when I couldn't find anything similar out-of-the-box. So, here is one way of calculating full consistency lag on Cassandra:

T_f = Time no hints were found on any node
rpc_timeout = Maximum timeout in cassandra that nodes will use when communicating with each other.

FCL = Full Consistency Lag

FCL = T_f - rpc_timeout

The concept of Hinted Handoffs was introduced in Amazon's dynamo paper as a way of handling failure. This is what Cassandra leverages for fault-tolerant replication. Basically, if a write is made to a replica node, which is down, then Cassandra will write a "hint" to the coordinator node, and try again in a configured amount of time.

We exploit this feature of cassandra to get us our full consistency lag. The main idea is to poll all the nodes to see if they have any pending hints for other nodes. The time when they all report zero (T_f) is when we know that there are no failed writes, and the only pending writes are those that are in flight. So, subtracting the cassandra timeout (rpc_timeout) will give us our full consistency lag.

Now, that we have our full consistency lag, this metric can be used to alert the appropriate people when the cluster is lagging too far behind.

Finally, you would want to graph this metric for monitoring.

Screen Shot 2014-02-04 at 6.15.47 AM.png

Note that in the above graph I artificially add a 5 minute lag to our rpc_timeout value. So, the ideal value should always be at 300 seconds. We, then, periodically poll for full consistency check every 300 seconds (or 5 minutes). You should tweak this value according to your needs. So, for our settings, the expected lag should remain consistent at 5 minutes, but you can see it spike at 10 minutes. All that really says is there was one time when we checked and found a few hints. The next time we checked (after 5 minutes in our case) all hints were taken care of. You can now set an alert in your system that should wake people up if this lag violates a given threshold - something that makes sense for your business.

/L8Again

Full consistency lag for eventually consistent systems

FCL = Tf - rpc_timeout

FCL = T_f - rpc_timeout