⛓ Using logs to build a solid data infrastructure

All of the difficulties related to race conditions and inconsistent data in multi-datastore infrastructures can be eliminated by leveraging a log system like Kafka. The idea is that instead of writing to multiple datastores separately, you write to a log system and your individual datastores then consume the logs to execute their actions.

“when I talk about logs here I mean something more general. I mean any kind of data structure of totally ordered records that is append-only and persistent. Any kind of append-only file.”

One caveat is that this can leave you with datastores that are not always up-to-date. You can mitigate this for data where necessary by adding a single primary database that the web service writes to. This primary database acts as a producer for the log and all other systems continue to be consumers of the log.

Article: Using logs to build a solid data infrastructure (or: why dual writes are a bad idea) – Confluent

Highlighted Excerpts

While reading the article I highlighted some passages. They are excerpted below in consecutive order and represent what I felt were key points or important takeaways from the article. Reading the excerpts (or the summary above) is not a replacement for reading the article but I hope that something in them catches your attention and encourages you to read the article.

“Logs are everywhere. I’m not talking about plain-text log files (such as syslog or log4j) – I mean an append-only, totally ordered sequence of records. It’s a very simple structure, but it’s also a bit strange at first if you’re used to normal databases. However, once you learn to think in terms of logs, many problems of making large-scale data systems reliable, scalable and maintainable suddenly become much more tractable.”

“How did we get into that state? How did we end up with such complexity, where everything is calling everything else, and nobody understands what is going on?

It’s not that any particular decision we made along the way was bad. There is no one database or tool that can do everything that our application requires – we use the best tool for the job, and for an application with a variety of features that implies using a variety of tools.”

“Simply having many different storage systems is not a problem in itself: if they were all independent from each other, it wouldn’t be a big deal. The real trouble here is that many of them end up containing the same data, or related data, but in different forms.”

“I’m not saying that this duplication of data is bad – far from it. Caching, indexing and other forms of redundant data are often essential for getting good performance on reads. However, keeping the data in sync between all these various different representations and storage systems becomes a real challenge.”

“Whenever a piece of data changes in one place, it needs to change correspondingly in all the other places where there is a copy or derivative of that data.”

“if the denormalized information is stored in a different database – for example, if you keep your emails in a database but your unread counters in Redis – then you lose the ability to tie the writes together into a single transaction. If one write succeeds, and the other fails, you’re going to have a difficult time clearing up the inconsistency.”

“with dual writes, the application has to deal with partial failure, which is difficult.”

“If you do all your writes sequentially, without any concurrency, then you have removed the potential for race conditions. Moreover, if you write down the order in which you make your writes, it becomes much easier to recover from partial failures, as I will show later.”

“when I talk about logs here I mean something more general. I mean any kind of data structure of totally ordered records that is append-only and persistent. Any kind of append-only file.”

“Even if the writes happen concurrently on the reader, the log still contains the writes in a total order. Thus, the log actually removes the concurrency from the writes – it “squeezes all the non-determinism out of the stream of writes”, and on the follower there’s no doubt about the order in which the writes happened.”

“rather than having the application write directly to the various datastores, the application only appends the data to a log (such as Kafka). All the different representations of this data – your databases, your caches, your indexes – are constructed by consuming the log in sequential order.

Each datastore that needs to be kept in sync is an independent consumer of the log. Every consumer takes the data in the log, one record at a time, and writes it to its own datastore. The log guarantees that the consumers all see the records in the same order; by applying the writes in the same order, the problem of race conditions is gone. This looks very much like the database replication we saw earlier!”

“Going via a database also means that you don’t need to trust the log as your system of record (which may be a scary prospect if it’s implemented with a new technology) – if you have an existing database that you know and like, and you can extract a change log from that database, you can still get all the advantages of a log-oriented architecture.”

“if you want to build a new derived datastore, you can just start a new consumer at the beginning of the log, and churn through the history of the log, applying all the writes to your datastore. When you reach the end, you’ve got a new view onto your dataset, and you can keep it up-to-date by simply continuing to consume the log!”