Aug 5, 2012

Is Synchronous Data Replication over WAN Really a Viable Strategy?

Synchronous data replication over long distances has the sort of seductive appeal that often characterizes bad ideas.  Why wouldn't you want every local credit card transaction simultaneously stored on the other side of the planet, far away from earthquakes, storms, and human foolishness?  The answer is simple: conventional SQL applications interact poorly with synchronous replication over wide area networks (WANs).

I spent a couple of years down the synchronous replication rabbit hole in an earlier Continuent product.  It was one of those experiences that make you a sadder but wiser person.  This article digs into some of the problems with synchronous replication and shows why another approach, asynchronous multi-master replication, is currently a better way to manage databases connected by long-haul networks.

Synchronous Replication between Sites

The most obvious problem with any form of synchronous replication is the hit on application performance.  Every commit requires a round trip to transfer the transaction and acknowledge receipt at the remote site, which in turn reduces single-thread transaction rates to roughly the number of pings per second between sites.  As a nice article by Aaron Brown demonstrates, you can show the effect easily using MySQL semi-synchronous replication between hosts in different Amazon regions.  With 85 milliseconds of round-trip latency between hosts, Aaron's experiment measured 11.5 transactions per second, about 100 times less than single-thread performance on a local master.  At that latency you would theoretically expect throughput of ~11.7 transactions per second (1000 / 85 = 11.7), so the agreement between practice and theory is very close.  It's great when science works out like this.
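
For reference, here is a minimal sketch of how such a semi-synchronous setup might be configured on a MySQL 5.5-era server, assuming the semisync plugins ship with the server and master/slave replication is already running; the 10-second timeout is just an illustrative value.

  # On the master: load the semi-sync plugin and enable it.
  mysql -u root -e "INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so'"
  mysql -u root -e "SET GLOBAL rpl_semi_sync_master_enabled = 1"
  # Fall back to asynchronous replication if no slave acknowledges within 10s.
  mysql -u root -e "SET GLOBAL rpl_semi_sync_master_timeout = 10000"

  # On the remote slave: load the slave-side plugin and restart the I/O thread.
  mysql -u root -e "INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so'"
  mysql -u root -e "SET GLOBAL rpl_semi_sync_slave_enabled = 1"
  mysql -u root -e "STOP SLAVE IO_THREAD; START SLAVE IO_THREAD"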

You might argue that applications could tolerate the slow rate assuming it were at least constant.  Sadly that's not the case for real systems.  Network response varies enormously between sites in ways that are quite easy to demonstrate.

To illustrate variability I set up an Amazon m1.small instance in the us-east-1 region (Virginia) and ran 24 hours of ping tests to instances in us-west-2 (Oregon) and ap-southeast-1 (Singapore).  As the following graph shows, during a 4 hour period ping times to Singapore remain within a band but vary by up to 10%.  Ping times to Oregon, on the other hand, hover around 100ms but spike randomly to almost 200ms.  During these spikes, synchronous replication throughput would be cut by 50%, to approximately 5 transactions per second.

Amazon ping times from us-east-1 to us-west-2 and ap-southeast-1 (240 minute interval)
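
The measurement itself needs nothing fancy; a loop along the following lines, run from the us-east-1 instance, is enough to collect the data.  The hostnames are placeholders and the one-minute interval is an arbitrary choice.

  # Ping each remote region once a minute and log timestamped round-trip times.
  while true; do
    for host in us-west-2-host ap-southeast-1-host; do
      rtt=$(ping -c 1 $host | awk -F'time=' '/time=/{print $2}')
      echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) $host $rtt" >> ping-times.log
    done
    sleep 60
  done
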
Moreover, it's not just a question of network traffic.  Remote VMs also become busy, which slows their response.  To demonstrate, I ran two-minute sysbench CPU tests to saturate processing on the us-west-2 instance while observing ping times.  Here is the command:

sysbench --test=cpu --num-threads=10 --max-time=120 --cpu-max-prime=1000000 run

As the next graph illustrates, CPU load has an unpredictable but substantial effect on ping times.  As it happens, the ping variation in the previous graph may be due to resource contention on the underlying physical host.  (Or it might really be network traffic--you never really know with Amazon.)

Effect of sysbench runs on ping times to US-West (20 minute interval)

Slow response on resource-bound systems is familiar to anyone with distributed systems experience, including systems where everything is on a single LAN.  You cannot even count on clock ticks being delivered accurately within various types of virtual machines.  These timing delays are magnified over WANs, which already have high latency to begin with.  Between busy hosts and network latency, it's reasonable to expect that at some point most systems would at least briefly experience single-session transaction rates of 1 transaction per second or less.  Even with parallelized replication you would see substantial backups on the originating DBMS servers as commits begin to freeze.

To complete the tale of woe, failures of various kinds can cause remote hosts to stop responding at all for periods that vary from seconds to days.  Amazon is generally quite reliable but had two outages in the Virginia data center in June 2012 alone that brought down applications for hours to days.  If you replicate synchronously to a host affected by such an outage, your application just stops and you no longer store transactions at all, let alone securely.  You need to turn off synchronous replication completely to stay available.

So is synchronous replication really impossible between sites?  Given the problems I just described it would be silly to set up MySQL semi-synchronous replication over a WAN for a real application.  However, there are other ways to implement synchronous replication.  Let's look at two of them.

First, there is Galera, which uses a distributed protocol called certification-based replication to agree on commit order across all cluster nodes, combined with parallel execution of non-conflicting transactions.  Certification-based replication is a great algorithm in many ways, but Galera comes with some important practical limitations.  First, it replicates rows rather than statements.  The row approach handles large transactions poorly, especially over distances, due to the large size of change sets.  Also, not all workloads parallelize well, since transactions that conflict in any way must be fully serialized.  Overall DBMS throughput may therefore drop to the single-session throughput discussed above at unexpected times due to variations in workload.  Finally, full multi-master mode between sites (as opposed to master/slave) is likely to be very problematic as nodes drop out of the cluster due to transient communication failures and require expensive reprovisioning.  This is a general problem with group communications, which Galera depends on to order transactions.
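
If you do experiment with Galera across sites, the wsrep status counters give a rough picture of where throughput is going.  The checks below use standard Galera status variables; the interpretations in the comments are my own shorthand.

  # Fraction of time replication was paused by flow control since the last
  # status reset; values approaching 1.0 mean the cluster is throttling commits.
  mysql -u root -e "SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused'"

  # Transactions that failed certification, i.e. conflicted and were rolled back.
  mysql -u root -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_cert_failures'"

  # Current cluster membership; a drop here usually means a node fell out of
  # the cluster and will need a state transfer to rejoin.
  mysql -u root -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'"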

Second, there are theoretical approaches that claim many of the benefits of synchronous replication without killing throughput or availability.  One example is the Calvin system developed by Daniel Abadi and others, which seeks to achieve both strong transaction consistency and high throughput when operating across sites.  The secret sauce in Calvin is that it radically changes the programming model: applications replicate what amount to transaction requests, while actual transaction processing runs under control of the Calvin transaction manager, which determines transaction order in advance across nodes.  That should at least in principle reduce some of the unpredictability you may see in systems like Galera that do not constrain transaction logic.  Unfortunately it also means a major rewrite for most existing applications.  Calvin is also quite speculative.  It will be some time before this approach is available for production systems and we can see whether it is widely applicable.

There's absolutely a place for synchronous replication in LANs, but given the current state of the art it's hard to see how most applications can use it effectively to link DBMS servers over WAN links.  In fact, the main issue with synchronous replication is the unpredictability it introduces into applications that must work with slow and unreliable networks.  This is one of the biggest lessons I have learned at Continuent.

The Alternative:  Asynchronous Multi-Master Replication

So what are the alternatives?  If you need to build applications that are available 24x7 with high throughput and rarely, if ever, lose data, you should consider high-speed local clusters linked by asynchronous multi-master replication between sites.  Here is a typical architecture, which is incidentally a standard design pattern for Tungsten.

Local clusters linked by asynchronous, multi-master replication

The big contrast between synchronous and asynchronous replication between sites is that, while both have downsides, you can minimize asynchronous multi-master problems using techniques that work now.  Let's look at how async multi-master meets the requirements and at possible optimizations.
  1. Performance. Asynchronous replication solves the WAN performance problem as completely as possible.  To the extent that you use synchronous or near-synchronous replication technologies, it is on local area networks, which are extremely fast and reliable, so application blocking is minimal.  Meanwhile, long-haul replication can be improved by compression as well as parallelization, because WANs offer good bandwidth even if end-to-end latency is high.
  2. Data loss.  Speedy local replication, including synchronous and "near synchronous" methods, minimizes data loss due to storage failures and configuration errors.  Somewhat surprisingly you do not need fully synchronous replication for most systems even at the local level--that's a topic for a future blog article--but replication does need to be quite fast to ensure local replicas are up-to-date.  Actually, one of the big issues for avoiding local data loss is to configure systems carefully (think sync_binlog=1 for MySQL, for example; see the sketch just after this list).
  3. Availability.  Async multi-master systems have the delightful property that if anything interrupts transaction flow between sites, replication just stops and then resumes when the problem is corrected.  There's no failover and no moving parts.  This is a major strength of the multi-master model.
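
Here is the durability sketch referenced in item 2: two MySQL settings that matter a great deal for not losing locally committed transactions, shown at their conservative values.  Whether you can afford them depends on your write load and storage.

  # Sync the binlog on every commit so a crashed master does not lose
  # transactions that replicas have not yet fetched.
  mysql -u root -e "SET GLOBAL sync_binlog = 1"

  # Flush and sync the InnoDB log on every commit for full local durability.
  mysql -u root -e "SET GLOBAL innodb_flush_log_at_trx_commit = 1"
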
So what are the downsides?  Nothing comes for free, and there are at least two obvious issues.
  1. Applicability.  Not every application is compatible with asynchronous multi-master.  You will need to do work on most existing applications to implement multi-master and ensure you get it right; one common MySQL adjustment is sketched just after this list.  I touched on some of the MySQL issues in an earlier article.  If multi-master is not a possibility, you may need the other main approach to cross-site replication: a system-of-record design where applications update data on a single active site at any given time, with other sites present for disaster recovery.  (Tungsten also does this quite well, I should add.)
  2. Data access.  While you might not lose data, it's quite likely you will not be able to access it for a while.  It's rare to lose a site completely but not uncommon for sites to be inaccessible for hours to days.  The nice thing is that with a properly constructed multi-master application you will at least know that the data will materialize on all sites once the problem is solved.  Meanwhile, relax and work on something else until the unavailable site comes back.
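
To give a flavor of the application-level work mentioned in item 1, a common first step for MySQL multi-master is keeping auto-increment keys from colliding across sites.  A sketch for a two-site deployment might look like this; the increment and offsets are illustrative and would normally go in my.cnf so they survive restarts.

  # Site A: generate odd auto-increment values (1, 3, 5, ...).
  mysql -u root -e "SET GLOBAL auto_increment_increment = 2; SET GLOBAL auto_increment_offset = 1"

  # Site B: generate even auto-increment values (2, 4, 6, ...).
  mysql -u root -e "SET GLOBAL auto_increment_increment = 2; SET GLOBAL auto_increment_offset = 2"
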
In the MySQL community, local clusters linked by asynchronous multi-master replication are an increasingly common architecture for credit card payment gateways, the use case I mentioned at the beginning of this article.  This is a telling point in favor of asynchronous cross-site replication, as credit card processors have a low tolerance for lost data.

Also, a great deal of current innovation in distributed data management is headed in the direction of asynchronous mechanisms.  NoSQL systems (such as Cassandra) tend to use asynchronous replication between sites.  There is interesting research afoot, for example in Joe Hellerstein's group at UC Berkeley, to make asynchronous replication more efficient by accurately inferring cases where no synchronization is necessary.  Like other research, this work is quite speculative, but the foundations are in use in operational systems today.  

For now the challenge is to make the same mechanisms that NoSQL systems have jumped on work equally well for relational databases like MySQL.  We have been working on this problem for the last couple of years at Continuent.  I am confident we are well on the way to solutions that are as good as the best NoSQL offerings for distributed data management.