Aug 30, 2011

Practical Multi-Master Replication using Shard Filters

Earlier this month I published an article on this blog describing the system of record approach to multi-master replication.  As mentioned in that article my colleagues and I at Continuent have been working on improving Tungsten to make system of record design patterns easier to implement.  This article describes how to set up system of record using Tungsten Replicator shard filters, which are a new feature in Tungsten 2.0.4.  By doing so we will create a multi-master configuration that avoids replication loops and transaction conflicts.  On top of that, it is quite easy to set up.

There are many possible system of record patterns depending on how many schemas are shared and across how many masters.  The following diagram shows three of them.  In contrast to many so-called MySQL multi-master implementations, all masters are live and accept updates.  (By contrast, schemes such as MySQL-MHA make extra masters read-only.  Don't be fooled!)  

For today's exercise we will implement the basic system of record.  Once you understand this you can quickly set up other multi-master scenarios.  

Defining Shard Master Locations

The first step is to tell Tungsten where each shard is mastered.  By mastered we mean it is the one master that receives application updates on that shard, whereas all other masters have copies only or may not even contain the shard at all.  Tungsten uses a variant of CSV (comma-separated format) where the first line contains column names.  You can have any amount of whitespace between entries.  Create a file called shards.map with your favorite editor and type in the following lines. 

shard_id master critical
tungsten_nyc nyc false
tungsten_sjc sjc false
acme         sjc     false
pinnacle        nyc     false

The first column is the name of the shard.  This must be unique--because a shard can only live on one master.  The next column is the "home" master for the shard.  This is the one and only master that should receive shard updates.  The third column defines whether the shard is critical and requires full serialization.  It will be linked to parallel replication in a later release.  

It turns out you do not need to add entries for Tungsten catalog schemas such as tungsten_nyc.  Tungsten Replicator will create them automatically.  They are shown here for completeness only.  

Creating Replication Services

Next we need to define services to replicate bi-directionally between DBMS servers and set options to filter shards using the ShardFilter class, which is new in Tungsten 2.0.4.  The shard filter helps ensure that shards replicate from their home masters only and not from other locations.   If you do not know what replication services are, you can find a description of them in this article.  

Multi-master replication is easy to mis-configure, so to prevent accidents we will tell the shard filter to generate an error any time it processes a shard it has never seen before.  The replication service will immediately fail, which signals that we have to update shard definitions.  This is the safest way to implement system of record or any multi-master configuration for that matter.  It is generally easier to restart replication after correcting the configuration than to mix up data, which can lead to major outages.  

The first step is to set replication services for each master.  These read the binlog and make transactions available to slave replication services.  Here are the commands.   Note that the sjc master is on host logos1, while the nyc master is on logos2.  The remaining examples use these names consistently. 

# Define common master settings. 
COMMON_MASTER_OPTS="--datasource-user=tungsten --datasource-password=secret \
 --home-directory=/opt/continuent --svc-parallelization-type=disk \
 --svc-extractor-filters=shardfilter \
 --property=replicator.filter.shardfilter.unknownShardPolicy=error"

# Set up sjc master. 
tools/tungsten-installer --master-slave -a --master-host=logos1 \ 
--cluster-hosts=logos1 --service-name=sjc $COMMON_MASTER_OPTS --start-and-report 

# Set up nyc master. 
tools/tungsten-installer --master-slave -a --master-host=logos2 \
--cluster-hosts=logos2 --service-name=nyc $COMMON_MASTER_OPTS --start-and-report 

The --svc-extractor-filters option adds shard filtering immediately after event extraction.  The unknownShardPolicy=error setting will cause the masters to die if they process an undefined shard.  

Now we can define the remote slave services for sjc and nyc.  These are special slaves that write transactions onto another master as opposed to a normal slave.  We would like slave services to error out on unknown shards as well.  Also (and this is important) we want them to enforce shard homes.  Here are the commands to create the services and start each one.  

COMMON_SLAVE_OPTS="--release-directory=/opt/continuent/tungsten \
--service-type=remote --allow-bidi-unsafe=true --svc-parallelization-type=disk \
--svc-applier-filters=shardfilter \
--property=replicator.filter.shardfilter.unknownShardPolicy=error \
--property=replicator.filter.shardfilter.enforceHome=true"

# Set up sjc remote slave. 
tools/configure-service -C -a --host=logos2 \
--local-service-name=nyc --role=slave \
--datasource=logos2 --master-host=logos1 $COMMON_SLAVE_OPTS sjc
$trepctl -host logos2 -service sjc start

# Set up nyc remote slave. 
tools/configure-service -C -a --host=logos1 \
--local-service-name=sjc --role=slave \
--datasource=logos1 --master-host=logos2 $COMMON_SLAVE_OPTS nyc
trepctl -host logos1 -service nyc start

The --svc-applier-filters option adds shard filtering before applying to the DBMS.  The unknownShardPolicy=error setting will cause the slaves to die if they process an undefined shard.  Finally, the enforceHome=true option means that each slave will drop any transaction that lives on a different service from that slave's master.  

At the end of this procedure, your services should be online and read to run.  Use 'trepctl services' to make sure.  

Loading Shard Definitions

To make shard definitions take effect, you must load the shard.map contents into each replication service.  You can do this any time the replicator is running but after loading new definitions you must put the replicator online again.  Here are the commands to load the shard maps onto each of the four replication services.   For each replication service, you must delete the old definitions, reload new ones, and get the replicator to go online again. 

#!/bin/bash
MAP=shard.map
for host in logos1 logos2
do
  for service in sjc nyc
  do
    trepctl -host $host -service $service shard -deleteAll
    trepctl -host $host -service $service shard -insert < $MAP
    trepctl -host $host -service $service offline
    trepctl -host $host -service $service wait -state OFFLINE
    trepctl -host $host -service $service online
  done
done

This looks a little clunky and will be reduced to a single command instead of five in a later release.  I put it in a script to make it quicker to run.  The good news is that there is just one shard map that works for all replication services, regardless of location or role.  

Once you finish this step, you can go to any replication service and list the shards it knows about.  Let's pick a service and demonstrate: 

$ trepctl -host logos1 -service sjc shard -list
shard_id master critical
tungsten_nyc nyc false
tungsten_sjc sjc false
acme sjc false
pinnacle nyc false

With this we are ready to start processing some transactions. 

Multi-Master Operation

At this point we have multi-master replication enabled between hosts logos1 and logos2.  You can try it out.  Let's add the acme database to the sjc master on logos1 as an example. 

mysql -utungsten -psecret -hlogos1 
mysql> create database acme;
mysql> use acme
mysql> create table foo (id int);
mysql> insert into foo values(1);

We can see that all of these commands replicate over to the logos2 server quite easily with the following command: 

mysql -utungsten -psecret -hlogos2 -e 'select * from acme.foo'
+------+
| id   |
+------+
|    1 | 
+------+

That seems pretty simple.  In fact it is.  You can go over to logos2 and enter transactions for pinnacle in the same way.  Data replicate back and forth.  There are no replication loops.  There are also no conflicts.  

Adding a New Shard

So what happens when we add a new shard?  The simplest way to see is to create a database using a schema name that does not exist in the shard map.   Let's try to create a database named superior on the nyc master.  

mysql -utungsten -psecret -hlogos2 -e 'create database superior'

Now check the status of the nyc master replication service.  We see it has failed with an error due to the unknown shard.   (Tungsten parses the create database command and assigns it the shard ID "superior.") 

$ trepctl -host logos2 -service nyc status
Processing status command...
NAME                     VALUE
----                     -----
...
pendingError           : Stage task failed: binlog-to-q
pendingErrorCode       : NONE
pendingErrorEventId    : mysql-bin.000157:0000000000002475;1287
pendingErrorSeqno      : 8
pendingExceptionMessage: Rejected event from unknown shard: seqno=8 shard ID=superior
...
state                  : OFFLINE:ERROR...
Finished status command...

This problem is quite easy to fix.  We just open up the shard.map file and add a row for superior so that the file contents look like the following: 

shard_id master critical
tungsten_nyc nyc false
tungsten_sjc sjc false
acme         sjc     false
pinnacle        nyc     false
superior        nyc     false

Reload the shard.map file as shown previously and you will be back in business.  Incidentally, if you do not want the superior database to be replicated to other masters, you can also specify this in the rules.  Just give superior the special master name #LOCAL as in the following example and it will not replicate outside the nyc service. 

superior        #LOCAL  false

In fact, #LOCAL means that any schema named superior will not replicate outside the service in which it is defined.  You can have an unshared schema named superior on every master.  

Where to Next?

The shard support described in this article is now part of Tungsten 2.0.4 and will appear in the official build when it is finally ready.  You can try it out right now using one of our handy nightly builds.  

We plan to build out shard filtering quite a bit from the current base.  One immediate fix is to put in a check so that if an application commits shard updates on the wrong DBMS server, the master replication service on that server will detect it and fail.  This will tell you there's a problem immediately rather than letting you wallow in blissful ignorance while your data become hopelessly mixed up.  We will also simplify the commands to update shards while replicators are online. 

Longer term we will be adding features to propagate shard definitions through replication itself.  Stay tuned for more work in this area.  If you want to help fund work to enable your own applications, please get in contact with me at Continent.  I can think of at least a dozen ways to make our multi-master support better but it's always nicer to spend the effort on features that enable real systems.  In the meantime, I hope you find multi-master with shard filtering useful and look forward to your feedback.  

Aug 24, 2011

First the Blog, now the Webinar: Adding Parallel Replication to MySQL in a Hurry

My recent post on setting up Tungsten parallel replication in a hurry got a lot of hits, though to be fair it was probably not the great writing but the fact that at least one popular MySQL blog posted a link to it.  (Thanks, we like you guys too.)  Anyway, I would like to invite anybody who is interested in parallel replication to attend a webinar on Thursday September 1st at 10am PDT to cover installing and using Tungsten.  It's straight-up technical talk to help you start quickly. 

Bringing up Tungsten on an existing MySQL slave only takes a few minutes, so once we have that out of the way I will explain how Tungsten works inside and show you some of the tricks for getting your applications to play nice with parallel replication as well as how to tune performance.  The idea is to minimize fluffy architectural stuff and maximize lab demos that help you bend replication to your will.  The talk will also cover how to get help, log bugs, and even add your own code.  Plus there will be lots of time for questions.

As most readers of this blog know, Tungsten Replicator is open source (GPL V2) and hosted on code.google.com.  If you miss the webinar you may be able to catch up on parallel replication in-person in London toward the end of October.  I just submitted a talk to the next Percona Live and hope it gets accepted.  If so, see you there!

Aug 19, 2011

The System of Record Approach to Multi-Master Database Applications

Multi-master database systems that span sites are an increasingly common requirement in business applications.  Yet the way such applications work in practice is not quite what you would think from accounts of NoSQL systems like Cassandra or SQL-based systems like Oracle RAC.  In this article I would like to introduce a versatile design pattern for multi-master SQL applications in which individual schemas are updated in a single location only but may have many copies elsewhere both locally as well as on other sites.  This pattern is known as a system of record architecture.  You can build it with off-the-shelf MySQL and master/slave replication.

Let's start by picking a representative software-as-a-service (SaaS) application:  call center automation.    Call center software integrates with a local PBX or VOIP to allow agents to answer and make phone calls for telemarketing campaigns in a systematic and automated way using standard procedures known as "agent scripts." Admins set up agent scripts and define lists of people to call as well as marketing campaigns.  Finally and perhaps most importantly, managers receive a wide variety of detailed reports that allow them to optimize current work, examine historical performance, and make predictions about the future for planning purposes.  Here is a typical application architecture.
Figure 1:   Call Center Application Architecture
Bear in mind that this is a greatly simplified view.  Like most business applications, call center automation systems may contain hundreds of database tables and many types of user services.  There are also practical complications that go beyond the application itself.    Call center automation is vital to the businesses that use it.  Customers want assurance they can continue processing on another site if a SaaS vendor site goes dark.  This means we have to think about maintaining applications and data on multiple sites.

The ideal solution for most SaaS vendors would be to have call center data and applications for all customers live on multiple sites at all times.  Multiple live sites mean that failover is instantaneous since both applications and database servers are already up and running.  Constant update means there is little or no data loss on failure. Customers could connect to the nearest site.  Here is a picture of that dream that includes two sites and two customers, Acme Inc. and Pinnacle, Ltd.

Figure 2:  Dream Architecture for Call Center Automation

This solution has only one problem.  It is impossible to build.  Readers may nod wisely and say this is because of CAP Theorem limitations, but that somewhat misses the point.  Let's say we use a NoSQL DBMS like Cassandra that permits updates in multiple locations and reconciles the data using eventual consistency.   However, there's a catch:  as we saw above, much of the value of call center automation is in operational queries and reports.  That drives you back to an RDBMS with cross-table joins, aggregation functions, referential integrity, and convenient SQL-based report writing tools.  For this reason alone, Cassandra is a non-starter for call center automation.

What about a SQL DBMS?  MySQL obviously has all the features you need for query-intensive solutions on smallish data sets (e.g. hundreds of millions of rows, not many billions or trillions).  The problem is multi-master replication.   Updating the same table from two or more places on a LAN is already quite difficult:  witness the complexity of Oracle RAC or MySQL Cluster.  The problem becomes intractable when you combine complex SQL transactions, referential integrity, and high-latency WAN connections.  If you want full SQL semantics you cannot have updates on multiple sites.   This is a serious dilemma and not just for call center automation.  The same problems or worse affect a multitude of valuable business applications including market automation, credit card processing, customer relationship management (CRM), time/expense tracking, accounting, and many others.

Fortunately we are not really stuck.  If we give up some requirements customers do not really want anyway, there is a perfectly good solution that will work for a wide range of problems.  Data warehousing architects long ago developed the notion of a system of record.  Bill Inmon's classic Building the Data Warehouse defined system of record as follows:
The definitive and singular source of operational data.  If data element abc has a value of 25 in a database record but a value of 45 in the system of record, by definition the first value is incorrect and must be reconciled.  
System of record applies to multi-master systems in the form of a simple rule.  We just assert that every customer has master data in one and only one location and copies everywhere else.  When particular customers update information they do so on their own master.  Customers can have masters on different hosts or sites, but the system of record rule says that no customer has one in two places.  This eliminates conflicts between masters, and multi-master replication now works without a lot of difficulty.

System of record thus meets the original requirement of having data on multiple sites, which was to handle a site failure.  We can store data economically using off-the-shelf MySQL.  We can update copies within and across sites using master/slave replication.  We can shard customer data into independent schemas. The result looks like the following.  Acme has a master in San Jose, whereas Pinnacle has its master in New York.

System of Record Architecture for Call Center Automation
Using the system of record approach simplifies other problems as well.  Standard backup and restore techniques still work.  If you mess up a customer copy, you re-provision from the master shard.  You can implement failover across sites and also fail over locally onto slave copies, which can be complete copies containing data for all customers.

Meanwhile, most users are fine with a single site.  Pinnacle is close to New York, which is why the SaaS vendor puts Pinnacle's data there and gives them the New York site DNS for login.   It is also possible to run reports on the cross-site copies as well.  You can even run full applications provided you forward writes to the system of record, as shown above for Acme.

The real issue in implementing system of record architectures is that existing replication and clustering tools are not quite up to the job of handling cross-site applications build on system of record.  We are extending Tungsten to handle some of the obvious problems in building these types of systems using MySQL.
  1. Locating the customer master and connecting applications to it. 
  2. Moving the customer master from one location to another.  This happens more often than you would think, for example to minimize multi-master replication which can introduce problems beyond conflicts. 
  3. Detecting accidental updates to copies and preventing them from either reaching the DBMS and/or preventing them from propagating to other locations.  
  4. Proving a clean failover model that works on both cross-site as well as local copies of data.  
  5. Recovering corrupted copies of customer data from the master.  
I will discuss two of the upcoming Tungsten features in follow-up articles.  The first is assigning a shard master in the Tungsten Replicator using the new shard API.  The shard API enables multi-master but enforces system of record constraints to avoid messing up data should you accidentally update in the wrong location.  The second feature is cross-site management and connectivity using Tungsten Enterprise.  This handles failovers within and between sites and automatically connects applications to the active master regardless of which site or DBMS it lives on.

The need for availability is pushing an increasing number of SaaS vendors and other application providers to operate systems across multiple sites.  Applications like call center automation depend on the features of SQL and cannot be implemented using NoSQL DBMS's like Cassandra.  The system of record architecture eliminates replication conflicts and enables multi-master updates to work on ordinary SQL databases between sites.  If you are building complex SQL applications and thinking about going multi-site, this design pattern should be in your toolbox.  

Aug 18, 2011

So Where's the Fall MySQL Community Conference?

Last week Percona announced plans to sponsor the Percona MySQL Conference in Santa Clara in April 2012.  It is meant to replace the O'Reilly conferences of previous years.  The announcement led to some reasonable questions, for example from Giuseppe Maxia.  These and other online posts initiated a thoughtful exchange of views about the pros and cons of Percona's conference announcement by various members of the MySQL community.

Not everyone agrees with what Percona is doing.  However, you have to give Peter, Baron, and others at Percona credit for taking the risk to organize a replacement conference.  It's a big financial commitment to rent space in Santa Clara.  The conference will be a huge amount of work, much of it quite thankless.  Even so I imagine it might be tempting for some people to try to organize an anti-conference just to spite Percona, much as Percona did at the MySQL Conference in 2009.  This would be a mistake.

Here is a better suggestion.  Why not put the energy into organizing a Fall 2012 MySQL Community Conference in Europe?   Make it somewhere pleasant like Barcelona or Verona and everybody will want to go.  Two solid conferences in different locations at opposite ends of the calendar would benefit the MySQL community at all levels and make it possible for more people to attend.  For those of you who want to grow the community, it's time to stand up and be counted.  Let's get another conference off the ground.

Aug 13, 2011

Adding Parallel Replication to MySQL in a Hurry

A previous article on this blog described Tungsten parallel replication using on-disk queues.  On-disk queues are now more or less finished, and I just closed the covering issue for the feature.  The work is bug fixing and performance testing from here on out.  Speaking of performance, that looks fairly good.   A recent on-site test using production workloads showed 3.3X improvement over native MySQL replication while holding resources like memory down to much more reasonable levels than in-memory queues.  We have further optimizations on the way, so this should improve.

Now that parallel replication is working a lot better, what is it good for?  Here is a good start:  assuming your workload is suitable for shard-based replication, Tungsten offers a nice replacement for MySQL slave replication that will give you immediate parallel replication for any MySQL server starting with version 5.0 and up.  We call this native slave takeover.  It looks like the following diagram.

You can migrate MySQL slaves to Tungsten very easily thanks to some great work on the Tungsten installer by Jeff Mace that Giuseppe Maxia also praised in a recent article.  Seeing that Giuseppe is also the Continuent QA director, I guess this means it passed.  :)   Here is how to set up.

First, configure MySQL master/slave replication.  Here are typical commands after you synchronize the slave with a backup from the master.  For these examples, we assume the slave is running on host mercury, and the master is on host saturn.

mysql -utungsten -psecret -hmercury -e "CHANGE MASTER TO \
  MASTER_HOST='saturn', \
  MASTER_USER='repl', \
  MASTER_PASSWORD='s3cr3t', \
  MASTER_LOG_FILE='mysql-bin.000338', \
  MASTER_LOG_POS=4534"
mysql -utungsten -psecret -hmercury -e "START SLAVE"
mysql -utungsten -psecret -hmercury -e "SHOW SLAVE STATUS\G"

Next, download and unpack Tungsten from the latest dev build.  Here are sample commands for build 231.  

wget --no-check-certificate https://s3.amazonaws.com/files.continuent.com/builds/nightly/tungsten-2.0-snapshots/tungsten-replicator-2.0.4-231.tar.gz
tar -xvzf tungsten-replicator-2.0.4-231.tar.gz
cd tungsten-replicator-2.0.4-231

Finally, run the installation using the following handy command.  This will copy the download into its proper location (here /opt/tungsten) and start Tungsten.  Tungsten will stop MySQL replication if it is running and start Tungsten replication from the exact point where native replication left off.
tools/tungsten-installer \
  --direct  \
  --native-slave-takeover \
  --master-host=saturn  \
  --master-user=repl  \
  --master-password=s3cr3t  \
  --slave-host=mercury \
  --slave-user=tungsten  \
  --slave-password=secret  \
  --service-name=takeover \
  --home-directory=/opt/continuent \
  --svc-parallelization-type=disk \
  --channels=10 \
  --start-and-report
If there is a problem with your pre-requisites the installer will print out a message and stop.  Most of these messages should be pretty self-explanatory.   (In some builds there's a problem with OpenJDK failing the install check.  Use --no-validation to get around that.) 

Here is another interesting tip.  If you try Tungsten and want to go back to MySQL replication, you can flip back to MySQL native replication with two commands.  Just take Tungsten offline cleanly, which shuts down all the replicator channels at the same point, and start MySQL replication again: 

trepctl offline
mysql -urepl -ps3cr3t -e 'START SLAVE'

We are currently doing a lot of performance profiling and testing of parallel replication.  I hope to post more about performance results in a future article.  Meanwhile, there are 36 new features (including on-disk queues) and bug fixes that will roll into a final Tungsten 2.0.4 in the next week or so.   Try out the latest builds and see what you think.