Apr 30, 2012

If You *Must* Deploy Multi-Master Replication, Read This First

An increasing number of organizations run applications that depend on MySQL multi-master replication between remote sites.   I have worked on several such implementations recently.  This article summarizes the lessons from those experiences that seem most useful when deploying multi-master on existing as well as new applications.

Let's start by defining terms.  Multi-master replication means that applications update the same tables on different masters, and the changes replicate automatically between those masters.  Remote sites mean that the masters are separated by a wide area network (WAN), which implies high average network latency of 100ms or more.  WAN network latency is also characterized by a long tail, ranging from seconds due to congestion to hours or even days if a ship runs over the wrong undersea cable.

With the definitions in mind we can proceed to the lessons.  The list is not exhaustive but includes a few insights that may not be obvious if you are new to multi-master topologies.  Also, I have omitted issues like monitoring replication, using InnoDB to make slaves crash-safe, or provisioning new nodes.  If you use master/slave replication, you are likely familiar with these topics already.

1. Use the Right Replication Technology and Configure It Properly

The best overall tool for MySQL multi-master replication between sites is Tungsten.  The main reason for this assertion is that Tungsten uses a flexible, asynchronous, point-to-point, master/slave replication model that handles a wide variety of topologies such as star replication or all-to-all.  Even so, you have to configure Tungsten properly.  The following topology is currently my favorite:
  • All-to-all topology.  Each master replicates directly to every other master.  This handles prolonged network outages or replication failures well, because one or more masters can drop out without breaking  replication between the remaining masters or requiring reconfiguration.  When the broken master(s) return, replication just resumes on all sides.  All-to-all does not work well if you have a large number of masters.  
  • Updates are not logged on slaves.  This keeps master binlogs simple, which is helpful for debugging, and eliminates the possibility of loops.  It also requires some extra configuration if the masters have their own slaves, as would be the case in a Tungsten Enterprise cluster.
There are many ways to set up multi-master replication, and the right choice varies according to the number of masters, whether you have local clustering, and other considerations.  Giuseppe Maxia has described many topologies, for example here, and the Tungsten Cookbook has even more details.

One approach you should treat with special caution is MySQL circular replication.  In topologies of three or more nodes, circular replication results in broken systems if one of the masters fails.  Also, you should be wary of any kind of synchronous multi-master replication across sites that are separated by more than 50 kilometers (i.e. 1-2ms latency).  Synchronous replication makes a siren-like promise of consistency, but the price you pay is slow performance under normal conditions and broken replication when WAN links go down.

2. Use Row-Based Replication to Avoid Data Drift

Replication depends on deterministic updates--a transaction that changes 10 rows on the original master should change exactly the same rows when it executes against a replica.  Unfortunately many SQL statements that are deterministic in master/slave replication are non-deterministic in multi-master topologies.  Consider the following example, which gives a 10% raise to employees in department #35.

   UPDATE emp SET salary = salary * 1.1 WHERE dep_id = 35;

If all masters add employees, then the number of employees who actually get the raise will vary depending on whether such additions have replicated to all masters.  Your servers will very likely become inconsistent with statement replication.  The fix is to enable row-based replication using binlog-format=row in my.cnf.  Row replication transfers the exact row updates from each master to the others and eliminates ambiguity.
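As a minimal sketch, the option goes in the [mysqld] section of my.cnf on each master (the exact file layout varies by installation):

[mysqld]
binlog-format = row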

3. Prevent Key Collisions on INSERTs

For applications that use auto-increment keys, MySQL offers a useful trick to ensure that such keys do not collide between masters: the auto-increment-increment and auto-increment-offset parameters in my.cnf.  The following example ensures that auto-increment keys start at 1 and increment by 4 to give values like 1, 5, 9, etc. on this server.

server-id=1
auto-increment-offset = 1
auto-increment-increment = 4
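
A hypothetical second master keeps the same increment but uses a different offset, so the two servers can never generate the same key (server 1 produces 1, 5, 9, ... while server 2 produces 2, 6, 10, ...):

server-id=2
auto-increment-offset = 2
auto-increment-increment = 4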

This works so long as your applications use auto-increment keys faithfully.  However, any table that either does not have a primary key or where the key is not an auto-increment field is suspect.  You need to hunt them down and ensure the application generates a proper key that does not collide across masters, for example using UUIDs or by putting the server ID into the key.   Here is a query on the MySQL information schema to help locate tables that do not have an auto-increment primary key. 

SELECT t.table_schema, t.table_name 
  FROM information_schema.tables t 
    WHERE NOT EXISTS 
      (SELECT * FROM information_schema.columns c
       WHERE t.table_schema = c.table_schema  
         AND t.table_name = c.table_name
         AND c.column_key = 'PRI'
         AND c.extra = 'auto_increment')
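
For tables the query turns up, one sketch of the UUID approach mentioned above looks like the following; the table and column names are made up for illustration:

CREATE TABLE app_event (
  id CHAR(36) NOT NULL PRIMARY KEY,   -- UUID stored as text
  payload VARCHAR(255)
);
INSERT INTO app_event (id, payload) VALUES (UUID(), 'example payload');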

4. Beware of Semantic Conflicts in Applications

Neither Tungsten nor MySQL native replication can resolve conflicts, though we are starting to design this capability for Tungsten.  You need to avoid them in your applications.  Here are a few tips as you go about this.

First, avoid obvious conflicts.  These include inserting data with the same keys on different masters (described above), updating rows in two places at once, or deleting rows that are updated elsewhere.  Any of these can cause errors that will break replication or cause your masters to become out of sync.  The good news is that many of these problems are not hard to detect and eliminate using properly formatted transactions.  The bad news is that these are the easy conflicts.  There are others that are much harder to address.  

For example, accounting systems need to generate unbroken sequences of numbers for invoices.  A common approach is to use a table that holds the next invoice number and increment it in the same transaction that creates a new invoice.  Another accounting example is reports that need to read the value of accounts consistently, for example at monthly close.  Neither example works off-the-shelf in a multi-master system with asynchronous replication, as they both require some form of synchronization to ensure global consistency across masters.  These and other such cases may force substantial application changes.  Some applications simply do not work with multi-master topologies for this reason. 

5. Remove Triggers or Make Them Harmless

Triggers are a bane of replication.  They conflict with row replication if they run by accident on the slave.  They can also create strange conflicts due to weird behavior/bugs (like this) or other problems like needing definer accounts present.  MySQL native replication turns triggers off on slaves when using row replication, which is a very nice feature that prevents a lot of problems.  

Tungsten on the other hand cannot suppress slave-side triggers.  You must instead alter each trigger to add an IF statement that prevents the trigger from running on the slave.  The technique is described in the Tungsten Cookbook.  It is actually quite flexible and has some advantages for cleaning up data because you can also suppress trigger execution on the master.  
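As a rough sketch of the idea (not the Tungsten Cookbook's exact recipe), the guard below tests a hypothetical @disable_triggers session variable that the slave-side applier or a DBA can set before applying changes; the emp and emp_audit tables are made up:

DELIMITER //
CREATE TRIGGER emp_audit_ins AFTER INSERT ON emp
FOR EACH ROW
BEGIN
  -- Skip the trigger body when the session has marked triggers as disabled.
  IF @disable_triggers IS NULL THEN
    INSERT INTO emp_audit (emp_id, changed_at) VALUES (NEW.id, NOW());
  END IF;
END//
DELIMITER ;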

You should regard all triggers with suspicion when moving to multi-master.  If you cannot eliminate triggers, at least find them, look at them carefully to ensure they do not generate conflicts, and test them very thoroughly before deployment.  Here's a query to help you hunt them down: 

SELECT trigger_schema, trigger_name 
  FROM information_schema.triggers;

6. Have a Plan for Sorting Out Mixed Up Data

Master/slave replication has its discontents, but at least sorting out messed up replicas is simple: re-provision from another slave or the master.  Not so with multi-master topologies--you can easily get into a situation where all masters have transactions you need to preserve and the only way to sort things out is to track down differences and update masters directly.   Here are some thoughts on how to do this.
  1. Ensure you have tools to detect inconsistencies.  Tungsten has built-in consistency checking with the 'trepctl check' command.  You can also use the Percona Toolkit pt-table-checksum to find differences.  Be forewarned that neither of these works especially well on large tables and may give false results if more than one master is active when you run them.  
  2. Consider relaxing foreign key constraints.  I love foreign keys because they keep data in sync.  However, they can also create problems for fixing messed up data, because the constraints may break replication or make it difficult to go table-by-table when synchronizing across masters.  There is an argument for being a little more relaxed in multi-master settings. 
  3. Switch masters off if possible.  Fixing problems is a lot easier if you can quiesce applications on all but one master.  
  4. Know how to fix data.  Being handy with SQL is very helpful for fixing up problems.  I find SELECT INTO OUTFILE and LOAD DATA INFILE quite handy for moving changes between masters, as in the sketch after this list.  Don't forget SET SESSION SQL_LOG_BIN=0 to keep changes from being logged and breaking replication elsewhere.  There are also various synchronization tools like pt-table-sync, but I do not know enough about them to make recommendations.  
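Here is a rough sketch of moving a set of rows by hand; the table, filter, and file name are hypothetical.  Disable binary logging for the session first so the repair itself does not replicate:

-- On the master that has the correct rows:
SET SESSION SQL_LOG_BIN=0;
SELECT * INTO OUTFILE '/tmp/emp_fix.csv'
  FROM emp WHERE dep_id = 35;

-- Copy the file to the other master, then load it there:
SET SESSION SQL_LOG_BIN=0;
LOAD DATA INFILE '/tmp/emp_fix.csv' REPLACE INTO TABLE emp;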
At this point it's probably worth mentioning commercial support.  Unless you are a replication guru, it is very comforting to have somebody to call when you are dealing with messed up masters.  Even better, expert advice early on can help you avoid problems in the first place.

(Disclaimer:  My company sells support for Tungsten so I'm not unbiased.  That said, commercial outfits really earn their keep on problems like this.)

7. Test Everything

Cutting corners on testing for multi-master can really hurt.  This article has described a lot of things to look for, so put together a test plan and check for them.  Here are a few tips on procedure:
  1. Set up a realistic pre-prod test with production data snapshots.  
  2. Have a way to reset your test environment quickly from a single master, so you can get back to a consistent state to restart testing. 
  3. Run tests on all masters, not just one.  You never know if things are properly configured everywhere until you try. 
  4. Check data consistency after tests.  Quiesce your applications and run a consistency check to compare tables across masters. 
It is tempting to take shortcuts or slack off, so you'll need to find ways to improve your motivation.  If it helps, picture yourself explaining to the people you work for why your DBMS servers have conflicting data with broken replication, and the problem is getting worse because you cannot take applications offline to fix things.  It is a lot easier to ask for more time to test.  An even better approach is to hire great QA people and give them time to do the job right.

Summary

Before moving to a multi-master replication topology you should ask yourself whether the trouble is justified.  You can get many of the benefits of multi-master from a system-of-record architecture with a lot less heartburn.  That said, an increasing number of applications do require full multi-master across multiple sites.  If you operate one of them, I hope this article is helpful in getting you deployed or improving what you already have.

Tungsten does a pretty good job of multi-master replication already, but I am optimistic we can make it much better.  There is a wealth of obvious features around conflict resolution, data repair, and up-front detection of problems that will make life better for Tungsten users and reduce our support load.  Plus I believe we can make it easier for developers to write applications that run on multi-master DBMS topologies.  You will see more about how we do this in future articles on this blog.

Apr 22, 2012

Replication Is Bad for MySQL Temp Tables

Experienced MySQL DBAs know that temp tables cause major problems for MySQL replication.  It turns out the converse is also true:  replication can cause major problems for temporary tables.

In a recent customer engagement we enabled Tungsten Replicator with a MySQL application that originally ran on a server that did not use replication.  QA promptly discovered reports that previously ran in 10 seconds were now running in as many minutes.  It turned out that the reports used temp tables to assemble data, and these were being written into the master binlog.  This created bloated binlogs and extremely slow reports.  We fixed the problem by enabling row replication (i.e., binlog-format=row in my.cnf).

A common DBA response to temp table problems is to try to eliminate them completely, as suggested in the excellent High Performance MySQL, 3rd Edition. (See p. 502.)  Elimination is a good philosophy when applications use temp tables to generate updates.  However, it does not work for reporting.  Temp tables allow you to stage data for complex reports across a series of transactions, then pull the final results into a report writer like JasperReports.  This modular approach is easy to implement and maintain afterwards.  Eliminating temp tables in such cases can create an unmaintainable mess.

The real solution with report temp tables is to keep them out of the master binlog.  Here is a list of common ways to do so.  Let me know if you know others.

* Turn off binlog updates.  Issue 'SET SESSION SQL_LOG_BIN=0' when generating reports, as sketched after this list.  The downside is that it requires SUPER privilege to set.  Also, if you make a code mistake and update normal tables with this setting, your changes will not be replicated.

* Use a non-replicated database.  Configure the master my.cnf with binlog-ignore-db as follows to ignore any update (including on temp tables) that is issued when database 'scratch' is the default database:

binlog_ignore_db = scratch

This approach does not require special privileges.  However coding errors or connection pool misconfigurations are obvious liabilities.  Your application must either connect to the scratch database or issue an explicit use command. Otherwise, temp table operations will be logged, as in the following example:

use not_scratch;
create temporary table scratch.report1_temp(name varchar(256), entry_time date, exit_time date);

* Use a slave with the binlog disabled.  Remove the log-bin option from my.cnf.  This works well if you have extra reporting slaves that are caught up.  However, it may not work if the reports must be fully up-to-date or you need the ability to promote the slave quickly to a master, in which case the binlog must be enabled.  

* Use row replication.  You can set row replication at the session level using 'SET SESSION binlog_format=row', which requires SUPER privilege, or overall by setting binlog-format in my.cnf.  In this case CREATE TEMPORARY TABLE and updates on temp tables do not appear in the binlog at all.  The downside of enabling row replication fully is that it can lead to bloated logs and blocked servers if you have very large transactions.  SQL operations like DELETE that affect multiple rows are stored far more compactly in statement replication.  Also, reloading mysqldump files can be very slow in row replication compared to statement replication, which can handle block inserts generated by the --extended-insert option.
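As a concrete illustration of the first option above, here is a rough sketch of a report session that stays out of the binlog; the visits table and the staging query are hypothetical, and SUPER privilege is required:

SET SESSION SQL_LOG_BIN=0;
CREATE TEMPORARY TABLE report1_temp(name VARCHAR(256), entry_time DATE, exit_time DATE);
INSERT INTO report1_temp
  SELECT name, entry_time, exit_time FROM visits WHERE exit_time IS NOT NULL;
-- ...stage further intermediate results, then pull the final rows into the report writer...
DROP TEMPORARY TABLE report1_temp;
SET SESSION SQL_LOG_BIN=1;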

The proper solution to keep replication from hurting your use of temp tables will vary depending on your application as well as the way you run your site.  For my money, though, this is a good example of where row replication really helps and deserves a closer look.  

MySQL could use some feature improvements in the area of temp tables and replication.  I find it surprising that mixed mode replication does not fully suppress temp table binlog updates.  Only row replication does so.   Second, it would be great to have a CREATE TABLE option to suppress logging particular tables to the binlog.  This would allow applications to make the logging decision at schema design time.  Finally, global options to suppress binlogging of specific table types, such as temp tables and MEMORY tables would be useful.  Perhaps we will see some of these in future MySQL releases.  

Apr 14, 2012

Oracle Missed at MySQL User Conference...Not!

The MySQL UC this past week was the best in years.   Percona did an outstanding job of organizing the main Percona Live event that ran Tuesday through Thursday.  About 1000 people attended, which is up from the 800 or so at the O'Reilly-run conference in 2011.  There were also excellent follow-on events on Friday for MariaDB/SkySQL, Drizzle, and Sphinx.

What made this conference different was the renewed energy around MySQL and the number of companies using it.  
  1. Big web properties like Facebook, Twitter, Google, and Craigslist continue to anchor the MySQL community and drive innovation from others through a combination of funding,  encouragement, and patches. 
  2. Many new companies we have not heard from before like Pinterest, BigDoor, Box.net, and Constant Contact talked about their experience building major new applications on MySQL.  
  3. The vendor exhibition hall at Percona Live was hopping.  Every vendor I spoke to had a great show and plans to return next year.  There is great innovation around MySQL from many creative companies.  I'm very proud my company, Continuent, is a part of this. 
  4. The demand for MySQL expertise was completely out of hand.  So many talks ended with "...and we are hiring" that it became something of a joke.  The message board was likewise packed with help wanted ads.  
When Oracle acquired Sun Microsystems a couple of years ago, it triggered a lot of uncertainty about the future of MySQL.  This concern turns out to be unfounded.  Oracle does excellent engineering work, especially on InnoDB, but had no involvement either official or unofficial at the conference.  This was actually a good thing.  

By not participating, Oracle helped demonstrate that MySQL is no longer dependent on any single vendor and has taken on a real life of its own driven by the people who use it. MySQL fans owe Oracle a vote of thanks for not attending this year. Next year I hope they will be back to join the fun.

p.s., It has come to my attention since writing this article that 800 may not be correct attendance for the O'Reilly 2011 conference.  The 1000 figure is from Percona.  Speaking as an attendee they seemed about the same size.  Please feel free to comment if you have accurate numbers.  

Apr 3, 2012

Solving the Cloud Database Memory Conundrum

Cloud databases have a memory problem.   Continuent has been doing a lot of Amazon deployments lately, and it is becoming apparent that memory utilization in those environments is more than just an inconvenience.  In this article I would like to discuss the memory problem that we see in customer implementations and some new features of Tungsten Enterprise that help alleviate it for MySQL.

The Cloud Memory Problem and Database Arrays

As I discussed in a recent article about prefetch, the amount of RAM allocated to the InnoDB buffer pool is one of the principal determinants of MySQL performance.  The speed difference between using a page in the buffer pool vs. getting it from storage is commonly about 100:1 on RAIDed disk. The other determinant is working set size, i.e., the percentage of pages that need to be memory-resident for adequate performance. Working set size is variable and depends on your query patterns as well as the level of indexing on tables. These two variables set a limit on the amount of data you can manage and still access information quickly. Here is a table that shows the relationship.


Max GB of manageable storage

Buffer Pool Size (GB)   5% resident   10% resident   25% resident   50% resident
                  15            300            150             60             30
                  30            600            300            120             60
                  60           1200            600            240            120
                 120           2400           1200            480            240

The largest instance currently available from Amazon EC2 is a Quadruple Extra Large, which offers 68GB of RAM.  Let's assume we allocate 85% of that to the buffer pool, which is about 58GB.   We'll be generous and say 60GB so it matches my table. Assuming our application performs reasonably with 10% of pages resident we can then manage a maximum of 600GB of stored data per server.

The fact that EC2 instances are rather small understates the difficulty with RAM in Amazon.  You also need more of it.  Amazon EBS is slower than typical on-premise storage, such as your friendly direct-attached RAID, and has highly variable performance as Baron Schwartz pointed out.  You might therefore be inclined to double the working set ratio to 20%, since I/O is so costly you need a bigger working set to minimize reads from storage.  Quadruple Extra Large instances then max out at 300GB of managed data.  I'm not forgetting SSDs, which are slowly appearing in cloud environments.  They alter the working set ratios but introduce other issues, for example related to cost and capacity.  It's also not clear how much faster they really are in shared environments like Amazon.

This simple storage math makes it obvious that managing data in the cloud requires a mental shift from fewer but larger DBMS servers to groups of much smaller servers, which we can describe as database arrays. Many small SaaS businesses generate a Terabyte of customer data in a year, hence would provision 4 new 300GB servers annually.  Mobile and market automation apps generate many Terabytes per year.  I recently did a capacity calculation for a customer where we hit 50 servers without even considering replicas to ensure availability.   

Database arrays therefore address the memory problem by partitioning data across many smaller servers.   Many applications already partition into shards for other reasons.  In this case, the main question is how the applications find data easily.  Locating "virtualized databases" is a non-trivial problem, especially when you consider that servers may fail, move, or go offline for maintenance.  The more servers you have, the more likely these special cases become.  This is where Tungsten Enterprise comes in. 

Tungsten Enterprise 1.5 Features for Cloud Database Arrays

Tungsten Enterprise builds clusters by taking a set of off-the-shelf DBMS servers linked by replication and making them look like a single virtualized DBMS server to applications.   We call this server a data service.  The key to the single-DBMS illusion is the Tungsten Connector, a fast proxy that sits between applications and database hosts.   It routes connections into the right type of server, for example a master for writes or a slave for reads.  It also allows you to switch masters without stopping applications or losing data.

Tungsten Enterprise already offers several nice properties for cloud operation.  The Connector is software-only and does not use VIPs, which are not supported in Amazon anyway.  Also, the fact that you can switch master and slaves without stopping applications makes it possible to "escape" slow EBS volumes.  That said, previous Tungsten versions limited connections to a single data service per connector, which made database arrays hard to support properly.

Tungsten Enterprise 1.5 adds two features to help with constructing and managing arrays.  First, our colleague Jeff Mace created a great Ruby-based install program called tpm that can install complex cluster topologies from the command line.  If you have used the tungsten-installer program on Tungsten Replicator, you have an idea what is on the way. (If not, Jeff will be talking about installations at the Percona Live MySQL Conference on April 11.)   Efficient command-line installations make it easy to set up and manage a lot of clusters in parallel.

Second, thanks to work by Ed Archibald and Gilles Rayrat, the Connector now supports multi-service connectivity.  Connectors previously kept track of cluster state by connecting to one of a list of managers for the local data service.   We extended this to allow Connectors to track state in multiple data services. We developed the feature to support disaster recovery sites, which are another Tungsten 1.5 feature.  As it turns out, though, multi-service connectivity is extremely useful for database arrays in the cloud.  The following diagram compares single and multi-data service connectivity.


As with a single data service, the Connector receives state changes providing the location of the data service master and slaves as well as whether each is online or offline.  The Connector uses a simple routing scheme based on logins to route SQL connections to the correct server for whatever it is doing.  Beyond that, Tungsten Enterprise takes care of the master switching and read load balancing necessary to use an array efficiently.  

Setting up Multi-Service Connectivity

Multi-service access is trivial to set up.  This is documented more fully in the forthcoming Tungsten Enterprise 1.5 documentation, but here is the basic idea.

First, each Tungsten Enterprise installation contains a file called dataservices.properties in the cluster-home/conf directory.  The Connector uses this file to locate data service managers in order to get cluster state changes.  To enable multi-service connectivity, just ensure there are entries for the managers of each data service in your array before starting the local Connector service.

cluster1=pr1c1.bigdata.com,pr2c1.bigdata.com,pr3c1.bigdata.com
cluster2=pr1c2.bigdata.com,pr2c2.bigdata.com,pr3c2.bigdata.com
cluster3=pr1c3.bigdata.com,pr2c3.bigdata.com,pr3c3.bigdata.com
...

Second, the Connector uses a file called user.map, located in tungsten-connect/conf, to define logins.  Edit this file and add a different login for each cluster to user.map.  Save the file and the connector will pick it up automatically.  

# user password service
cluster1 secret cluster1
cluster2 secret cluster2
cluster3 secret cluster3
...

That's about it.  Applications need to make one change, namely to use a different login for each data service.  For some applications, that is quite simple and can be done in a few minutes.  I found to my surprise that some customer applications even do it already, which is very convenient.  For others, it may require more extensive changes.   The good news is that improvements are on the way.  

What's Next

Tungsten Enterprise 1.5 is a solid step toward enabling large-scale data management using MySQL database arrays.  However, we can do much more.  Obviously, it would be easier if the Connector routed to the right DBMS server using the database name on the connect string or by detecting use commands in SQL.  That's on the way already.

More generally, the whole point of Tungsten is to make off-the-shelf DBMS work better by adding redundancy and optimizing use of resources without forcing users to migrate or rearrange data.  Memory is a key resource in cloud environments and deserves special attention.  Tungsten connectivity looks like a powerful complement to the caching that already occurs within databases.  It's also a natural extension of the work we have been doing to speed up replication.

Looking to the future, we are investigating more intelligent load balancing that sends queries to servers where the pages they need are most likely to be already in the buffer pool.  This type of optimization can double or triple effective buffer pool size in the best cases.  Query caching also offers potential low-hanging fruit.  Finally, management parallelization is critical for array operation.  I foresee a long list of Tungsten improvements that are going to be fun to work on and that go a long way to solving the cloud database memory conundrum.

p.s., Continuent has a number of talks scheduled on Tungsten-related topics at the Percona Live MySQL Conference and the SkySQL MariaDB Solutions Day April 10-13 in Santa Clara.  For more on Tungsten Enterprise and EC2 in particular, check out Ed Archibald's talk on April 11.  Giuseppe Maxia and I will also touch on these issues in our talk at the SkySQL/MariaDB Solutions Day.

Apr 1, 2012

Disproving the CAP Theorem

Since the famous conjecture by Eric Brewer and proof by Nancy Lynch et al., CAP has given the world countless learned discussions about distributed systems and many a well-funded start-up.  Yet who truly understands what CAP means?  Even a cursory survey of the blogosphere shows profound disagreement about the meaning of terms like CP, AP, and CA in real systems.  Those who disagree on CAP include some of the most illustrious personages of the database community.

We can therefore state with some confidence that CAP is confusing. Yet this observation itself raises deeper questions.  Is CAP merely confusing?  Or is it the case that, as with other initially accepted but now doubtful ideas like the Copernican model, evolution, and continental drift, CAP is actually not correct?  Thoughtful readers will agree this question has not received anywhere near the level of scientific scrutiny it deserves.

Fortunately for science private citizens like me have been forging ahead without regard to the opinions of so-called experts or even common sense.  My work on CAP relies on two trusted analytic tools of database engineers over the legal drinking age: formal logic and beer.  Given the nature of the problem we should obviously use a minimum of the former and a maximum of the latter.  We have established that CAP is confusing.  To understand why we must now deepen our confusion and study its habits carefully.  Other investigators have used this approach with great success.

Let us begin by translating the terms of CAP into the propositional calculus.  The terms C (consistency), A (availability) and P (partition tolerance) can be used to state the famous "two out of three" of CAP using logical implication as shown below.

(1) A and P => not C
(2) P and C => not A
(3) C and A => not P

So far so good.  We can now dispense briefly with logic and turn to confusion.  It seems there is difficulty distinguishing CA systems from CP systems, i.e., that they are therefore equivalent. This is a key insight, which we can express formally as follows:

(4) C and A <=> C and P

which further reduces to

(5) A <=> P

In short our confusion has led us directly to the invaluable result that A and P, hence availability and partition tolerance, are exactly equivalent!  I am sure you share my excitement at the direction this work is taking.  We can now, through a trivial substitution of A for P in equation 2 above, reveal the following:

(6) A and C => not A
(7) C => (A => not A)

We have just shown that consistency implies that any system that is available is also unavailable simultaneously.  This is an obvious contradiction, which means the vast logical edifice on which CAP relies crumbles like a soggy nacho.  Considering the amount of beer consumed at the average database conference it is surprising nobody thought of this before.

At this point we can now raise the conversation up a level from looking for spare change under the table and comment on the greater meaning of our results in the real world.  Which is the following: Given the way most of us programmers write software it's a wonder CAP is an issue at all. Honestly, I can't even get calendar programs to send invitations to each other across time zones.  I plan to bring the combustible analytic capabilities of logic and beer to bear on the mystery of time at a later date.  For now we can just speculate it is due to a mistaken design based on CAP.