Tuesday, February 3, 2009

Thoughts on Replication from 2 Feb SFO MeetUp (slides too)

Last night I gave a presentation on Tungsten Replicator to the SFO MySQL MeetUp. It was really fun--lots of excellent questions and ideas from an audience that knows MySQL in and out. Here are slides from the talk from our trusty S3 document repository.

There were many things to think about on the BART ride home to Berkeley but here are a couple that really struck me, in part because I think we can do something about them.

First, database replication needs to expose as much monitoring data as possible. The kind of data you get or can infer from SHOW SLAVE STATUS is just the beginning. This came through really strongly from people like Erin who are using load-balancers and other automated techniques for distributing load to replicas using decision criteria like availability and latency of slaves. We have heard this from people like Peter Zaitsev as well--it's starting to sink in. :)

Second, it's critical to think through the database failover problem fully. There seem to be two axes to this problem. The first is what you might call call a vertical axis where you think about a single component in the system--here the database. You have to cover not just swapping replication flow but also enabling/disabling writes, triggers, and batch jobs, as well as application specific tasks. (Lots of good comments from the audience here as well.)

The other failover "axis" is horizontal where you think about the ensemble of databases, proxies, applications, and utilities that make up the cluster. The issues to cover here including sending commands in parallel, ensuring everyone receives them, and dealing with a wide range of possible failure conditions ranging from network partitions to failure of individual OS-level commands. We plan to unveil a solution shortly in the form of the Tungsten Manager, which uses group communications to attack this problem. I can't wait to get feedback on that.

p.s., Baron Schwartz yesterday posted a very interesting article on cache issues with failover on InnoDB. Just another example of how broad the failover problem really is.

Sunday, February 1, 2009

Simple HA with PostgreSQL Point-In-Time Recovery

Point-in-time recovery or PITR is one of my favorite PostgreSQL features. Most database servers have a transaction log for recovery. (MySQL is one of the few exceptions; more on that shortly.) However, PostgreSQL PITR allows and even encourages users to manipulate the logs to perform some very useful tasks:

* Creating a bit-for-bit backup of a server without halting
* Restoring a server to a specific point in time
* Creating a warm standby server

The third task is especially interesting because it's so common. One of the most pronounced trends in computing is the decreasing cost of computing power through Moore's law improvements and virtualization. Standby solutions nowadays look like a good investment compared to the cost having a server down or worse yet losing a lot of data.

PostgreSQL PITR makes warm standby relatively easy to set up. You enable log archiving, which copies the completed WAL files to an archive directory, then transfer those files to a host where another PostgreSQL server is in continuous recovery mode. This is easier to understand with a picture, so here it is.

Two key points: first, warm standby is based on log files, so as long as you do not mess up the setup, your recovered data will always be consistent. Second, failover is really fast, because your standby server is in continuous recovery mode and generally will be on the last file or two, hence almost ready to go at all times.

There are some details but they are not very hard. The first step is to enable archiving. This copies log buffers into an archive directory from the standard pg_xlogs directory. Archiving is easy to enable in postgresql.conf, as shown by the following example:
archive_mode = on               # enable archiving
archive_command = 'cp -v %p /data/pgsql/archives/%f'
archive_timeout = 300 # timeout to close buffers
Next you need to transfer the files as they appear in the archive directory. For that you can use rsync, as shown below. Put it in a cron job on the standby so that files transfer automatically between hosts every few minutes.
rsync -avz --remove-sent-files prod1:/data/pgsql/archives/ \
/data/pgsql/archives/ > /dev/null
Finally, you copy the files into the pg_xlogs directory of a PostgreSQL server that is running recovery. This is the most complicated part. To make a long story short, you set up a carbon copy of server data on the standby and then create a file called recovery.conf with a command to copy the files from the archive directory to pg_xlogs. That command is usually pg_standby, a handy little program written by Simon Riggs of 2nd Quadrant. Here's a recovery.conf file with typical pg_standby configuration.
restore_command = '/usr/local/pgsql/bin/pg_standby
-l -d -s 2 -t /tmp/pgsql.trigger /data/pgsql/archives
%f %p %r 2>>standby.log'
Basically pg_standby blocks until it either finds the next log file requested by the server or sees a file named /tmp/pgsql.trigger. Then recovery is done and your standby server pops up and is open for business.

The PostgreSQL docs provide a good overview of warm standby. However, you need to get the details exactly right for recovery to work without losing data. I omitted these from the description but wrote a script called standby.sh to illustrate. It avoids some common gotchas in setup such as not cleaning up directories on the standby server or leaving the pg_standby trigger file lying around for unsuspecting programs to trip over. I'm sure there are better ways but this works for me.

Warm standby is thus a standard HA technique used in many PostgreSQL installations. It should be used in many more. PostgreSQL 8.4 is adding "hot standby", which allows users to open up the standby server for reads, similar to Oracle Active Data Guard, a pricey but very flexible availability solution for Oracle.

There's also an effort afoot to add synchronous replication of log changes to the standby, but it looks as if that will be delayed until PostgreSQL 8.5. If you really want to protect against data loss you can replicate the pg_logs directory using DRBD, then mount and copy any missing log files during failover.

So as you can see, PostgreSQL has a really nice solution to warm standby that is fast and ensures data consistency with options to eliminate data loss almost completely. Can MySQL do this? The answer is "kind of." The problem is that MySQL servers do not have a transaction log. Logging is the responsibility of individual storage engines. Some, like MyISAM, do not even have a log. The closest equivalent in MySQL is the binlog, which is used for replication.

In fact, a lot of people use MySQL master/master replication to implement warm standby. You can in theory reduce data loss by copying master binlogs over to another host and having that server read from them to distributed "missed" events to the slave(s). This is complicated and there are some two-phase commit problems with MySQL binlogs that mean sooner or later there will be data inconsistencies. Most MySQL users either just accept the chance of data loss or use DRBD to sync the entire master over to another host.

On balance, the PostgreSQL warm standby solution using PITR compares very favorably to MySQL as well as many commercial databases. Two things that would help are better management and proxies to help redirect reads to hot standby servers, once these are available. Our development team is working on both of these. As usual, stay tuned...

P.S., In response to some good comments from readers the comparison between MySQL and PostgreSQL replication deserves a fuller treatment in a separate article. PostgreSQL gives you a bit-for-bit copy of data, whereas MySQL makes a logical copy using SQL. PostgreSQL warm standby is therefore transparent to applications, whereas MySQL replication has some prerequisites for SQL schema and operations that applications must observe.

Scaling Databases Using Commodity Hardware and Shared-Nothing Design