Dec 26, 2009

Proving Master/Slave Clusters Work and Learning along the Way

2009 has been a big year for Tungsten. In January we had (barely) working replication for MySQL. It had some neat features like global IDs and event filters, but to be frank you needed imagination to see the real value. Since then, Tungsten has grown into a full-blown database clustering solution capable of handling a wide range of user problems. Here are just a few of the features we completed over the course of the year:
  • Autonomic cluster management using business rules to implement auto-discovery of new databases, failover, and quick recovery from failures
  • Built-in broadcast monitoring of databases and replicators
  • Integrated backup and restore operations
  • Pluggable replication management, proven by clustering implementations based on PostgreSQL Warm Standby and Londiste
  • Multiple routing mechanisms to provide seamless failover and load balancing of SQL
  • Last, but not least, simple command line installation to configure and start Tungsten in minutes
You can see the results in our latest release, Tungsten 1.2.1, which comes in both open source and commercial flavors. (See our downloads page to get software as well as documentation.)

In the latter part of 2009 we also worked through our first round of customer deployments, which was an adventure but helped Tungsten grow enormously. Along the way, we confirmed a number of hunches and learned some completely new lessons.
  • Hardware is changing the database game. In particular, performance improvements are shifting clustering in the direction of loosely coupled master/slave replication rather than tightly coupled multi-master approaches. As I laid out in a previous article, the problem space is shifting from database performance to availability, data protection, and utilization.
  • Software as a Service (SaaS) is an important driver for replication technology. Not only is the SaaS sector growing, but even small SaaS applications can result in complex database topologies that need parallel, bi-directional, and cross-site replication, among other features. SaaS business economics tend to drive building these systems on open source databases like MySQL and PostgreSQL. By supporting SaaS, you support many other applications as well.
  • Cluster management is hard but worthwhile. Building distributed management with no single points-of-failure is a challenging problem and probably the place where Tungsten still has the most work to do. Once you get it working, though, it's like magic. We have been focused on trying to make management procedures not just simple but wherever possible to do away with them completely by making the cluster self-managing.
  • Business rules rock. We picked the DROOLS rule engine to help control Tungsten and make it automatically reconfigure itself when data sources appear or fail. The result has been an incredibly flexible system that is easy to diagnose and extend. Just one example: floating IP address support for master databases took 2 hours to implement using a couple of new rules that work alongside the existing rule set. If you are not familiar with rules technology, there is still time to make a New Year's resolution to learn it in 2010. It's powerful stuff.
  • Clustering has to be transparent. I mean really transparent. We were in denial on this subject before we started to work closely with ISPs, where you don't have the luxury of asking people to change code. Tungsten Replicator is now close to a drop-in replacement for MySQL replication as result. We also implemented proxying based on packet inspection rather than translation and re-execution to raise throughput and reduce incompatibilities visible to applications.
  • Ship integrated, easy-to-use solutions. We made the mistake of releasing Tungsten into open source as a set of components that users had to integrate themselves. We have since recanted. As penance we now ship fully integrated clusters with simple installation procedures even for open source editions and are steadily extending the installations to cover not just our own software but also database and network configuration.
Beyond the features and learning experiences the real accomplishment of 2009 was to prove that integrated master/slave clusters can solve a wide range of problems from data protection to HA to performance scaling. In fact, what we have implemented actually works a lot better than I expected when we began to design the system back in 2007. (In case this sounds like a lack of faith, plausible ideas do not not always work in the clustering field.) If you have not tried Tungsten, download it and see if you share my opinion.

Finally, keep watching Tungsten in 2010. We are a long way from running out of ideas for making Tungsten both more capable and easier to use. It's going to be a great year.

Oct 31, 2009

Replicating from MySQL to Drizzle and Beyond

Drizzle is one of the really great pieces of technology to emerge from the MySQL diaspora--a lightweight, scalable, and pluggable database for web applications. I am therefore delighted that Marcus Erikkson has published a patch to Tungsten that allows replication from MySQL to Drizzle. He's also working on implementing Drizzle-to-Drizzle support, which will be very exciting.

Marcus has submitted the patch to us and I have reviewed the code. It's quite supportable, so I plan to integrate it as soon as we are done with our next Tungsten release, which will post around 5 November. You will be able to build and run it using our new community builds.

This brings up a question--what about replicating from MySQL to PostgreSQL? What about other databases? I get the PostgreSQL replication question fairly often but it may be a while before our in-house team can implement plug-in support for it. Anybody want to submit a patch in the meantime? Post in the Tungsten forums if you have ideas and need help to get the work done. Tungsten Replicator code is very modular and it is not hard to add new database support.

Meanwhile, go Marcus!!

Community Builds for Tungsten Clustering

It's been almost two months since I have posted anything on the Scale-Out Blog, as our entire team has been heads-down working on Tungsten. We now have a number of accomplishments that are worth writing articles about. Item one on that list is community builds for Tungsten clusters.

Tungsten community builds offer a bone-simple process to check out and build Tungsten clustering software. The result is a fully integrated package that includes replication, management, monitoring, and SQL routing. The community builds work for MySQL 5.0 and 5.1 and also allow you to set up basic replication from MySQL to Oracle.

Community builds do not include much logic for autonomic management, including automated failover and sophisticated rules that keep databases up and running rain or shine. Those and other features like floating IP address support are part of the commercial Tungsten software. PostgreSQL and Oracle-to-Oracle support is also commercial only at least for the time being.

Community builds do include our standard installation process, which allows you to set up a working cluster a few minutes. You can back up and restore datebases, check liveness of cluster members, failover master databases for maintenance and a lot of other handy features. There is also full documentation, located here.

To get started, you need a host running Mac OS X, Linux, or Solaris that meets the following prerequisites. On Linux you can usually satisfy these requirements using Yum or Apt-get if the required software is not already there.
  • Java JDK 1.5 or higher.
  • Ant 1.7.0 or higher for builds
  • Subversion. We use version 1.6.1
  • MySQL 5.0 or 5.1 (only on hosts where cluster is installed)
  • Ruby 1.8.5 or greater (only on hosts where cluster is installed)
Now you can grab the software and do a build. Make a work directory, cd to it, and enter the following commands. (Due to truncation on the blog the SVN URL looks a little funny. Don't be fooled.)
svn checkout \
https://tungsten.svn.sourceforge.net/\
svnroot/tungsten/trunk/community

cd community
./release-community.sh # (Press ENTER when prompted)
The release-community.sh script checks out most of the Tungsten code for you and does a build. IMPORTANT NOTE: The command shown above builds SVN HEAD, which means you may have a life of adventure. You can also build off branches which are more or less stable. Look at the available config files in the community directory.

After the build finishes, you have ready-to-install clustering software. You can scp the resulting tar.gz file out to another host or just cd directly into the build itself as shown below and run the configure script, which sets up Tungsten software on a single host.
cd build/tungsten-community-2009-1.2
./configure
You may need to read the manuals so you get all the answers right. The installation manual is posted here at www.continuent.com. You'll also need to look at the Replication Guide, Chapter 2 to see how to set up MySQL properly. We'll do that automatically in the future, but for now it's help yourself. (Don't worry: the database set-up is easy.)

To make the cluster interesting you should install on at least a couple of hosts. Here's what an installed cluster looks like using the Tungsten cluster control (cctrl) program.
[tungsten@centos5a tungsten-community-2009-1.2]$ tungsten-manager/bin/cctrl
[LOGICAL] /cluster/comm/> ls

COORDINATOR[centos5a:MANUAL]

ROUTERS:
+-----------------------------------------------------------------------+
|NONE |
+-----------------------------------------------------------------------+

DATASOURCES:
+-----------------------------------------------------------------------+
|centos5a(master:ONLINE, progress=3) |
+-----------------------------------------------------------------------+
| REPLICATOR(role=master, state=ONLINE) |
| DATASERVER(state=ONLINE) |
+-----------------------------------------------------------------------+

+-----------------------------------------------------------------------+
|centos5b(slave:ONLINE, progress=3, latency=0.0) |
+-----------------------------------------------------------------------+
| REPLICATOR(role=slave, master=centos5a, state=ONLINE) |
| DATASERVER(state=ONLINE) |
+-----------------------------------------------------------------------+
Starting from scratch and pulling code from SourceForge, it takes me about 30 minutes to get to an installed cluster with two nodes. At this point you have access to a very powerful set of tools to protect data, keep your databases available, and scale performance. Look at the manuals. Try it out. If you have questions or feedback, post them in the Tungsten forums. In the meantime, have fun with your database cluster.

p.s., We will post binary builds next week. The current build is in final release checks, so you may notice a few problems--I hit a Ruby warning on configuration that will be fixed shortly.

Sep 5, 2009

Tungsten Replicator 1.0.3 Release

Tungsten Replicator version 1.0.3 is now released and available as a download from Source Forge. Tungsten Replicator provides advanced, platform-independent replication for MySQL 5.0/5.1 with global transaction IDs, crash-safe slaves, flexible filtering, and built-in consistency checking. The 1.0.3 release adds backup and restore, which I described in a previous blog article.

In addition, there are numerous small feature editions and some great bug fixes that raise performance and stability for large-scale deployment. For example, the replicator now goes online in seconds even when there are millions of rows in the history table. This fixes our previous go-online performance which was, er, pretty slow. Thanks to our users in the Continuent forums for helping us to track down this problem as well as several others.

As of the 1.0.3 release we are also starting to offer the enterprise documentation for the open source replicator. I think this provides better documentation all around, not least of all because we can do a better job of maintaining a single copy. Get current replicator documentation here.

Sep 1, 2009

The Future of Database Clustering

Baron Schwartz started a good discussion about MMM use cases that quickly veered into an argument about clustering in general. As Florian Haas put it on his blog, this is not just an issue of DRBD vs. MySQL Replication. Is a database cluster something you cobble together through bits and pieces like MMM? Or is it something integrated that we can really call a cluster? This is the core question that will determine the future of clustering for open source databases.

I have a strong personal interest in this question, because Tungsten clustering, which I designed, is betting that the answer is changing in two fundamental ways. First, the problems that clustering solves are evolving, which will in turn will lead to significant changes in off-the-shelf clusters. Second, for most users the new clusters will be far better than solutions built from a bunch of individual pieces.

To see why, let's start with some history of the people who use open source databases and why they have been interested in clustering over the last decade or so. Open source databases have a wide range of users, but there are a couple of particularly significant groups. Small- to medium-sized business applications like content management systems are a very large segment. Large web facing applications like Facebook or GameSpot are another. Then there are a lot of custom applications that are somewhere in between--too big to fit on a single database dual- or quad-core server but completely satisfied with the processing power of 2 to 4 servers.

For a long time all of these groups of users introduced clusters for two main reasons: ensuring availability and raising performance. Spreading processing across a cluster of smaller commodity machines was a good solution to both requirements and explains the enormous popularity of MySQL Replication as well as many less-than-successful attempts to implement multi-master clustering. However the state of the art has evolved in a big way in the last couple of years.

The reason for change is simple: hardware. Multi-core architectures, cheap DRAM, and flash memory are changing not just the cost of databases but the fundamental assumptions of database computing. Pull out your dog-eared copy of Transaction Processing by Gray and Reuter, and have a look at the 1991 price/performance trade-offs for memory inside the front cover. Then look at any recent graph of DRAM and flash memory prices (like this one). For example, within a couple of years it will be practical to have even relatively large databases on SSDs. Assuming reasonable software support random reads and writes to "disk" will approach main memory speeds. Dirt-cheap disk archives are already spread across the Internet. The old graph of costs down to off-line tape has collapsed.

Moreover, open source databases are also starting to catch up with the hardware. In the MySQL community both MySQL 5.4 and Drizzle are focused on multi-core scaling. PostgreSQL has been working on this problem for years as well. Commercial vendors like Schooner are pushing the boundaries with custom appliances that integrate new hardware better than most users can do it themselves and add substantial database performance improvements to boot.

With better multi-core utilization plus cheap memory and SSDs, the vast majority of users will be able to run applications with adequate performance on a single database host rather than the 2 to 4 nodes of yore. In other words, performance scaling is rapidly becoming a non-issue for a larger and larger group of users. These user don't need infinite performance any more than they need infinite features in a word processing program. What's already there is enough, or will be within the next year or two.

Performance is therefore receding as a motivation for clustering. Meanwhile, here are three needs that will drive database clustering of open source SQL databases over the next few years.
  1. Availability. Keeping databases alive has always been the number one concern for open source database users, even back in the days when hosts and databases were less capable. This is not a guess. I have talked to hundreds of them since early 2006. Moreover most users just don't have the time to cover all the corner cases themselves and want something that just works without a lot of integration and configuration.
  2. Data protection. Losing data is really bad. For most users nirvana is verified, up-to-the-minute copies of data without having to worry a great deal about how it happens. Off-site protection is pretty big too. Talk to any DBA if you don't believe how important this problem is.
  3. Hardware utilization. With the dropping cost of hardware, concerns about up-front hardware investment are becoming somewhat outdated. Operational costs are a different matter. Let's look at power consumption and assume a dual CPU host drawing 250W, which we double to allow for cooling and other overhead. Using recent industrial electricity rates of 13.51 cents per kilowatt/hour in California you get an electric bill of around $600 per year. Electricity is just one part of operational expenses, which add up very quickly. (Thanks to an alert reader for correcting my math in the original post.)
We will continue to see database clusters in the future: in fact lots of them. But off-the-shelf clusters that meet the newer requirements in an efficient and cost-effective way for open source databases are going to look quite different from tightly coupled master/master or shared disk clusters like Postgres-R and RAC. Instead, we will see clusters based for the most part on far more scalable master/slave replication and with features that give them many of the same cluster benefits but cover a wider range of needs. To the extent that other approaches remain viable in the mass market, they will need to cover these needs as well.
  • Simple management and monitoring - The biggest complaint about clustering is that it's complicated. That's a solvable problem or should be once you can work with master/slave methods instead of more complex approaches. You can use group communications to auto-discover and auto-provision databases. You can control failover using simple, configurable policies based on business rules. You can schedule recurring tasks like backups using job management queues. You can have installations that pop up and just work.
  • Fast, flexible replication - Big servers create big update loads and overwhelm single-threaded slaves. We either need parallel database replication or disk-level approaches like the proposed PostgreSQL 8.5 log-streaming/hot standby or DRBD. Synchronous replication is a requirement for many users. Cross-site replication is increasingly common as well. Finally, replication methods will need to be pluggable, because different replication methods have different strengths; replication itself is just one part of the clustering solution, which for the most part is the same regardless of the replication type.
  • Top-to-bottom data protection - Simple backup integration is a good start, but the list of needs is far longer: off-site data storage, automatic data consistency checks, and data repair are on the short list of necessary features. Most clustering and replication frameworks offer little or nothing in this area even though replica provisioning is often closely tied to backups. Yet for many users integrated data protection will be the single biggest benefit of the new clustering approach.
  • Partition management - In the near future most applications will fit on a single database server, but most organizations have multiple applications while ISPs run many thousands of them. There need to be ways to assign specific databases to partitions and then allow applications to locate them transparently. This type of large-scale sharding is the problem that remains when single application databases can run on a single host.
  • Cloud and virtualized operation - In the long run virtualization is the simplest cure for hardware utilization problems--far easier and more transparent than other approaches. A large number of applications now run on virtual machines at ISPs or in cloud environments like Amazon for this reason. To operate in virtual environments, database clusters must be software only, have simple installation, and make very minimal assumptions about resources. Also, they need to support seamless database provisioning to as capacity needs rise and fall, for example adding new VMs or provisioning an existing 4 core VM to a larger 8-core VM with more memory as demand shifts.
  • Transparent application access - Applications need to be able to connect to clusters seamlessly using accustomed APIs and without SQL changes. This is actually easier to do on databases that use simple master/slave or disk block methods rather than more complex clustering implementations. (Case in point: porting existing applications to MySQL Cluster.) Also, the application access needs to be able to handle simple performance-based routing, such as directing reports or backups to a replica database. The performance scaling that most users now need is just not that complicated.
  • Open source - For a variety of reasons closed approaches to clustering are doomed to insignificance in the open source database markets. The base clustering components have to be open source as some of them will depend on extensions of existing open source technology down to the level of storage and database log changes. You also need the feedback loops and distribution that open source provides to create mass-market solutions.
What I have just described is exactly what we are building with Tungsten. Tungsten is aimed at the increasingly large number of applications that can run on a single database. We can help with database performance too, of course, but we recognize that over time other issues will loom larger for most users. The technical properties described above are tractable to implement and we have a number of them already with more on the way in the near future. Master/slave clustering is not just feasible--it works, and works well for a wide range of users.

Still, I don't want anyone to mistake my point. There are many applications for which performance is a very serious problem or whose other needs cannot possibly be met by off-the-shelf software. Facebook and other large sites will continue to use massive, custom-built MySQL clusters as well as non-SQL approaches that push the state of the art for scaling and availability. Analytics and reporting will continue to require ever larger databases with parallel query and automatic partitioning of data as Aster and GreenPlum do. There are specialized applications like Telco provisioning that really do require a tightly coupled cluster and where it's worth the effort to rewrite the application so it works well in such an environment. These are all special cases at the high end of the market.

Mainstream users need something that's a lot simpler and frankly more practical to deliver as an off-the-shelf cluster. Given the choice between combining a number of technologies like MMM, backups of various flavors, cron jobs, Maatkit, etc., a lot of people are just going to choose something that pops up and works. The hardware capability shift and corresponding database improvements are tilting the field to clustering solutions like Tungsten that are practical to implement, cover the real needs of users, and are fully integrated. I'm betting that for a sizable number of users this is the future of database clustering.

p.s., We have had a long summer of work on Tungsten, which is why this blog has not been as active as in some previous months. We are working on getting a full clustering solution out in open source during the week of September 7th. For more information check out full documentation of open source and commercial products here.

Aug 16, 2009

Tungsten Welcomes Your Contributions!

Tungsten clustering and replication has been accessible as open source for almost a year, but it has taken us an amazingly long time to get our contribution policy set up. The dithering ended promptly after Monty Widenius wrote an excellent blog article on dual-licensed software from his experiences at AskMonty.org and previously at MySQL AB. One of the things I especially like is Monty's emphasis on contributor rights. Contributor rights create the sense of reciprocity that makes open source function effectively as a development model. Tungsten is henceforth adopting the AskMonty.org contribution model.

So, if you want to contribute code to Tungsten (I'll describe shortly why you might to do that), you first fill out our handy Code Contributor Agreement and send it to us. The CCA says that you grant us rights to use the code within Tungsten as if we had written it ourselves, which includes selling it in licensed versions of Tungsten. At the same time, you retain your rights to the code and can also use it for any purpose you please including donating it to other projects, selling it, licensing it commercially, etc. It is really a very simple agreement. We also plan to match further protections to contributors as Monty.org adopts them.

After you send us the CCA, you can send us patches. Why would you want to do so? Recall that Tungsten allows you to create database clusters that protect you from data loss, failover quickly to replicas, and scale performance by spreading work around multiple data copies. Here are just a few ways you can help Tungsten do that better:
  • New types of backups. Anybody want to add built-in backups using InnoDB Hot Backup? How about storing backup files on Amazon S3? Note: The second project is very high up my personal list so you'll need to hustle.
  • Replication event filtering. Our forums see regular discussions about filtering (like this one). What about a handy filter to remove specific databases, tables, or columns when moving data from masters to slaves?
  • Support for new databases. Anybody need to replicate from MySQL to Amazon SimpleDB? How about Drizzle or PostgreSQL?
  • Sharding. Anybody need to connect transparently to databases spread across multiple shards? You can do it by extending SQL Router load balancing.
  • Fixing things that are broken.
We will accept any patch that provides beneficial improvements to Tungsten and will make an effort to integrate it quickly. In some cases we may have to rewrite them, which will of course delay integration. The more you work with us the faster that integration will occur.

Speaking of patches, we got an initial contribution to implement replication from MySQL into Drizzle earlier in the summer. Now that we have all the licensing paperwork in order it's time to get that one properly integrated.

Aug 9, 2009

Building the Open Source Hackers Cooperative

It is striking how much harder it is to make money from open source than to write it in the first place. Open source development is a sophisticated and well-understood social activity. However, the economic model is often laughably primitive: "if you build it, they will come." That applies to the question of turning your open source project into a real job. More interestingly, it applies to the question of how to make open source projects as valuable as possible to the largest number of people. In this post I would like to propose an answer to both questions.

To illustrate open source sophistication, just look how easy it has become to start and manage projects. It is almost a cookie-cutter procedure. You pick one of a number of well known licenses, manage the code on SourceForge.net or Launchpad, communicate with the project through skype and mailing lists, and tell the world about it using your blog plus Twitter. Within an afternoon you have set up infrastructure to support efficient collaborative development with team members from Seattle to Singapore. The number of projects itself makes the point: SourceForge.net alone has over 230,000 projects.

Things get much harder if you want to make a decent living from a successful project. It’s not that there is a lack of models to choose. You could form a commercial venture that offers closed source/licensed versions of your open source project. However, as many of us have seen with MySQL AB and other companies, commercial efforts have a tendency to conflict badly with broader community interests or those of other companies that might be useful partners. For this reason, they don’t even necessarily maximize the value of the original commercial effort.

You could of course try to create a foundation like Apache or Eclipse. However, these generally require established software and large commercial backers. The current experience of the fledging Open Database Alliance and other efforts shows that it can be quite complex to create such organizations from scratch even with well-established code and well-known players. This is not a model that can be repeated across a wide number of different projects large and small.

Finally, you could sell consulting and support for your project. The problem with the consulting model is that it is not scalable—in order to make a decent living you have to work on customer problems. The pro-bono work like extending the project or tending to a community tends to fall by the wayside unless you can get a customer to pay for it.

The fact that it is hard to make money off open source is a symptom of a larger problem: we are losing the wider social benefits that for many people are the real point of open source software. This is an economic problem: how do we allow hackers to make a reasonable living on open source projects while maximizing the long-term value of the software to the widest possible number of users? It turns out there’s a reasonable economic model that can do this: cooperatives, which are defined as follows in Wikipedia.

a legal entity owned and democratically controlled [equally] by its members

Cooperatives have existed for centuries in many different forms and have successfully solved problems ranging from providing student housing to delivering consumer goods like sporting equipment on a grand scale. We need a new form of cooperative that I propose to call the Open Source Hackers Cooperative, or Hackers Co-op for short. The Hackers Co-op has three different but co-equal types of members:
  • Hackers, who are the core committers to the project and stewards of the code
  • Sponsors, who supply funding and labor
  • Community members, who use the software, test it, and contribute patches
As we will see, some members are more equal than others, which is why I added brackets in the cooperative definition.

Here are the organizational principles for an Open Source Hackers Cooperative:
  1. The Co-op is a non-profit. It’s not for sale and there is no exit strategy. Like all co-ops, it exists to maximize benefits for its members.
  2. Hackers work directly for the Co-op. Their time is divided between implementing features that interest them, integrating patches as well as fixing bugs reported from the community, and implementing features for sponsors.
  3. Sponsors provide long-term funds and/or labor to the Co-op. Sponsors build businesses on the open source software and kick back a percentage of the business value in return for support and new features. They can also contribute labor for specific co-op tasks. Sponsors need not be for-profit businesses. They could in some cases even be governments or NGOs. The point is that they fund the software based on its value, for example through grants.
  4. Community members provide leverage to the development model. They use the software, provide basic support through forums, and contribute patches for bugs and small features. These activities leverage the hackers who can use community patches and feedback to evolve the software. This is a very efficient development model.
  5. The Co-op is a democracy. Co-op members vote on allocation of hacker resources. The vote is structured to keep a single group of members from hijacking the entire Co-op by dividing hacker time into a pie with allocations for the interests of each member type. Sponsors vote for their section of the pie using a vote weighted by their relative funding contributions--sponsors are not equal to encourage competition to commit more funds. Hackers effectively vote their portion by doing whatever they want with the time they get for personal projects. Community members vote through surveys or some other reasonable mechanism.
  6. The Co-op has elected officers. There is a chief economist who is in charge of the business model and plans finances, contracts with sponsors, arranges employment contracts, etc. There is also a chief technologist who ensures project infrastructure and moderates technical discussions. Co-op officers are elected or at least approved by the members at large.
  7. The Co-op pays dividends. Some open source projects are quite valuable, so a well-run co-op could easily become very profitable. The excess profits are distributed to hackers in the form of retirement and other benefits, to sponsors in the form of cash rebates, and to community members in the form of conferences, hiring of new hackers to work on features, improved infrastructure, etc.
You can try to run through a number of scenarios for the hacker co-operative to see how well the model holds together. All you really need is software with enough intrinsic value that it can sustain an active, technically aware community and where sponsors are motivated to build businesses on it but do not need to own the engineering. It is helpful to have community members be programmers, but you can also design the software to allow even non-technical users to contribute effectively.

Such conditions hold for a variety of system software like database management systems, app servers, and communications packages. With a little thought they could apply broadly to user applications like accounting systems or voting software, which so far are relatively rare in open source. The Co-op is model is quite stable, because it aligns interests in such a way that everyone does better if they stick together.
  1. Hackers earn a stable, comfortable living and public recognition for working on software that they enjoy. “Comfortable” in this context means salary and benefits equivalent to a typical EU country like Germany or Finland, which are the gold standard for employee compensation. This works economically because hackers are (a) very productive to begin with and (b) become more so by leveraging an active user community. Hackers are motivated to produce because the more viable the software is, the better the co-op does, and the more benefits they receive.
  2. Sponsors get features that they need using a productive open source hacking development model. This is a replacement for models like trying to take the software private and farming it out to low-cost offshore locations, which experience has shown to be badly broken on a number of levels. More important, sponsors get stability in the sense that the software cannot be taken over by hostile corporate interests and is supported over the long term, which lowers the risk of building businesses on it. They are motivated to contribute more in order to vote more resources for tasks that help their businesses.
  3. Community members get software that is continuously supported and evolving rapidly to add features they need. They get assurance that valuable patches will be integrated. They are motivated to use, test, and develop patches for the software as that further increases its value to them and leads to recognition as authorities on the project.
The Open Source Hackers Cooperative can be structured to create a number of virtuous feedback loops that will support and extend the software over a long period of time. The Hackers Co-op model could be standardized and even backed up with software as well as cookie-cutter legal documents so that it becomes very simple to set up and manage.

I don't know if there are projects already following this model. Selena Deckelman already used the term "Hacker's Cooperative" though from a somewhat different perspective. If you know of anyone who has worked through the cooperative economics fully I would appreciate hearing from you. Meanwhile, this is not a theoretical question for at least a couple of reasons.

First, I work for a for-profit company that is willing to sponsor projects in exactly the way described in this article. We are looking for (a) control, (b) stability, and (c) a development model that is cheaper than doing it ourselves. If such cooperatives existed we would be interested in them. I'm sure we are not alone, because this is how all for-profit businesses tend to think. The cooperative is a viable model because of the way it reinforces and maximizes member interests.

Second, the Hacker Co-op leads to some interesting questions about how to design software that promotes the leveraging effect of an open source community. It is possible to design even application software so the your users don't all have to be coders in order to contribute. If you want to create something really big out of your open source project (or just have a nice job) it helps to think these issues through before you write much code. Maybe it's the first thing on the list after you set up your new project.

Jul 20, 2009

Backups, Backups, Backups (and Restore)

Tungsten Replicator has built-in backup and restore for MySQL! I checked in the final touches over the weekend. Here's how to run a backup on a database and store it so you can restore it later. If you leave off options, we use the default back-up procedure and storage that you select when setting up replication.
trepctl backup [-backup agent] [-storage agent] [-limit timeout]
And here's how to restore. If you leave off the options, we find the latest backup in your default storage and load that.
trepctl restore [-uri backup_uri] [-limit timeout]
That's the syntax. Now here's what happens behind the scenes. First Tungsten Replicator has a new BackupAgent plug-in that implements backup procedures. We have a backup agent for each the following types of backups for MySQL:
  • Mysqldump - Probably least useful but easy to set up.
  • LVM snapshot to tar.gz - Scalable with miminal database down time. Features are similar to Lenz Grimmer's excellent mylvmbackup script.
  • Script dump - Integrate your own script for backup and restore. The script has to follow some very simple conventions, for which there is an example. You can integrate practically any backup/restore package this way.
Speaking as an unbiased user of the system I love the LVM snapshots. LVM snapshot is overall the most convenient way to do backups for a wide range of databases, not just MySQL, though I have to admit I have not used either InnoDB Hot Backup or Percona's XtraBackup. I guess now would be a good time to try them since we can integrate them through the script dump mechanism.

Meanwhile, mysqldump is far and away my least favorite mechanism, not least of all because of this unusually heinous bug, which completely breaks the mysqldump --all-databases command. It's still not fixed as of at least MySQL 5.1.34. (How on earth did this one get in and why didn't it get corrected instantly?)

Second, there is a new StorageAgent plug-in to handle storing and retrieving backup files. There is one of these for each type of storage. Currently the choice is limited to shared disk but I expect we'll have an Amazon S3 storage plug-in in the near feature. That's just too useful to pass up for very long...Among other things we ourselves run all our company services on Amazon and I would like to use it for our own backups.

If you want to use the new backup capability you can either build Tungsten Replicator yourself using the instructions on our getting started page or wait until the next binary build comes out in a couple of weeks. Backup and restore are documented here in the Tungsten documentation.

We went through a lot of effort to make the backup and restore processes as simple as possible. It's down to one keyword for each operation, so I don't think it's going to get much simpler. Please try it out and provide your feedback. I love bug reports and want to hear what you think.

Jun 20, 2009

When SANs Go Bad

They sometimes go bad in completely unpredictable ways. Here's a problem I have now seen twice in production situations. A host boots up nicely and mounts file systems from the SAN. At some point a SAN switch (e.g., through a Fibrechannel controller) fails in such a way that the SAN goes away but the file system still appears visible to applications.

This kind of problem is an example of a Byzantine fault where a system does not fail cleanly but instead starts to behave in a completely arbitrary manner. It seems that you can get into a state where the in-memory representation of the file system inodes is intact but the underlying storage is non-responsive. The non-responsive file system in turn can make operating system processes go a little crazy. They continue to operate but show bizarre failures or hang. The result is problems that may not be diagnosed or even detected for hours.

What to do about this type of failure? Here are some ideas.
  1. Be careful what you put on the SAN. Log files and other local data should not go onto the SAN. Use local files with syslog instead. Think about it: your application is sick and trying to write a log message to tell you about it on a non-responsive file system. In fact, if you have a robust scale-out architecture, don't use a SAN at all. Use database replication and/or DRBD instead to protect your data.
  2. Test the SAN configuration carefully, especially failover scenarios. What happens when the host fails from access one path to another? What happens when another host picks up the LUN from a "failed" host? Do you have fencing properly enabled?
  3. Actively look for SAN failures. Write test files to each mounted file system and read them back as part of your regular monitoring. That way you know that the file system is fully "live."
The last idea gets at a core issue with SAN failures--they are rare, so it's not the first thing people think of when there is a problem. The first time this happened on one of my systems it was around 4am in the morning. It took a really long time to figure out what was going on. We didn't exactly feel like geniuses when we finally checked the file system.

SANs are great technology, but there is an increasingly large "literature" of SAN failures on the net, such as this overview from Arjen Lentz and this example of a typical failure. You need to design mission-critical systems with SAN failures in mind. Otherwise you may want to consider avoiding SAN use entirely.

Jun 17, 2009

Lots of New Tungsten Builds--Get 'Em While They're Hot

There is a raft of new Tungsten open source builds available for your replication and clustering pleasure. Over the last couple of days we uploaded new binary builds for Tungsten Replicator, Tungsten Connector, Tungsten Monitor, and Tungsten SQL Router. These contain the features described in my previous blog article, including even more bug fixes (36 on Tungsten Replicator alone) than I had expected as we had a debugging fest over the last few days that knocked off a bunch of issues. You can pick up the builds on the Tungsten download page. Docs are posted on the Tungsten wiki.

If you have questions, see problems with the builds, or just want to tell us how great they are, please post on the community forums or on the tungsten-discuss mailing list.

Our next open source release will be the Tungsten Manager, which is long overdue to join the family of regular builds. We are doing some polishing work on the state machine processing and group communications, after which the Manager will go out along with documentation on how to use it.

Jun 10, 2009

Tungsten Development News - Lots of New Features!

Articles on this blog have been pretty scanty of late for a simple reason--we have been 100% heads-down in Tungsten code since the recent MySQL Conference. The result has been a number of excellent improvements that are already in Subversion and will appear as open source builds over the next couple of weeks.

Tungsten has a simple goal: create highly available, performant database clusters using unaltered commodity databases that are simple to manage and look as close to a single database as possible for applications. Over the last two months we completed the integration of individual Tungsten components necessary to make this happen.

Full integration is a big step forward and finally gets us to the ease-of-use we were seeking. Imagine you want to add a slave database to the cluster. There's no management procedure any more--you just turn it on. Managers in the cluster automatically detect the new slave and add it as a data source. That's the way we want every component to work from top to bottom--either on or off, end of story. It was really nice to see it start to work a few weeks ago.

We are now ready to start pushing builds out to the Tungsten SourceForge.net project. Here is a selection of the features:

Tungsten Replicator -- API support for seamless failover, certification on Solaris, better Windows support, testing against MariaDB, and many other improvements like flush events for seamless failover. There are already 26 fixes in JIRA and I expect more before we post the build.

Tungsten SQL Router -- Pluggable load balancing with session consistency support. Session consistency means users see their own writes but can read changes by other users from a slave. It works using a single database connection, which is an important step toward eliminating application changes in order to scale on master/slave clusters.

Tungsten Manager -- Directory-based management model that allows you to view and manage both JMX-enabled services as well as regular operating system processes that follow the familiar LSB pattern of 'service name start/stop/restart'. The managers use group communications and can broadcast commands across multiple hosts, handle failures, and automatically detect new services as they come online.

Tungsten Monitor -- Improved monitoring of replicator status including slave latency, which is necessary to guide SQL Router load balancing features like session consistency.

There's a lot going on with Tungsten right now, in fact far too many things to mention even in a longish post like this one. One of my current code projects is to implement built-in backup and restore for Tungsten Replicator. I am planning on supporting slave auto-provisioning: a new slave comes up, restores the latest backup, and starts replicating. All you have to do is turn the slave on. (More of that on/off stuff--it's kind of an obsession for us at this point.)

Integrating backup/restore is the final big feature for Tungsten Replicator 1.0--after this we plan to turn attention to parallel replication and are already discussing how this might work with several potential customers. Feel free to contact me through this blog or better yet post on the community forums parallel replication topic to join the conversation.

One final bit of news, we are starting to work seriously on Tungsten PostgreSQL integration thanks to a new partnership between Continuent and 2nd Quadrant. This work is commercially focused for now but will lead to additional open source features in the not too distant future. Keep watching this space... :)

p.s., We also had a nice refit on the community website. Check it out.

May 13, 2009

Continuent is Joining the Open Database Alliance

Maybe it's a sense of shared adversity, but recent MySQL meetings have had this "we're all in it together" feeling. Today Monty Widenius announced the Open Database Alliance: the community feeling is starting to look like a real business entity.

The Open Database Alliance is appealing at multiple levels. First, it's good for the companies that join--a steadier flow of business and ability to offer bigger solutions by combining with partners. Second, it's good for users: first rate software, services, and support without vendor lock-in. Third, the parties are going to be excellent.

Sometimes you have to think hard before signing up for partnerships. But this one looks like a no-brainer. Count us in!

p.s., Stay tuned for Tungsten certification against MariaDB. If you have tried the Tungsten Replicator already with MariaDB, please post your experiences on the Tungsten Replicator Forum.

Apr 29, 2009

Overcoming MySQL-to-Oracle Culture Shock

Migrating from Oracle to MySQL is not easy. A few weeks ago Baron Schwartz summarized the culture shock in 50 things to know before migrating Oracle to MySQL. It's a great article but as you read through the comments it's easy to forget that culture shock can run the other way.

For example, try building horizontally scaled systems. Oracle has excellent "small" database editions like SE and SE1. However, they lack built-in replication of the type provided by MySQL. Even simple and effective deployment patterns like master-master replication do not exist. The usual approach in the Oracle world is to use RAC + Enterprise Edition features like Streams and DataGuard. That's great for large enterprises, but it's not a good method for smaller businesses and start-ups.

We have been working for some time on a better answer. We are now opening up for general beta testing a commercial extension to our Tungsten Replicator to address replication for Oracle. The new extension adds a process to read Oracle redo logs but otherwise fits neatly into the overall replicator design. It works on Linux Oracle Editions from XE to EE.

Implementing Oracle replication has been a long and arduous effort. Oracle has a huge feature set and a correspondingly elaborate log. It is far more challenging to read than the MySQL binlog. We currently handle basic data types as well as DDL statements. Large object types and XML are on the way. The implementation is a step-by-step process and one that needs to be guided by close work with customers.

On the other hand, Oracle has the features to make advanced replication really work. Most Oracle DBAs know about supplemental logging, which among other things adds keys to data so you can identify updated rows unambiguously. However, there are also far more interesting features like flashback queries, which allow you to see the state of the database at earlier points in time. It makes generating SQL from log entries much easier because we can see the state of system catalogs as of the exact time each update occurred. Flashback query was not on Baron's list or the comments that followed, but it is one of the truly great features of Oracle databases.

If you are interested in alternatives for existing Oracle replication, I would like to encourage you to contact us at Continuent. We are looking for customers who want to work closely with us to build out economical Oracle replication support. MySQL has shown over the years the power of lightweight, simple-to-use replication. It's going to be pleasure to make it work on Oracle.

Finally, there needs to be a list of 50 things you need to know about migrating from MySQL to Oracle. Open source databases are popular not just because they offer free downloads. Simplicity of operation, replication, and support for incremental scale-out patterns are among the strengths of MySQL. It takes some thought and effort to translate them into Oracle.

p.s., Since I wrote this article Robert Treat obligingly started the Oracle to MySQL 50 things list. Several people chipped in to get it up to 50.

Apr 26, 2009

Tungsten Replicator Build 1.0.1 Available

A new build of the Tungsten Replicator is now available. As you probably know from reading this blog Tungsten Replicator provides advanced open source replication for MySQL. There is also a commercial extension to support Oracle. Tungsten Replicator 1.0.1 includes a number of important improvements.
  • Much better performance -- Current benchmark results show throughput of up to 650 inserts per second using a single slave apply thread. We are well on the way to our goal of 1000 inserts per second.
  • Simplified management -- Replicator administration has been largely reduced to two commands: online and offline. There is an option to go online automatically at startup, which further simplifies operation and makes it easy for the replicator to operate as a service.
  • Easy-to-use consistency checks. You just type trepctl check database.tablename.
  • Lots of bug fixes and small improvements. Check the release notes in file README.UPGRADE.
We also have some great features on tap for the next couple of releases. An integrated flush operation to simplify failover, built-in backup/restore, and parallel replication are just a few. I'm particularly excited about parallel replication, as it has the potential to boost throughput into the 1000s of updates per second and to support sharding as well. You can track development progress on the Tungsten Replicator JIRA list.

For more information check out the Tungsten Replicator community pages. You can grab binary downloads or look at source code on the Tungsten project on SourceForge.net. The 1.0.1 build is a considerable improvement over the previous beta releases and I hope you will try it out. We look forward to your feedback.

Apr 24, 2009

MySQL Conference Impressions and Slides

"Interesting" was probably the most overused word at the MySQL Conference that just ended yesterday. Everyone is waiting to find out more about the Oracle acquisition of Sun. As a community we need to find some synonyms or things will become very tiresome. Personally I vote for intriguing.

Here are slides for my presentations at the MySQL Conference as well as the parallel Percona Performance is Everything Conference. Thanks to everyone to attended as well as to the organizers. You had wonderful ideas and suggestions.


Finally, some short impressions on the conference. The two most intriguing trends were advances in hardware, especially memory and SSDs, as well as clouds. These are altering the economics of computer in fundamental ways: business costs as well as performance trade-offs in many of the basic algorithms for data management. Combined with the ferment of projects spinning off from MySQL and others, they are fueling an incredible burst of creative thinking about databases.

By comparison, Oracle consuming Sun is merely interesting.

Apr 7, 2009

Tungsten Replicator at the 2009 MySQL UC

It's good to get out of the office and meet people. This year I'll be doing several presentations at the 2009 MySQL Conference and adjacent Percona Performance Conference in Santa Clara. These include among others a talk on Tungsten Replicator on Thursday April 23 at 10:50.

In case you don't read this blog regularly, Tungsten Replicator provides advanced open source replication for MySQL. The term "advanced" is not an exaggeration. I'll be covering how to solve practical problems including the following:
  • How to install and configure Tungsten Replicator in 5 minutes or less.
  • How to set up seamless slave promotion after a master fails.
  • How to prevent loss of data from administrative errors using time delay replication.
  • How to identify data inconsistencies using built-in checksums.
  • How to move data from higher to lower MySQL versions.
  • How to reduce slave latency by dropping DDL commands from replicator events.
  • How to recover quickly and easily from statements that fail to replicate properly.
  • How to replicate from MySQL to Oracle as well as things that are not even databases.
There will be some short demos along the way. I hope you'll join me for a fun and informative talk. And please bring interesting replication problems with you!

p.s., For extra credit download Tungsten Replicator and try it out before the talk. I look forward to your questions and comments.

Apr 3, 2009

Contemplating the MySQL Diaspora

The break-up of the MySQL codeline is finally attracting attention from polite society outside the open source database community. This attention has been accompanied by much speculation, some of it informed and some not so informed about what is driving the split. Since everyone else is chipping in theories about how and why, here's mine:

It's the economy, stupid.

First, MySQL AB seeded a huge market for the MySQL database. MySQL 5.1 for all the controversy hit a million downloads in a little over a month. This is open source success on a grand scale that has created a huge pent-up demand for bug fixes as well as new features from a wide variety of users. Leaving aside consideration of Sun/MySQL misteps, it's somewhat hard to see how Sun would meet the competing market demands and still keep the database simple enough for everyone to use easily.

Second, the core MySQL server code is licensed under GPL V2, so anyone can take a copy, modify it, and create their own distribution. There is abundant proof from companies like Percona and many others that you can create viable businesses by offering services on these distributions without owning the code. That's critical because it means alternative branches are economically viable.

Third, pure open source projects can innovate very rapidly because they can accept contributions from the entire community. However, not everyone can or will merge the same patches--the Google semi-synchronous patch is a good example of a very useful patch that is also non-trivial to merge. So the split between branches is likely to increase over time depending on which part of the MySQL market each project chooses to serve. That's not even considering more-or-less full breaks like Drizzle.

OK, maybe it's cheating to steal catchy lines from James Carville, but this looks like simple economics at work. There is a huge market, plenty of room for businesses that don't own the code, and lots of opportunities for alternative versions.

There are arguments from people like Jeremy Zawodny that MySQL will hold together like the Linux kernel with different distributions around a common core. Once you get a lot of participants that kind of standardization is tough to manage. In fact one of the real strengths of open source development is that it does not follow standards. Sun no longer really controls the core of MySQL, and there are a lot of motivations to change it.

In the end what's happening to MySQL looks a bit like the fracturing of Unix in the 1980s--the BSD and System V variants quickly evolved into a separate version for each hardware vendor. There were various attempts to standardize, but they weren't especially successful. Instead, Intel undermined the proprietary chip model which in turn made the other hardware vendors less viable. Now we all run Linux or Windows.

One final thing--what does this mean for users? I think Jeremy has it right that at some point it does not matter. We are at the beginning of an era of multiple viable choices for open source databases. Some users will choose one of the new MySQL builds. Some users will jump ship to PostgreSQL. However, people being what they are, a lot of users will just stick with the version that they are currently running. In spite of other misfortunes that should be at least some consolation for Sun.

Mar 29, 2009

Implementing Relaxed Consistency Database Clusters with Tungsten SQL Router

In December 2007 Werner Vogels posted a blog article entitled Eventual Consistency, since updated with a new article entitled Eventually Consistent - Revisited. In a nutshell it described how to scale databases horizontally across nodes by systematically trading off availability, strict data consistency, and partition resilience as defined by the CAP theorem. According to CAP, you can only have two of three of these properties at any one time. The route to highly available and performant databases, according to Vogels, is eventual consistency in which distributed database contents at some point converge to a single value but at any given time may be inconsistent across replicas. This is the idea behind databases like SimpleDB.

I read the original blog article at about 2am on a Sunday morning. It was like a thunderclap. Like transactions and state machines, CAP was one of those ideas that provide instant clarity to a large class of problems, in this case related to database availability and performance. But it also raised an immediate question: can't we apply CAP systematically on conventional SQL databases? That way you don't have to throw away the relational database baby away with the strict consistency bathwater.

This is not an implausible idea. Most database engines have built-in master/slave replication to at least some degree, so there's no problem distributing data. (Shameless plug: If you don't like what your database provides, try ours.) The real problem is that you need to change how applications access the database. They need to implement CAP trade-offs in a consistent and easily understandable way. That's where the Tungsten SQL Router comes in.

Tungsten SQL Router is a thin Java JDBC driver wrapper that enhances conventional JDBC drivers to implement database session semantics based on CAP. SQL Router adds a "quality of service" or qos to each database session. Being programmers we had to invent our own terms, so here are the initial qos values.
  • RW_STRICT -- This session is used for writes; all data are strictly consistent, i.e., appear to all applications on RW_STRICT sessions as soon as they are written. In CAP terms you are picking data consistency + partition tolerance. (Vogel's article uses the term "causal consistency.")
  • RO_RELAXED -- This session is used for reads; data consistency is relaxed, i.e., represents data at an earlier point in time. In CAP terms you are picking availability + partition tolerance. (Vogel's article uses the term "monotonic reads.")
Clients can request the preferred quality of service whenever they create a new transaction. The SQL Router then takes care of connecting to a database that meets the semantics. Here's a typical Tungsten SQL Router URL (i.e., connection string) that routes connections to a MySQL master database:
jdbc:t-router://myservice/mydb?qos=RW_STRICT&createDatabaseIfNotExist=true
The SQL Router only steps in to select connections and to break them as necessary when databases go offline. It has almost no performance impact on Java applications, because we don't touch result sets and there are no proxy hops. That's an important requirement to achieve maximal application transparency.

Making CAP work properly for conventional applications is not entirely straightforward, which is one of the main reasons why you don't want the logic to be a part of your application. Here are some of the key features that Tungsten SQL Router provides.
  • Distributed database services. SQL Router groups databases into "services." Each database in the service is defined using a simple resource file that defines its name, location, and role (e.g., master or slave).
  • Remote management interfaces. Databases fail or go offline for maintenance and cluster resources change over time. Strict consistency connections in fact explicitly choose to fail when the database is no longer available rather than access old data, so you must handle failover easily. Tungsten SQL Router has a built-in JMX administrative interface that allows you to promote a slave database to become a master, take databases offline, bring them back online, etc., without disturbing or even necessarily notifying applications.
  • Support for non-Java applications. The world is a diverse place and not every application is written in Java. You can embed the SQL Router in the Tungsten Connector, a proxy that allows native MySQL and PostgreSQL applications (Perl, PHP, Ruby, name your favorite...) to connect without library changes or even changing connection strings.
  • Integration with connection pools. SQL Router provides call-backs that can be used to let application connection pools know when to give up applications. A little cooperation here makes things like failover work much more smoothly.
There are other features but it's probably simplest if you visit the Tungsten Project on SourceForge.net, read the wiki documentation, download a copy, and try it out for yourself. There's general information about Tungsten on our community website. Note: our community site is due for an update shortly to add more information about SQL Router and other new projects we are releasing. For the next few days please check out SourceForge.net.

Finally, here's an interesting thought that shows the power of applying CAP semantics in SQL applications. So far we have been talking about database replicas. However, SQL Router relaxed consistency sessions could just as easily read query data from a distributed cache like memcached. An application that specifies qos=RO_RELAXED on a connection is saying it will accept possibly out-of-date data in return for availability. Semantically there is no difference between a cache and a database replica--you can substitute any conforming storage implementation. Exploiting that idea pretty much defines our long-term roadmap for the SQL Router.

In summary SQL Router provides a simple model so that applications can choose whether they want availability or full data consistency while ensuring basic properties like partition resilience. These semantics are key to extending the scale-out database design model to increasingly large clusters, and equally important, to make that model easy to use for clusters of all sizes. Tungsten SQL Router is a work in progress, but the idea of using CAP semantics really seems to have legs. I hope you will try it out and let us know what you think.

p.s., I would like thank David Van Couvering for pointing out Werner Vogel's article in his blog as well as my colleague Ed Archibald for getting the SQL Router off the ground. Nice working with you guys. :)

Mar 17, 2009

Announcing Tungsten Monitor

Yesterday I posted about our release of Tungsten FSM, a package for building state machines in Java. Today we are publishing Tungsten Monitor, the second of four new Tungsten packages we are releasing on SourceForge.net during the month of March. Tungsten Monitor offers fast, pluggable resource monitoring for database clusters. We have a couple of specific monitor types already implemented; you can add new ones with minimal Java code.

Tungsten Monitor is focused on a single problem: providing continuous notifications of the state of resources in a cluster. Each monitor executes a simple process loop to check the resource and broadcast the resulting status. The status information answers the following questions:
  • What type of resource is this?
  • What is its name?
  • Is the resource up or down?
  • If a replica database, what is the latency compared to a master?
Tungsten Monitor is exceedingly simple--the current version has a total of 13 Java classes. (Who says Java needs to be complex?) Despite this simplicity the monitor has at least three very interesting features.

First, there's no centralized agent framework. You just start a monitor on a replicator, database, etc., and it starts to generate notifications. The off-the-shelf configuration monitors Tungsten Replicator--to get started you unpack code, copy one file, and start. That's it. The monitor figures out everything else automatically.

Second, Tungsten Monitor broadcasts notifications using group communications, which is a protocol for reliable, broadcast messaging between a set of processes known as a "group." Manager processes can tell that a new resource is available because its monitor joins the group and starts to broadcast notifications that the resource is up. If the new resource is a database, this is enough for a smart proxying service to start sending traffic to it automatically without any further management intervention.

Third, you can add new resource checks and notification methods yourself. With just 13 Java classes to start, you obviously won't be getting a lot off the shelf. :) We will add MySQL, PostgreSQL, and Oracle monitors shortly but any competent Java programmer could do the same in an hour or two without waiting for us.

In the meantime, you can download Tungsten Monitor source and binary builds from SourceForge.net. Also, here's a wiki article that describes basic installation and configuration. Take it for a spin, especially if you are using Tungsten Replicator already.

Interested? I hope so. Tungsten Monitor integrates with another of our March projects to route SQL connections automatically to available databases in a master/slave or master/master cluster. Stay tuned for more on that next week. Things are just starting to get interesting around here.

p.s., I did a bit of coding on the monitor but it really belongs to Gillles Rayrat, a colleague of mine in Grenoble. Nice job, Gilles!

Mar 15, 2009

Announcing Tungsten Finite State Machine Library

It is my pleasure to announce publication of the Tungsten Finite State Machine Library (Tungsten FSM) as an open source package hosted on SourceForge.net. This is the first of four new components for database clustering and replication that we will be releasing into open source during the month of March.

Tungsten FSM is a Java library for implementing in-memory state machines. It is lightweight and very simple to use. Each Tungsten FSM state machine tracks the state of a particular instance (i.e., object) in Java. The model provides standard features of state machines including states, transitions, actions, and error handling. Source and binary downloads are available here--there is also wiki documentation that explains how to use Tungsten FSM with code examples.

Here's a little background on the Tungsten FSM library and how it arose. State machines let you model the behavior of complex systems including system states and input/outputs in a simple and understandable way. They are as important for distributed systems as transactions are for databases. Among other things, state machines enable you to ensure that network services behave deterministically when presented with multiple, concurrent inputs. That determinism in turn is a fundamental requirement for organizing groups of processes into database clusters, which is what we do at Continuent.

When we embarked on development of Tungsten Replicator, it was immediately apparent we would need to implement state machines. Most of the available libraries, such JBoss jBPM, were heavyweight or otherwise difficult to embed. We therefore wrote a lightweight library of our own, adding features as we ran into practical implementation issues like dealing with errors. Tungsten FSM helps us organize code in services--for instance, it has been very easy to add new administrative operations to the Tungsten Replicator simply by adding more state transitions.

However, you don't have to take my word for it. Try out Tungsten FSM and let me know how you like it. For more information on Tungsten in general, please visit our community pages.

p.s., I'm posting this article to aggregators for MySQL and PostgreSQL even though it's not directly related to databases per se. State machines turn out to be essential to database clustering and management, as you'll see in some of the succeeding articles on this blog.

Feb 3, 2009

Thoughts on Replication from 2 Feb SFO MeetUp (slides too)

Last night I gave a presentation on Tungsten Replicator to the SFO MySQL MeetUp. It was really fun--lots of excellent questions and ideas from an audience that knows MySQL in and out. Here are slides from the talk from our trusty S3 document repository.

There were many things to think about on the BART ride home to Berkeley but here are a couple that really struck me, in part because I think we can do something about them.

First, database replication needs to expose as much monitoring data as possible. The kind of data you get or can infer from SHOW SLAVE STATUS is just the beginning. This came through really strongly from people like Erin who are using load-balancers and other automated techniques for distributing load to replicas using decision criteria like availability and latency of slaves. We have heard this from people like Peter Zaitsev as well--it's starting to sink in. :)

Second, it's critical to think through the database failover problem fully. There seem to be two axes to this problem. The first is what you might call call a vertical axis where you think about a single component in the system--here the database. You have to cover not just swapping replication flow but also enabling/disabling writes, triggers, and batch jobs, as well as application specific tasks. (Lots of good comments from the audience here as well.)

The other failover "axis" is horizontal where you think about the ensemble of databases, proxies, applications, and utilities that make up the cluster. The issues to cover here including sending commands in parallel, ensuring everyone receives them, and dealing with a wide range of possible failure conditions ranging from network partitions to failure of individual OS-level commands. We plan to unveil a solution shortly in the form of the Tungsten Manager, which uses group communications to attack this problem. I can't wait to get feedback on that.

p.s., Baron Schwartz yesterday posted a very interesting article on cache issues with failover on InnoDB. Just another example of how broad the failover problem really is.

Feb 1, 2009

Simple HA with PostgreSQL Point-In-Time Recovery

Point-in-time recovery or PITR is one of my favorite PostgreSQL features. Most database servers have a transaction log for recovery. (MySQL is one of the few exceptions; more on that shortly.) However, PostgreSQL PITR allows and even encourages users to manipulate the logs to perform some very useful tasks:

* Creating a bit-for-bit backup of a server without halting
* Restoring a server to a specific point in time
* Creating a warm standby server

The third task is especially interesting because it's so common. One of the most pronounced trends in computing is the decreasing cost of computing power through Moore's law improvements and virtualization. Standby solutions nowadays look like a good investment compared to the cost having a server down or worse yet losing a lot of data.

PostgreSQL PITR makes warm standby relatively easy to set up. You enable log archiving, which copies the completed WAL files to an archive directory, then transfer those files to a host where another PostgreSQL server is in continuous recovery mode. This is easier to understand with a picture, so here it is.


Two key points: first, warm standby is based on log files, so as long as you do not mess up the setup, your recovered data will always be consistent. Second, failover is really fast, because your standby server is in continuous recovery mode and generally will be on the last file or two, hence almost ready to go at all times.

There are some details but they are not very hard. The first step is to enable archiving. This copies log buffers into an archive directory from the standard pg_xlogs directory. Archiving is easy to enable in postgresql.conf, as shown by the following example:
archive_mode = on               # enable archiving
archive_command = 'cp -v %p /data/pgsql/archives/%f'
archive_timeout = 300 # timeout to close buffers
Next you need to transfer the files as they appear in the archive directory. For that you can use rsync, as shown below. Put it in a cron job on the standby so that files transfer automatically between hosts every few minutes.
rsync -avz --remove-sent-files prod1:/data/pgsql/archives/ \
/data/pgsql/archives/ > /dev/null
Finally, you copy the files into the pg_xlogs directory of a PostgreSQL server that is running recovery. This is the most complicated part. To make a long story short, you set up a carbon copy of server data on the standby and then create a file called recovery.conf with a command to copy the files from the archive directory to pg_xlogs. That command is usually pg_standby, a handy little program written by Simon Riggs of 2nd Quadrant. Here's a recovery.conf file with typical pg_standby configuration.
restore_command = '/usr/local/pgsql/bin/pg_standby
-l -d -s 2 -t /tmp/pgsql.trigger /data/pgsql/archives
%f %p %r 2>>standby.log'
Basically pg_standby blocks until it either finds the next log file requested by the server or sees a file named /tmp/pgsql.trigger. Then recovery is done and your standby server pops up and is open for business.

The PostgreSQL docs provide a good overview of warm standby. However, you need to get the details exactly right for recovery to work without losing data. I omitted these from the description but wrote a script called standby.sh to illustrate. It avoids some common gotchas in setup such as not cleaning up directories on the standby server or leaving the pg_standby trigger file lying around for unsuspecting programs to trip over. I'm sure there are better ways but this works for me.

Warm standby is thus a standard HA technique used in many PostgreSQL installations. It should be used in many more. PostgreSQL 8.4 is adding "hot standby", which allows users to open up the standby server for reads, similar to Oracle Active Data Guard, a pricey but very flexible availability solution for Oracle.

There's also an effort afoot to add synchronous replication of log changes to the standby, but it looks as if that will be delayed until PostgreSQL 8.5. If you really want to protect against data loss you can replicate the pg_logs directory using DRBD, then mount and copy any missing log files during failover.

So as you can see, PostgreSQL has a really nice solution to warm standby that is fast and ensures data consistency with options to eliminate data loss almost completely. Can MySQL do this? The answer is "kind of." The problem is that MySQL servers do not have a transaction log. Logging is the responsibility of individual storage engines. Some, like MyISAM, do not even have a log. The closest equivalent in MySQL is the binlog, which is used for replication.

In fact, a lot of people use MySQL master/master replication to implement warm standby. You can in theory reduce data loss by copying master binlogs over to another host and having that server read from them to distributed "missed" events to the slave(s). This is complicated and there are some two-phase commit problems with MySQL binlogs that mean sooner or later there will be data inconsistencies. Most MySQL users either just accept the chance of data loss or use DRBD to sync the entire master over to another host.

On balance, the PostgreSQL warm standby solution using PITR compares very favorably to MySQL as well as many commercial databases. Two things that would help are better management and proxies to help redirect reads to hot standby servers, once these are available. Our development team is working on both of these. As usual, stay tuned...

P.S., In response to some good comments from readers the comparison between MySQL and PostgreSQL replication deserves a fuller treatment in a separate article. PostgreSQL gives you a bit-for-bit copy of data, whereas MySQL makes a logical copy using SQL. PostgreSQL warm standby is therefore transparent to applications, whereas MySQL replication has some prerequisites for SQL schema and operations that applications must observe.

Jan 23, 2009

Tungsten Replicator Presentation on 2 Feb in San Francisco

If you want to learn about Tungsten Replicator first hand, I'll be doing a presentation at the February 2nd MySQL Meet-up in San Francisco. With luck I'll be able to demo not only the replicator but also some cool Amazon/RightScale integration we will be introducing in February. Teaser: It's possible to set up database clusters faster in Amazon than on a laptop.

Advance thanks to Mike Tougeron and Rich Nigra for the invite. For people suffering from obsessive curiosity about MySQL replication or database clustering in general, this talk is for you. Hope to see lots of you at the meet-up.

Jan 21, 2009

Encrypting Replicated Data

Open source projects have a great way of pulling in original ideas. Take encrypting replicated SQL: it's a big deal when you are are not sure about network security. This problem comes up very quickly when you transmit data over WAN links or in shared/cloud environments.

I have been procrastinating on implementation of encryption in the Tungsten Replicator because it felt as if we were going to have to do some surgery for it to work correctly. (Another hypothesis is that I'm lazy.) However, this morning I was talking via Skype to Mark Stenbäck, an engineer from Finland whom we met through a recent conference. Mark had a great idea. Why not just write an event filter? An excellent thought...Here's how it could work.

Filters transform or alter SQL events in the Tungsten Replicator. Let's say that you are replicating using MySQL 5.0. All SQL events contain SQL statements, which look like the following in our transaction history storage. The field 'SQL(0)' is the statement. Everything else is metadata.
SEQ# = 97
- TIME = 2009-01-12 10:01:46.0
- EVENTID = 000282:20633
- SOURCEID = centos5d
- STATUS = 2
- SCHEMA = mats
- SQL(0) = DROP PROCEDURE IF EXISTS updateRow
The SQL statement is contained in a simple data structure that is easy to manipulate within filter code. Now suppose you write the following Java classes:
  • StatementEncryptionFilter -- Looks at the event and encrypts the statement, using secret key (2-way) encryption.
  • StatementDecryptionFilter -- Looks at the event and decrypts the statement using 2-way decryption.
Implementing encryption itself is not very difficult in Java. You can follow the cryptographic examples in the Java documentation to get something working quickly. You then add the filters into the Tungsten replicator.properties file. The configuration might look like the following:
# Pre-storage filter(s).  Runs on events before they are stored
# in the transaction history log.
replicator.prefilter=logger,encrypt

# Post-storage filter(s). Runs on events after pulling them from
# storage and before handing off for application.
replicator.postfilter=decrypt,logger
...

# Encrypt/decrypt filter definitions.
replicator.filter.encrypt=\
org.foo.tungsten.StatementEncryptionFilter
replicator.filter.encrypt.password=!encrypt?!
replicator.filter.decrypt=\
org.foo.tungsten.StatementDecryptionFilter
replicator.filter.decrypt.password=!encrypt?!
There are some subtleties here--when there are multiple filters you need to make sure that the encryption filter is the last one called and similarly that the decryption filter is called first. Otherwise the SQL statements will look like garbage when other filters try to read them.

Another subtlety is row replication, which uses lists of objects to contain replicated values. These are harder to encrypt. In this case it might make sense to alter the Tungsten Replicator architecture to add a configurable storage filter that serializes and deserializes events in the transaction history log. This would be a nice feature addition. It's now JIRA TREP-199--we should be able to implement it within a couple of builds.

To get back to the topic of open source, Mark's suggestion is an excellent example of the power of working with a diverse group of people with similar interests. His idea seems straightforward to implement and also leads to a simple but powerful improvement to the replicator design.

Do you have any ultra-cool replication ideas? Post them on the Tungsten Replicator mailing list or just implement them directly. I'm going to set up a project for extensions to the replicator. Contributions are welcome--they will be open source and free to everyone. We'll gladly extend the Replicator to help make your ideas work. Meanwhile I'll keep posting other neat ideas for extensions. Please come on down and join the fun.