Until recently, I was a student employee at the Oregon State University Open Source Lab. My career there ended, like many, with that painful process known as graduation. I got invaluable experience at the lab, not the least of which being the knowledge gained as their main (only) database administrator. One of my great pleasures in that position, was learning how to configure MySQL replication and manage clusters of replicating database servers. Even the simple case of a single master and a single slave has its edge cases. There are endless tips and tricks to make long-term maintenance easier and to allow you to make full use of this technology. Running replicating database servers for Drupal in particular has several tricks to it:
Replication Settings - Timewarp
A major issue with replication in MySQL is that it has a tendency to fall behind. There are various tricks you can employ to assist with this and some big ones for Drupal specifically. First, if you are going to use the search module, you will need to stop the replication of the temp tables it uses. These will be replicated to the slave as they are technically volatile queries, however, they are then never used on the slave. Therefore, they are just replication overhead and because replication is single-threaded they will block the slave's replication thread until they are finished. The way to stop these is to add the following to the my.cnf on the slave:
where "drupal" is the name of the database containing your site. This simply tells MySQL to ignore all replicated commands pertaining to these tables. This means all of those create temporary table statements used by the search module, specifically the do_search method.
Another major reason for replication falling behind is the Drupal cron. Drupal is missing some indexes on the tables that get swept by cron for cleansing. This usually doesn't matter, but as noted above replication is single threaded. Going through these tables without indexes takes far too long and stalls the replication thread. The big problem tables here are history and watchdog, watchdog in particular. You will need to add an index to the timestamp columns of these tables or replication will eventually start stalling.
Failover - Preparing For Disaster
It is useful to be able to fail back and forth between your master and slave. You can do this either by changing the actual site configuration, changing a DNS pointer or actually utilizing something like linux-ha (linux-ha.org) to have true IP based failover. However you decide to implement this, it is not an instant process and there is the possibility (the likelihood) that for a second you will have processes writing to both DB's. To prepare for this, you need to configure auto increment offsets. This will allow for auto_increment columns on the slave and the master to choose different numbers and thus not conflict. You do this with an offset and an increment. For example, you might do this:
This means that the auto_inc columns will all increment by a block of 2 (actually 3) and this particular server will offset 2 into that block to choose its ID #. The other server would be setup in exactly the same way, but with a different offset, perhaps 1. This is not a perfect solution for Drupal, at least at the moment, but helps significantly. Down the road this will become much more effective. As it is, I would avoid the DNS methods of failover that prolong the "purgatory" period where both DB's maybe being written to.
Replication Checking - Reporting
The other major issue with replication is that there is no data security. I mean absolutely none. If there is line noise or other stack errors, the DB will gracefully accept them and never question the data its being given. For this reason, I highly recommend you check out maatkit: http://www.maatkit.org/. A tool included in this kit is mk-table-checksum. This allows you to run a checksum across the tables on the master and the slave. There are several ways to do this, one being just a straight up run on both boxes and a comparison. However, you can also create a special table on the master and do a checksum on the master that will be inserted into this table and replicated to the slave. Then the checksums are checked on the slave as they replicate back. This tool is very cool and has uncovered a number of issues for me in the past. I recommend running this weekly via a cron job and having the results emailed to you if they are non-empty.
A note, because of slave lag seeing differences in the session and search tables are fairly common.
A Small Note: You may want to define a report host in your slave and master. You do this by adding report-host=masterIP on the slave and report-host=slaveIP on the master. Where masterIP is the masters IP and likewise for the slaveIP. This will make the boxes report to each other their hostname and some other information and allows you to query "show slave hosts". While this seems semi-pointless, it ends up being quite useful.
I highly recommend using nagios to watch replication status. Having replication stop without you knowing is a huge issue and does tend happen. There is a timeout that causes replication to stop if a query stalls. This is a good thing, but can leave your slave very divergent and your users very confused. It is absolutely critical that someone be paged when this happens, so that they can assess the situation and either fail to one server or restart replication.