Slony-I actually does a lot of its necessary maintenance itself, in a “cleanup” thread:
Deletes old data from various tables in the Slony-I cluster's namespace, notably entries in sl_log_1, sl_log_2 (not yet used), and sl_seqlog.
Vacuums certain tables used by Slony-I. As of 1.0.5, this includes pg_listener; in earlier versions, you must vacuum that table heavily yourself, otherwise you'll find replication slowing down because Slony-I raises plenty of events, which leads to that table accumulating plenty of dead tuples.
In some versions (1.1 for sure; possibly 1.0.5) there is the option of not bothering to vacuum any of these tables if you are using something like pg_autovacuum to handle the vacuuming. Unfortunately, pg_autovacuum has been known not to vacuum quite frequently enough, so you probably want to use the internal vacuums. Vacuuming pg_listener “too often” isn't nearly as hazardous as not vacuuming it frequently enough.
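If you are on a version that does not vacuum these tables for you, or you have disabled the internal vacuums, a manual vacuum along the following lines can be run from cron. This is only a sketch: the cluster namespace _testcluster is an assumption borrowed from the examples later in this section, and should be replaced with your own cluster name.
-- Sketch: vacuum the tables the cleanup thread would otherwise handle.
-- "_testcluster" is an assumed cluster name; substitute your own.
VACUUM ANALYZE pg_catalog.pg_listener;
VACUUM ANALYZE "_testcluster".sl_log_1;
VACUUM ANALYZE "_testcluster".sl_seqlog;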
Unfortunately, if you have long-running transactions, vacuums cannot clear out dead tuples that are newer than the eldest transaction that is still running. This will most notably lead to pg_listener growing large and will slow replication.
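To check whether such a transaction is holding things back, you can look at transaction ages in pg_stat_activity. The query below is a sketch assuming a reasonably recent PostgreSQL; the exact column names (pid versus procpid, and the availability of xact_start) have varied across PostgreSQL versions.
-- Sketch: list sessions by transaction age; long-lived entries here keep
-- vacuums from reclaiming dead tuples. (Column names vary across versions.)
SELECT pid, usename, now() - xact_start AS transaction_age
  FROM pg_stat_activity
 WHERE xact_start IS NOT NULL
 ORDER BY xact_start;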
The Duplicate Key Violation bug has helped track down some PostgreSQL race conditions. One remaining issue is that there appears to be a case where VACUUM is not reclaiming space correctly, leading to corruption of B-trees. It may be helpful to run the command REINDEX TABLE sl_log_1; periodically to avoid the problem occurring.
As of version 1.2, “log switching” functionality is in place; every so often, it switches between storing data in sl_log_1 and sl_log_2 so that it can TRUNCATE the “elder” data. That means that, on a regular basis, these tables are completely cleared out, so you will not suffer from them having grown to some significant size under heavy load and then being incapable of shrinking back down.
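If you want to confirm that log switching is keeping those tables small, a rough check of their on-disk sizes can be made with pg_relation_size(); one of the two should periodically shrink back to near zero after a switch. As before, _testcluster is an assumed cluster name.
-- Sketch: on-disk size of the two log tables; after a log switch, the
-- "elder" one should drop back to (nearly) nothing.
SELECT pg_relation_size('"_testcluster".sl_log_1') AS sl_log_1_bytes,
       pg_relation_size('"_testcluster".sl_log_2') AS sl_log_2_bytes;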
There are a couple of “watchdog” scripts available that monitor things, and restart the slon processes should they happen to die for some reason, such as a network “glitch” that causes loss of connectivity.
You might want to run them...
The “best new way” of managing slon processes is via the combination of Section 19.2, “mkslonconf.sh”, which creates a configuration file for each node in a cluster, and Section 19.3, “launch_clusters.sh”, which uses those configuration files.
This approach is preferable to elder “watchdog” systems in that you can very precisely “nail down,” in each config file, the exact desired configuration for each node, and not need to be concerned with what options the watchdog script may or may not give you. This is particularly important if you are using log shipping, where forgetting the -a option could ruin your log shipped node, and thereby your whole day.
A new script for Slony-I 1.1 is generate_syncs.sh, which addresses the following kind of situation.
Suppose you have a somewhat flakey server where the slon daemon might not run all the time; you might return from a weekend away only to discover the following situation.
On Friday night, something went “bump” and while the database came back up, none of the slon daemons survived. Your online application then saw nearly three days worth of reasonably heavy transaction load.
When you restart slon on Monday, it hasn't done a SYNC on the master since Friday, so that the next “SYNC set” comprises all of the updates between Friday and Monday. Yuck.
If you run generate_syncs.sh as a cron job every 20 minutes, it will force in a periodic SYNC on the origin, which means that between Friday and Monday, the numerous updates are split into more than 100 syncs, which can be applied incrementally, making the cleanup a lot less unpleasant. Note that if SYNCs are running regularly, this script won't bother doing anything.
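To see whether SYNC events are indeed being generated regularly, you can look at sl_event on the origin. This is a sketch; _testcluster is an assumed cluster name, and node 1 is assumed to be the origin.
-- Sketch: the most recent SYNC events raised on node 1 (assumed origin).
SELECT ev_seqno, ev_timestamp
  FROM "_testcluster".sl_event
 WHERE ev_origin = 1
   AND ev_type = 'SYNC'
 ORDER BY ev_seqno DESC
 LIMIT 10;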
In the tools directory, you may find scripts called test_slony_state.pl and test_slony_state-dbi.pl. One uses the Perl/DBI interface; the other uses the Pg interface.
Both do essentially the same thing, namely to connect to a Slony-I node (you can pick any one), and from that, determine all the nodes in the cluster. They then run a series of queries (read only, so this should be quite safe to run) which look at the various Slony-I tables, looking for a variety of sorts of conditions suggestive of problems, including:
Bloating of tables like pg_listener, sl_log_1, sl_log_2, sl_seqlog
Listen paths
Analysis of Event propagation
Analysis of Event confirmation propagation
If communications are a little broken, replication may happen, but confirmations may not get back, which prevents nodes from clearing out old events and old replication data.
Running this once an hour or once a day can help you detect symptoms of problems early, before they lead to performance degradation.
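If you want a quick manual version of the confirmation check, a query against sl_confirm along the following lines shows how far, and how recently, each node has confirmed events from each origin. Again, _testcluster is an assumed cluster name.
-- Sketch: latest confirmed event and confirmation age per (origin, receiver).
SELECT con_origin, con_received,
       max(con_seqno)              AS last_confirmed_event,
       now() - max(con_timestamp)  AS confirmation_age
  FROM "_testcluster".sl_confirm
 GROUP BY con_origin, con_received
 ORDER BY con_origin, con_received;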
In the tools directory may be found four scripts that may be used to do monitoring of Slony-I instances:
test_slony_replication is a Perl script to which you can pass connection information to get to a Slony-I node. It then queries sl_path and other information on that node in order to determine the shape of the requested replication set.
It then injects some test queries to a test table called slony_test, which is defined as follows and which needs to be added to the set of tables being replicated:
CREATE TABLE slony_test (
    description text,
    mod_date timestamp with time zone,
    "_Slony-I_testcluster_rowID" bigint DEFAULT nextval('"_testcluster".sl_rowid_seq'::text) NOT NULL
);
The last column in that table was added by Slony-I because the table lacks a primary key of its own.
This script generates a line of output for each Slony-I node that is active for the requested replication set in a file called cluster.fact.log.
There is an additional finalquery option that allows you to pass in an application-specific SQL query that can determine something about the state of your application.
log.pm is a Perl module that manages logging for the Perl scripts.
run_rep_tests.sh is a “wrapper” script that runs test_slony_replication. If you have several Slony-I clusters, you might set up configuration in this file to connect to all those clusters.
nagios_slony_test is a script that was constructed to query the log files so that you might run the replication tests every so often (we run them every 6 minutes), and then a system monitoring tool such as Nagios can be set up to use this script to query the state indicated in those logs.
It seemed rather more efficient to have a cron job run the tests and have Nagios check the results rather than having Nagios run the tests directly. The tests can exercise the whole Slony-I cluster at once rather than Nagios invoking updates over and over again.
The methodology of the previous section is designed with a view to minimizing the cost of submitting replication test queries; on a busy cluster, supporting hundreds of users, the cost associated with running a few queries is likely to be pretty irrelevant, and the setup cost to configure the tables and data injectors is pretty high.
Three other methods for analyzing the state of replication have stood out:
For an application-oriented test, it has been useful to set up a view on some frequently updated table that pulls application-specific information.
One might look, for instance, either at some statistics about the most recently created application object, or at the most recent application transaction:
create view replication_test as
    select now() - txn_time as age, object_name
      from transaction_table
     order by txn_time desc limit 1;

create view replication_test as
    select now() - created_on as age, object_name
      from object_table
     order by id desc limit 1;
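Once such a view is in the replication set, checking a subscriber is simply a matter of querying it and comparing the age against whatever threshold you consider acceptable, for instance:
-- Sketch: report anything older than an (arbitrary) ten minute threshold.
select age, object_name
  from replication_test
 where age > '10 minutes'::interval;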
There is a downside: This approach requires that you have regular activity going through the system that will lead to there being new transactions on a regular basis. If something breaks down with your application, you may start getting spurious warnings about replication being behind, despite the fact that replication is working fine.
The Slony-I-defined view sl_status provides information as to how up to date different nodes are. Its contents are only really interesting on origin nodes, as the events generated on other nodes are generally ignorable.
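On an origin node, a simple query of the view gives a per-subscriber picture of lag; this is a sketch, where _testcluster is an assumed cluster name and the lag columns (st_lag_num_events, st_lag_time) are as provided in recent versions.
-- Sketch: how far behind each receiving node is, as seen from the origin.
SELECT st_origin, st_received, st_lag_num_events, st_lag_time
  FROM "_testcluster".sl_status
 ORDER BY st_received;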
See also the Section 5.2, “ Monitoring Slony-I using MRTG ” discussion.
slon daemons generate some more-or-less verbose log files, depending on what debugging level is turned on. You might assortedly wish to:
Use a log rotator like Apache rotatelogs to have a sequence of log files so that no one of them gets too big;
Purge out old log files, periodically.