One of the usual merits the use, by PostgreSQL, of Multi-Version Concurrency Control (MVCC) is that this eliminates a whole host of reasons to need to lock database objects. On some other database systems, you need to acquire a table lock in order to insert data into the table; that can severely hinder performance. On other systems, read locks can impede writes; with MVCC, PostgreSQL eliminates that whole class of locks in that “old reads” can access “old tuples.” Most of the time, this allows the gentle user of PostgreSQL to not need to worry very much about locks.
Unfortunately, there are several sorts of Slony-I events that do require exclusive locks on PostgreSQL tables, with the result that modifying Slony-I configuration can bring back some of those “locking irritations.” In particular:
A momentary exclusive table lock must be acquired on the “origin” node in order to add the trigger that collects updates for that table. It only needs to be acquired long enough to establish the new trigger.
When a set origin is shifted from one node to another, exclusive locks must be acquired on each replicated table on both the old origin and the new origin in order to change the triggers on the tables.
This operation expressly requests locks on each of the tables in a given replication set on the origin node.
This operation runs a set of SQL queries; in order for it to work, the Slony-I triggers must be removed, followed by the query (which potentially updates the data) running, followed by triggers being restored. The operation therefore must acquire table locks on all replicated tables on each node.
During the SUBSCRIBE_SET
event on
a new subscriber
In a sense, this is the least provocative scenario, since, before the replication set has been populated, it is pretty reasonable to say that the node is “unusable” and that Slony-I could reasonably demand exclusive access to the node.
A change in version 1.2 is that an express LOCK
TABLE
SQL request is submitted in the loop that validates
that all of the tables are there. This means that
all tables in the replication set will be locked
via an exclusive lock for the entire duration of the process of
subscription. By locking the tables early, this means that the
subscription cannot fail after copying some of the data due to some
other process having held on to a table.
In any case, note that this one began with the wording “on a new subscriber.” The locks are applied on the new subscriber. They are not applied on the provider or on the origin.
pg_autovacuum may not be
part of Slony-I, but those that run it find that it wakes up roughly
once per minute and may, at any time, start vacuuming a table, thereby
taking out a ShareUpdateExclusiveLock
lock. This may
block the other events for an unpredictable period of time.
Each of these actions requires, at some point, modifying each of the tables in the affected replication set, which requires acquiring an exclusive lock on the table. Some users that have tried running these operations on Slony-I nodes that were actively servicing applications have experienced difficulties with deadlocks and/or with the operations hanging up.
The obvious question: “What to do about such deadlocks?”
Several possibilities admit themselves:
Announce an application outage to avoid deadlocks
If you can temporarily block applications from using the database, that will provide a window of time during which there is nothing running against the database other than administrative processes under your control.
Try the operation, hoping for things to work
Since nothing prevents applications from leaving access locks in your way, you may find yourself deadlocked. But if the number of remaining locks are small, you may be able to negotiate with users to “get in edgewise.”
Use pgpool
If you can use this or some similar “connection broker”, you may be able to tell the connection manager to stop using the database for a little while, thereby letting it “block” the applications for you. What would be ideal would be for the connection manager to hold up user queries for a little while so that the brief database outage looks, to them, like a period where things were running slowly.
Rapid Outage Management
The following procedure may minimize the period of the outage:
Modify pg_hba.conf
so that only
the slony
user
will have access to the database.
Issue a kill -SIGHUP
to the
PostgreSQL postmaster.
This will not kill off existing possibly-long-running queries, but will prevent new ones from coming in. There is an application impact in that incoming queries will be rejected until the end of the process.
If “all looks good,” then it should be safe to proceed with the Slony-I operation.
If some old query is lingering around, you may need
to kill -SIGQUIT
one of the PostgreSQL processes.
This will restart the backend and kill off any lingering queries. You
probably need to restart the slon processes that
attach to the node.
At that point, it will be safe to proceed with the Slony-I operation; there will be no competing processes.
Reset pg_hba.conf
to allow other
users in, and kill -SIGHUP
the postmaster to make
it reload the security configuration.
The section Section 15, “Database Schema Changes (DDL)” suggests some additional techniques that may be useful, such as moving tables between replication sets in such a way that you minimize the set of tables that need to be locked.
Regrettably, there is no perfect answer to this. If it is necessary to submit a MOVE SET request, then it is presumably necessary to accept the brief application outage. As Slony-I/ pgpool linkages improve, that may become a better way to handle this.