Re: [BDR] Best practice to automatically abort a DDL operation when one node is down

Craig Ringer <craig@xxxxxxxxxxxxxxx> · Mon, 18 Jan 2016 08:52:43 +0800

On 13 January 2016 at 21:45, Sylvain MARECHAL <marechal.sylvain2@xxxxxxxxx> wrote:

> The problem is that the (1) DDL request will wait indefinitely, meaning all transactions will continue to fail until the DDL operation is manually aborted (for example, by pressing Ctrl+C in psql to abort the "CREATE TABLE").

Correct, and by design.

I'd like to do a pre-check where we sync up with the peer nodes and see if they're all alive before we take the DDL lock. This would reduce the impact a bit and allow an early ERROR like "ERROR: cannot perform DDL when one or more nodes is unreachable".

However... we have something pretty close already. You can just set a statement_timeout in the session doing the DDL. It'll cancel the operation if it takes too long.

Note that a lock_timeout will NOT work because the BDR global DDL lock is not recognised as a true lock by PostgreSQL.
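
As a minimal sketch of the statement_timeout approach: the 10 second value and the table name below are placeholders chosen for illustration, not recommendations, so pick a timeout that suits your own workload.

```sql
-- Sketch only: '10s' and example_table are illustrative placeholders.
-- Applies to every later statement in this session until reset.
SET statement_timeout = '10s';

-- If the statement (including its wait for the BDR global DDL lock)
-- runs longer than the timeout, PostgreSQL cancels it with
-- "ERROR:  canceling statement due to statement timeout".
CREATE TABLE example_table (id integer PRIMARY KEY);

-- Restore the session default so later statements are unaffected.
RESET statement_timeout;
```

The timeout is set per session here so that it only affects the connection issuing the DDL, not normal application traffic.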

> What is the best practice to make sure the DDL operation will fail, possibly after a timeout, if one of the nodes is down?

statement_timeout

> I could check the state of the node before issuing the DDL operation, but this solution is far from perfect, as the node may fail right after the check.

Correct, but it's still useful to do.

I'd check in pg_stat_replication that all nodes are connected, then issue the DDL with a statement_timeout set.
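
A rough sketch of that pre-check, run on the node where the DDL will be issued. It assumes each connected peer shows up as one row in the "streaming" state. How many rows to expect depends on your topology, so compare the result against your own node count.

```sql
-- Sketch only: list the replication connections this node currently has.
-- A peer that is down or unreachable will be missing from the result,
-- or will appear in a state other than 'streaming'.
SELECT application_name, client_addr, state
FROM pg_stat_replication;
```

This only narrows the window. A node can still fail between the check and the DDL, which is why the statement_timeout is still needed.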

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services