pg cluster not cleaning up after failover

Hello all,
  I'm having an issue with a PostgreSQL 9.2 cluster during failover and hope you can help. I have been following the guide provided at ClusterLabs(1), but I'm not having much luck and don't quite understand where the issue is. I'm running on Debian Wheezy.

  My crm_mon output is below. One server is PRI and operating normally after taking over. I have PostgreSQL set up to do WAL archiving via rsync to the opposite node: <archive_command = 'rsync -a %p test-node2:/db/data/postgresql/9.2/pg_archive/%f'>. The rsync is working, and I do see WAL files arriving on the other host as expected.
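For reference, the archiving-related settings on the primary are along these lines (a sketch; only archive_command is verbatim from my setup, the rest are the usual 9.2 companions for streaming replication):

```
# postgresql.conf on the current primary (sketch; archive_command is
# exact, the other lines are assumed typical values)
wal_level = hot_standby
archive_mode = on
archive_command = 'rsync -a %p test-node2:/db/data/postgresql/9.2/pg_archive/%f'
max_wal_senders = 5
hot_standby = on
```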

  Node2 was the PRI. Last night node1, which had been in HS:sync, was promoted to PRI, and node2 was stopped. WAL files are now arriving on node2 from node1. I cleaned up the /tmp/PGSQL.lock file and proceeded with a pg_basebackup restore from node1. This all completed without error according to the node1 postgresql log.
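For context, the recovery.conf that the pgsql resource agent generates on node2 should look roughly like this (values assumed from the paths, hostnames, and application_name visible elsewhere in this post, not copied from the actual file):

```
# recovery.conf on node2 as generated by the pgsql RA (sketch)
standby_mode = 'on'
primary_conninfo = 'host=test-node1 port=5432 user=postgres application_name=test-node2'
restore_command = 'cp /db/data/postgresql/9.2/pg_archive/%f %p'
recovery_target_timeline = 'latest'
```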

  After running a crm cleanup on the msPostgresql resource, node2 keeps showing 'LATEST' but gets stuck at HS:alone. I also don't understand why node2's pgsql-xlog-loc shows 0000001EB9053DD8, which is ahead of node1's pgsql-master-baseline of 0000001EB2000080. I did see the 'cannot stat ... 000000010000001E000000BB' error, but that seems to always happen for the current xlog file.
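Those two attribute values can be compared directly as 64-bit hex integers (0000001EB9053DD8 is the LSN 1E/B9053DD8 with the slash stripped), which I assume is roughly the comparison the resource agent makes when deciding data status. A quick bash sketch with the values from crm_mon above:

```shell
#!/usr/bin/env bash
# Compare node2's xlog location against node1's master baseline as
# plain 64-bit hex integers (values taken from the crm_mon output).
xlog_loc=$((16#0000001EB9053DD8))   # node2 pgsql-xlog-loc  (1E/B9053DD8)
baseline=$((16#0000001EB2000080))   # node1 pgsql-master-baseline (1E/B2000080)
if (( xlog_loc > baseline )); then
    echo "node2 is ahead of the promotion baseline"
fi
```

So node2 really is past the point where node1 was promoted, which would explain the agent refusing to treat it as a clean sync candidate.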

  And if I wasn't confused enough, the postgresql log on node2 says "streaming replication successfully connected to primary", and the pg_stat_replication query on node1 shows it connected, but ASYNC.
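As I understand it, ASYNC just means the primary's synchronous_standby_names does not (yet) name the standby's application_name; with rep_mode=sync the pgsql RA is supposed to flip this itself once it considers the standby healthy, by writing its rep_mode.conf include file. A sketch of what the primary would need for SYNC (the rep_mode.conf mechanism is my understanding of the RA, not something from the logs above):

```
# On node1 (primary): managed by the pgsql RA in its rep_mode.conf
# include when rep_mode=sync and the standby is accepted (sketch)
synchronous_standby_names = 'test-node2'
```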


Any ideas?


Very much appreciated!
-With kind regards,
 Peter Brunnengräber



References:
(1) http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster#after_fail-over


###
============
Last updated: Wed Jul 13 14:51:53 2016
Last change: Wed Jul 13 14:49:17 2016 via crmd on test-node2
Stack: openais
Current DC: test-node1 - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, 2 expected votes
4 Resources configured.
============

Online: [ test-node1 test-node2 ]

Full list of resources:

 Resource Group: g_master
     ClusterIP-Net1     (ocf::heartbeat:IPaddr2):       Started test-node1
     ReplicationIP-Net2 (ocf::heartbeat:IPaddr2):       Started test-node1
 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ test-node1 ]
     Slaves: [ test-node2 ]

Node Attributes:
* Node test-node1:
    + master-pgsql:0                    : 1000
    + master-pgsql:1                    : 1000
    + pgsql-data-status                 : LATEST
    + pgsql-master-baseline             : 0000001EB2000080
    + pgsql-status                      : PRI
* Node test-node2:
    + master-pgsql:0                    : -INFINITY
    + master-pgsql:1                    : -INFINITY
    + pgsql-data-status                 : LATEST
    + pgsql-status                      : HS:alone
    + pgsql-xlog-loc                    : 0000001EB9053DD8

Migration summary:
* Node test-node2:
* Node test-node1:


#### Node2
2016-07-13 14:55:09 UTC LOG:  database system was interrupted; last known up at 2016-07-13 14:54:27 UTC
2016-07-13 14:55:09 UTC LOG:  creating missing WAL directory "pg_xlog/archive_status"
cp: cannot stat `/db/data/postgresql/9.2/pg_archive/00000002.history': No such file or directory
2016-07-13 14:55:09 UTC LOG:  entering standby mode
2016-07-13 14:55:09 UTC LOG:  restored log file "000000010000001E000000BA" from archive
2016-07-13 14:55:09 UTC FATAL:  the database system is starting up
2016-07-13 14:55:09 UTC LOG:  redo starts at 1E/BA000020
2016-07-13 14:55:09 UTC LOG:  consistent recovery state reached at 1E/BA05FED8
2016-07-13 14:55:09 UTC LOG:  database system is ready to accept read only connections
cp: cannot stat `/db/data/postgresql/9.2/pg_archive/000000010000001E000000BB': No such file or directory
cp: cannot stat `/db/data/postgresql/9.2/pg_archive/00000002.history': No such file or directory
2016-07-13 14:55:09 UTC LOG:  streaming replication successfully connected to primary


#### Node1
postgres=# select application_name,upper(state),upper(sync_state) from pg_stat_replication;
+------------------+-----------+-------+
| application_name |   upper   | upper |
+------------------+-----------+-------+
| test-node2       | STREAMING | ASYNC |
+------------------+-----------+-------+
(1 row)



-- 
Sent via pgsql-admin mailing list (pgsql-admin@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin



