Just curious, is there a reason why you can't use pg_basebackup?

On Wed, Sep 19, 2012 at 12:27 PM, Mike Roest <mike.roest@xxxxxxxxxxxx> wrote:
>
>> Is there any hidden issue with this that we haven't seen? Or does anyone
>> have suggestions as to an alternate procedure that will allow 2 slaves to
>> sync concurrently?
>>
> With some more testing I've done today I seem to have found an issue with
> this procedure.
> When the slave starts up after the sync it reaches what it thinks is a
> consistent recovery point very fast, based on the pg_stop_backup.
>
> eg:
> (from the recover script)
> 2012-09-19 12:15:02: pgsql_start start
> 2012-09-19 12:15:31: pg_start_backup
> 2012-09-19 12:15:31: -----------------
> 2012-09-19 12:15:31: 61/30000020
> 2012-09-19 12:15:31: (1 row)
> 2012-09-19 12:15:31:
> 2012-09-19 12:15:32: NOTICE: pg_stop_backup complete, all required WAL segments have been archived
> 2012-09-19 12:15:32: pg_stop_backup
> 2012-09-19 12:15:32: ----------------
> 2012-09-19 12:15:32: 61/300000D8
> 2012-09-19 12:15:32: (1 row)
> 2012-09-19 12:15:32:
>
> While the sync was running (but after the pg_stop_backup) I pushed a bunch
> of traffic against the master server.
> Which got me to a current xlog location of
> postgres=# select pg_current_xlog_location();
>  pg_current_xlog_location
> --------------------------
>  61/6834C450
> (1 row)
>
> The startup of the slave after the sync completed:
> 2012-09-19 12:42:49.976 MDT [18791]: [1-1] LOG: database system was interrupted; last known up at 2012-09-19 12:15:31 MDT
> 2012-09-19 12:42:49.976 MDT [18791]: [2-1] LOG: creating missing WAL directory "pg_xlog/archive_status"
> 2012-09-19 12:42:50.143 MDT [18791]: [3-1] LOG: entering standby mode
> 2012-09-19 12:42:50.173 MDT [18792]: [1-1] LOG: streaming replication successfully connected to primary
> 2012-09-19 12:42:50.487 MDT [18791]: [4-1] LOG: redo starts at 61/30000020
> 2012-09-19 12:42:50.495 MDT [18791]: [5-1] LOG: consistent recovery state reached at 61/31000000
> 2012-09-19 12:42:50.495 MDT [18767]: [2-1] LOG: database system is ready to accept read only connections
>
> It shows the DB reached a consistent state as of 61/31000000, which is well
> behind the current location of the master (and the data files that were
> synced over to the slave). And monitoring the server showed the expected
> slave delay that disappeared as the slave pulled and recovered from the WAL
> files that got generated after the pg_stop_backup.
>
> But based on this it looks like this procedure would end up with an
> indeterminate amount of time (based on how much traffic the master processed
> while the slave was syncing) that the slave couldn't be trusted for fail
> over or querying, as the server is up and running but is not actually in a
> consistent state.
>
> Thinking it through the more complicated script version of the 2 server
> recovery (where first past the post to run start_backup or stop_backup)
> would also have this issue (although our failover slave would always be the
> one running stop backup as it syncs faster so at least it would be always
> consistent but the DR would still have the problem)

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general