Re: initial sync of multiple streaming slaves simultaneously

Lonni J Friedman <netllama@xxxxxxxxx> · Wed, 19 Sep 2012 12:34:20 -0700

Just curious, is there a reason why you can't use pg_basebackup ?

On Wed, Sep 19, 2012 at 12:27 PM, Mike Roest <mike.roest@xxxxxxxxxxxx> wrote:
>
>> Is there any hidden issue with this that we haven't seen.  Or does anyone
>> have suggestions as to an alternate procedure that will allow 2 slaves to
>> sync concurrently.
>>
> With some more testing I've done today I seem to have found an issue with
> this procedure.
> When the slave starts up after the sync It reaches what it thinks is a
> consistent recovery point very fast based on the pg_stop_backup
>
> eg:
> (from the recover script)
> 2012-09-19 12:15:02: pgsql_start start
> 2012-09-19 12:15:31: pg_start_backup
> 2012-09-19 12:15:31: -----------------
> 2012-09-19 12:15:31: 61/30000020
> 2012-09-19 12:15:31: (1 row)
> 2012-09-19 12:15:31:
> 2012-09-19 12:15:32: NOTICE:  pg_stop_backup complete, all required WAL
> segments have been archived
> 2012-09-19 12:15:32: pg_stop_backup
> 2012-09-19 12:15:32: ----------------
> 2012-09-19 12:15:32: 61/300000D8
> 2012-09-19 12:15:32: (1 row)
> 2012-09-19 12:15:32:
>
> While the sync was running (but after the pg_stop_backup) I pushed a bunch
> of traffic against the master server.  Which got me to a current xlog
> location of
> postgres=# select pg_current_xlog_location();
>  pg_current_xlog_location
> --------------------------
>  61/6834C450
> (1 row)
>
> The startup of the slave after the sync completed:
> 2012-09-19 12:42:49.976 MDT [18791]: [1-1] LOG:  database system was
> interrupted; last known up at 2012-09-19 12:15:31 MDT
> 2012-09-19 12:42:49.976 MDT [18791]: [2-1] LOG:  creating missing WAL
> directory "pg_xlog/archive_status"
> 2012-09-19 12:42:50.143 MDT [18791]: [3-1] LOG:  entering standby mode
> 2012-09-19 12:42:50.173 MDT [18792]: [1-1] LOG:  streaming replication
> successfully connected to primary
> 2012-09-19 12:42:50.487 MDT [18791]: [4-1] LOG:  redo starts at 61/30000020
> 2012-09-19 12:42:50.495 MDT [18791]: [5-1] LOG:  consistent recovery state
> reached at 61/31000000
> 2012-09-19 12:42:50.495 MDT [18767]: [2-1] LOG:  database system is ready to
> accept read only connections
>
> It shows the DB reached a consistent state as of 61/31000000 which is well
> behind the current location of the master (and the data files that were
> synced over to the slave).  And monitoring the server showed the expected
> slave delay that disappeared as the slave pulled and recovered from the WAL
> files that go generated after the pg_stop_backup.
>
> But based on this it looks like this procedure would end up with a
> indeterminate amount of time (based on how much traffic the master processed
> while the slave was syncing) that the slave couldn't be trusted for fail
> over or querying as the server is up and running but is not actually in a
> consistent state.
>
> Thinking it through the more complicated script version of the 2 server
> recovery (where first past the post to run start_backup or stop_backup)
> would also have this issue (although our failover slave would always be the
> one running stop backup as it syncs faster so at least it would be always
> consistent but the DR would still have the problem)

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general