Re: [HACKERS] Check that streaming replica received all data after master shutdown

Sameer Kumar <sameer.kumar@xxxxxxxxxx> · Fri, 16 Jan 2015 12:16:28 +0800

On Thu, Jan 15, 2015 at 6:19 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@xxxxxxxxxxxxx> wrote:
> On Wed, Jan 14, 2015 at 2:11 AM, Heikki Linnakangas <hlinnakangas@xxxxxxxxxx

> > wrote:

>

> > On 01/13/2015 12:11 PM, Vladimir Borodin wrote:

> >

> >>

> >> 05 янв. 2015 г., в 18:15, Vladimir Borodin <root@xxxxxxxxxxx> написал(а):

> >>

> >>  Hi all.

> >>>

> >>> I have a simple script for planned switchover of PostgreSQL (9.3 and

> >>> 9.4) master to one of its replicas. This script checks a lot of things

> >>> before doing it and one of them is that all data from master has been

> >>> received by replica that is going to be promoted. Right now the check is

> >>> done like below:

> >>>

> >>> On the master:

> >>>

> >>> postgres@pgtest03d ~ $ psql -t -A -c 'select

> >>> pg_current_xlog_location();'

> >>> 0/33000090

> >>> postgres@pgtest03d ~ $ /usr/pgsql-9.3/bin/pg_ctl stop -m fast

> >>> waiting for server to shut down.... done

> >>> server stopped

> >>> postgres@pgtest03d ~ $ /usr/pgsql-9.3/bin/pg_controldata | head

> >>> pg_control version number:            937

> >>> Catalog version number:               201306121

> >>> Database system identifier:           6061800518091528182

> >>> Database cluster state:               shut down

> >>> pg_control last modified:             Mon 05 Jan 2015 06:47:57 PM MSK

> >>> Latest checkpoint location:           0/34000028

> >>> Prior checkpoint location:            0/33000028

> >>> Latest checkpoint's REDO location:    0/34000028

> >>> Latest checkpoint's REDO WAL file:    0000001B0000000000000034

> >>> Latest checkpoint's TimeLineID:       27

> >>> postgres@pgtest03d ~ $

> >>>

> >>> On the replica (after shutdown of master):

> >>>

> >>> postgres@pgtest03g ~ $ psql -t -A -c "select

> >>> pg_xlog_location_diff(pg_last_xlog_replay_location(), '0/34000028');"

> >>> 104

> >>> postgres@pgtest03g ~ $

> >>>

> >>> These 104 bytes seems to be the size of shutdown checkpoint record (as I

> >>> can understand from pg_xlogdump output).

> >>>

> >>> postgres@pgtest03g ~/9.3/data/pg_xlog $ /usr/pgsql-9.3/bin/pg_xlogdump

> >>> -s 0/33000090 -t 27

> >>> rmgr: XLOG        len (rec/tot):      0/    32, tx:          0, lsn:

> >>> 0/33000090, prev 0/33000028, bkp: 0000, desc: xlog switch

> >>> rmgr: XLOG        len (rec/tot):     72/   104, tx:          0, lsn:

> >>> 0/34000028, prev 0/33000090, bkp: 0000, desc: checkpoint: redo 0/34000028;

> >>> tli 27; prev tli 27; fpw true; xid 0/6010; oid 54128; multi 1; offset 0;

> >>> oldest xid 1799 in DB 1; oldest multi 1 in DB 1; oldest running xid 0;

> >>> shutdown

> >>> pg_xlogdump: FATAL:  error in WAL record at 0/34000028: record with zero

> >>> length at 0/34000090

> >>>

> >>> postgres@pgtest03g ~/9.3/data/pg_xlog $

> >>>

> >>> I’m not sure that these 104 bytes will always be 104 bytes to have a

> >>> strict equality while checking. Could it change in the future? Or is there

> >>> a better way to understand that streaming replica received all data after

> >>> master shutdown? The check that pg_xlog_location_diff returns 104 bytes

> >>> seems a bit strange.

> >>>

> >>

> > Don't rely on it being 104 bytes. It can vary across versions, and across

> > different architectures.

> >

> > You could simply check that the standby's pg_last_xlog_replay_location() >

> > master's "Latest checkpoint location", and not care about the exact

> > difference.

> >

>

>  I believe there were some changes made in v9.3 which will wait for pending

> WALs to be replicated before a fast and smart shutdown (of master) can

> close the replication connection.

>

> http://git.postgresql.org/pg/commitdiff/985bd7d49726c9f178558491d31a570d47340459

I don't understand the relation between it and 104 bytes, it says

that the change is backpatched up to 9.1. Since it assures all

xlog records to be transferred if no trouble happens. Relying on

the mechanism, you don't need to check that if master is known to

have gracefully shut down and had no trouble around the

environment. Judging from that you want this check, I suppose

you're not guaranteed not to have trouble or not trusting the

mechanism itself.

Right! I was coming from the point that if master has shutdown gracefully then you don't really need to worry about ensuring with such checks on Standby (it is supposed to get the pending WAL before master goes down. 

This obviously (as rightly pointed out by you), would not work if master has not shutdown gracefully or if there is a connection issue between master and slave while master is being shutdown (even if it is smart or fast shutdown). 

Given the condition, as Alvaro said upthread, verifying that the

last record is a shutdown checkpoint should raise a lot the

chance for the all record being received except for the exteme

case such that the master have upped and downed while replication

connection cannot be made. 

I am not sure if this would cover the cases where the master has gone down abruptly or has crashed (or the service has been killed).

For the case, I think there's no means

to confirm that by standby alone, you should at least compare the

next LSN to the last xlog record with the old master by any

means.

That is the method that occurred to me as well while reading the first part of your comments. :)

Or doing any sanity check of the database on the standby

utilizing the nature of the data instead?

Best Regards,
Sameer Kumar | Database Consultant
ASHNIK PTE. LTD.
101 Cecil Street, #11-11 Tong Eng Building, Singapore 069533
M: +65 8110 0350  T: +65 6438 3504 | www.ashnik.com

This email may contain confidential, privileged or copyright material and is solely for the use of the intended recipient(s).