Re: Strange replication problem - segment restored from archive but still requested from master

Venkata Balaji N <nag1010@xxxxxxxxx> · Fri, 22 May 2015 10:16:27 +1000

On Fri, May 22, 2015 at 6:52 AM, Piotr Gasidło <quaker@xxxxxxxxxxxxxx> wrote:
Got strange problem. Unable to repeat, but got logs.

Simple master-slave using streaming replication.

Master is running. Slave is down.

Segment 0000000400004C4D00000090 was successfully archived and send

from master to slave.

Now I've started slave, and:

ay 21 21:23:37 d8 postgres[50645]: [3-1] 2015-05-21 21:23:37.033 CEST

@ 50645   LOG:  database system was shut down in recovery at

2015-05-21 21:22:03 CEST

May 21 21:23:37 d8 postgres[50645]: [4-1] 2015-05-21 21:23:37.034 CEST

@ 50645   LOG:  entering standby mode

May 21 21:23:37 d8 postgres[50645]: [5-1] 2015-05-21 21:23:37.058 CEST

@ 50645   LOG:  restored log file "0000000400004C4D00000088" from

archive

May 21 21:23:37 d8 postgres[50645]: [6-1] 2015-05-21 21:23:37.120 CEST

@ 50645   LOG:  redo starts at 4C4D/88493B50

May 21 21:23:37 d8 postgres[50645]: [7-1] 2015-05-21 21:23:37.226 CEST

@ 50645   LOG:  restored log file "0000000400004C4D00000089" from

archive

May 21 21:23:37 d8 postgres[50645]: [8-1] 2015-05-21 21:23:37.426 CEST

@ 50645   LOG:  restored log file "0000000400004C4D0000008A" from

archive

May 21 21:23:37 d8 postgres[50645]: [9-1] 2015-05-21 21:23:37.750 CEST

@ 50645   LOG:  restored log file "0000000400004C4D0000008B" from

archive

May 21 21:23:38 d8 postgres[50645]: [10-1] 2015-05-21 21:23:38.376

CEST @ 50645   LOG:  restored log file "0000000400004C4D0000008C" from

archive

May 21 21:23:38 d8 postgres[50645]: [11-1] 2015-05-21 21:23:38.690

CEST @ 50645   LOG:  restored log file "0000000400004C4D0000008D" from

archive

May 21 21:23:38 d8 postgres[50645]: [12-1] 2015-05-21 21:23:38.855

CEST @ 50645   LOG:  restored log file "0000000400004C4D0000008E" from

archive

May 21 21:23:39 d8 postgres[50645]: [13-1] 2015-05-21 21:23:39.275

CEST @ 50645   LOG:  restored log file "0000000400004C4D0000008F" from

archive

May 21 21:23:39 d8 postgres[50645]: [14-1] 2015-05-21 21:23:39.654

CEST @ 50645   LOG:  restored log file "0000000400004C4D00000090" from

archive

May 21 21:23:40 d8 postgres[50645]: [15-1] 2015-05-21 21:23:40.222

CEST @ 50645   LOG:  consistent recovery state reached at

4C4D/90FFF9C8

May 21 21:23:40 d8 postgres[50644]: [3-1] 2015-05-21 21:23:40.222 CEST

@ 50644   LOG:  database system is ready to accept read only

connections

May 21 21:23:40 d8 postgres[50645]: [16-1] 2015-05-21 21:23:40.223

CEST @ 50645   LOG:  unexpected pageaddr 4C46/E000000 in log segment

0000000400004C4D00000091, offset 0

May 21 21:23:40 d8 postgres[50699]: [4-1] 2015-05-21 21:23:40.232 CEST

@ 50699   LOG:  started streaming WAL from primary at 4C4D/90000000 on

timeline 4

May 21 21:23:40 d8 postgres[50699]: [5-1] 2015-05-21 21:23:40.232 CEST

@ 50699   FATAL:  could not receive data from WAL stream: ERROR:

requested WAL segment 0000000400004C4D00000090 has already been

removed

May 21 21:23:40 d8 postgres[50645]: [17-1] 2015-05-21 21:23:40.255

CEST @ 50645   LOG:  restored log file "0000000400004C4D00000090" from

archive

May 21 21:23:40 d8 postgres[50703]: [4-1] 2015-05-21 21:23:40.268 CEST

@ 50703   LOG:  started streaming WAL from primary at 4C4D/90000000 on

timeline 4

May 21 21:23:40 d8 postgres[50703]: [5-1] 2015-05-21 21:23:40.268 CEST

@ 50703   FATAL:  could not receive data from WAL stream: ERROR:

requested WAL segment 0000000400004C4D00000090 has already been

removed

May 21 21:23:40 d8 postgres[50703]: [5-2]

...

(and so on)

So, as I understand:

- slave was started and entered restore,

- slave restored 0000000400004C4D00000090 from archive, reached

consistent recovery state

- now, it connected to master and noticed, that, it has

0000000400004C4D00000091 segment uncomplete

- and then, started yelling about missing segment (master deleted it

already after archiving) on master

Why?

It might be yelling about the WAL segment due to the delay in shipping it from master to slave.
Do you have the restore_command set up in the recovery.conf file ? do you have any automated job which is shipping WAL archives from master to slave.

Slave - if it does not find the WAL segment in master pg_xlog location, it looks for the same in the location mentioned in the restore_command. It yells if it is unable to find it.

Cannot give any concrete comments/suggestions, until recovery.conf entries are known.

I fixed it by stopping slave, doing some operations on master, copying

0000000400004C4D00000091 and 0000000400004C4D00000092 from master, and

staring the slave:

May 21 21:24:56 d8 postgres[50644]: [4-1] 2015-05-21 21:24:56.160 CEST

@ 50644   LOG:  received fast shutdown request

May 21 21:24:56 d8 postgres[50644]: [5-1] 2015-05-21 21:24:56.160 CEST

@ 50644   LOG:  aborting any active transactions

May 21 21:24:56 d8 postgres[50657]: [3-1] 2015-05-21 21:24:56.162 CEST

@ 50657   LOG:  shutting down

May 21 21:24:56 d8 postgres[50657]: [4-1] 2015-05-21 21:24:56.174 CEST

@ 50657   LOG:  database system is shut down

May 21 21:28:54 d8 postgres[53574]: [1-1] 2015-05-21 21:28:54.781 CEST

@ 53574   LOG:  loaded library "pg_stat_statements"

May 21 21:28:55 d8 postgres[53574]: [2-1] 2015-05-21 21:28:55.123 CEST

@ 53574   LOG:  ending log output to stderr

May 21 21:28:55 d8 postgres[53574]: [2-2] 2015-05-21 21:28:55.123 CEST

@ 53574   HINT:  Future log output will go to log destination

"syslog".

May 21 21:28:55 d8 postgres[53575]: [3-1] 2015-05-21 21:28:55.124 CEST

@ 53575   LOG:  database system was shut down in recovery at

2015-05-21 21:24:56 CEST

May 21 21:28:55 d8 postgres[53575]: [4-1] 2015-05-21 21:28:55.126 CEST

@ 53575   LOG:  entering standby mode

May 21 21:28:55 d8 postgres[53575]: [5-1] 2015-05-21 21:28:55.150 CEST

@ 53575   LOG:  restored log file "0000000400004C4D00000088" from

archive

May 21 21:28:55 d8 postgres[53575]: [6-1] 2015-05-21 21:28:55.209 CEST

@ 53575   LOG:  redo starts at 4C4D/88493B50

May 21 21:28:55 d8 postgres[53575]: [7-1] 2015-05-21 21:28:55.314 CEST

@ 53575   LOG:  restored log file "0000000400004C4D00000089" from

archive

May 21 21:28:55 d8 postgres[53575]: [8-1] 2015-05-21 21:28:55.512 CEST

@ 53575   LOG:  restored log file "0000000400004C4D0000008A" from

archive

May 21 21:28:55 d8 postgres[53575]: [9-1] 2015-05-21 21:28:55.831 CEST

@ 53575   LOG:  restored log file "0000000400004C4D0000008B" from

archive

May 21 21:28:56 d8 postgres[53575]: [10-1] 2015-05-21 21:28:56.508

CEST @ 53575   LOG:  restored log file "0000000400004C4D0000008C" from

archive

May 21 21:28:56 d8 postgres[53575]: [11-1] 2015-05-21 21:28:56.818

CEST @ 53575   LOG:  restored log file "0000000400004C4D0000008D" from

archive

May 21 21:28:56 d8 postgres[53575]: [12-1] 2015-05-21 21:28:56.982

CEST @ 53575   LOG:  restored log file "0000000400004C4D0000008E" from

archive

May 21 21:28:57 d8 postgres[53575]: [13-1] 2015-05-21 21:28:57.415

CEST @ 53575   LOG:  restored log file "0000000400004C4D0000008F" from

archive

May 21 21:28:57 d8 postgres[53575]: [14-1] 2015-05-21 21:28:57.790

CEST @ 53575   LOG:  restored log file "0000000400004C4D00000090" from

archive

May 21 21:28:58 d8 postgres[53575]: [15-1] 2015-05-21 21:28:58.328

CEST @ 53575   LOG:  consistent recovery state reached at

4C4D/90FFF9C8

May 21 21:28:58 d8 postgres[53574]: [3-1] 2015-05-21 21:28:58.328 CEST

@ 53574   LOG:  database system is ready to accept read only

connections

May 21 21:28:58 d8 postgres[53575]: [16-1] 2015-05-21 21:28:58.349

CEST @ 53575   LOG:  restored log file "0000000400004C4D00000091" from

archive

May 21 21:28:59 d8 postgres[53575]: [17-1] 2015-05-21 21:28:59.962

CEST @ 53575   LOG:  restored log file "0000000400004C4D00000092" from

archive

May 21 21:29:00 d8 postgres[53575]: [18-1] 2015-05-21 21:29:00.037

CEST @ 53575   LOG:  unexpected pageaddr 4C45/DC000000 in log segment

0000000400004C4D00000093, offset 0

May 21 21:29:00 d8 postgres[53640]: [4-1] 2015-05-21 21:29:00.047 CEST

@ 53640   LOG:  started streaming WAL from primary at 4C4D/92000000 on

timeline 4

What was wrong?

This error message can be ignored, did you try shipping 0000000400004C4D00000093 to slave from master's archive location and see if slave picked it up and applied ?

Regards,
Venkata B N

Fujitsu Australia