I am experiencing two problems
1) pg_stop_backup hangs until pg_basebackup finishes
1) pg_stop_backup hangs until pg_basebackup finishes
2) WAL contains references to invalid pages
(I suspect when 1. is fixed I won't experience 2. )
Full description below:
postgresql: 9.3.13
My situation is I create a writable report database every day using pg_start_backup() / rsync / pg_stop_backup()
(whole process takes 1-2 hours, the rsync actually copies the data dir of a slave/warm spare DB)
Also once a week I create a database for backup/archive purposes using pg_basebackup
(whole process takes about 13 hours)
These two processes used to be able to coincide until recently.
Recent changes include a major debian upgrade and a minor version of postgres upgrade from 9.3.10.
Recent changes include a major debian upgrade and a minor version of postgres upgrade from 9.3.10.
Now when the report sync occurs during the pg_basebackup (w/ --xlog-method=stream option) the pg_stop_backup() hangs until
the the pg_basebackup has completed (4 hours later).
A labeled WAL backup file is created when the pg_stop_backup() is first executed and another
is created when the pg_basebackup completes.
While the pg_stop_backup() hangs the following appears in the logs:
2016-06-11 07:50:45 EDT: WARNING: pg_stop_backup still waiting for all required WAL segments to be archived (7680 seconds elapsed)
2016-06-11 07:50:45 EDT: HINT: Check that your archive_command is executing properly. pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.
2016-06-11 07:50:45 EDT: HINT: Check that your archive_command is executing properly. pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.
until eventually:
2016-06-11 12:59:07 EDT:LOG: duration: 26190082.497 ms statement: SELECT pg_stop_backup()
The archive command appears to work as WAL files are being archived successfully with timestamps that occur while pg_stop_backup is hanging.
-rw------- 1 postgres postgres 314 Jun 11 12:56 000000030000109200000066.00000028.backup
-rw------- 1 postgres postgres 16M Jun 11 12:56 000000030000109300000029
-rw------- 1 postgres postgres 16M Jun 11 12:54 000000030000109300000028
-rw------- 1 postgres postgres 16M Jun 11 12:52 000000030000109300000027
-rw------- 1 postgres postgres 314 Jun 11 12:56 000000030000109200000066.00000028.backup
-rw------- 1 postgres postgres 16M Jun 11 12:56 000000030000109300000029
-rw------- 1 postgres postgres 16M Jun 11 12:54 000000030000109300000028
-rw------- 1 postgres postgres 16M Jun 11 12:52 000000030000109300000027
<snip>
-rw------- 1 postgres postgres 16M Jun 11 06:22 0000000300001092000000F1
-rw------- 1 postgres postgres 335 Jun 11 05:42 0000000300001092000000D4.00000028.backup
-rw------- 1 postgres postgres 16M Jun 11 05:42 0000000300001092000000F0
-rw------- 1 postgres postgres 16M Jun 11 06:22 0000000300001092000000F1
-rw------- 1 postgres postgres 335 Jun 11 05:42 0000000300001092000000D4.00000028.backup
-rw------- 1 postgres postgres 16M Jun 11 05:42 0000000300001092000000F0
Content of 0000000300001092000000D4.00000028.backup:
START WAL LOCATION: 1092/D4000028 (file 0000000300001092000000D4)
STOP WAL LOCATION: 1092/F008EA10 (file 0000000300001092000000F0)
CHECKPOINT LOCATION: 1092/D4004E08
BACKUP METHOD: pg_start_backup
BACKUP FROM: master
START TIME: 2016-06-11 04:51:04 EDT
LABEL: reporting 2016-06-11 04:51:03.892804-04
STOP TIME: 2016-06-11 05:42:37 EDT
START WAL LOCATION: 1092/D4000028 (file 0000000300001092000000D4)
STOP WAL LOCATION: 1092/F008EA10 (file 0000000300001092000000F0)
CHECKPOINT LOCATION: 1092/D4004E08
BACKUP METHOD: pg_start_backup
BACKUP FROM: master
START TIME: 2016-06-11 04:51:04 EDT
LABEL: reporting 2016-06-11 04:51:03.892804-04
STOP TIME: 2016-06-11 05:42:37 EDT
Content of 000000030000109200000066.00000028.backup:
START WAL LOCATION: 1092/66000028 (file 000000030000109200000066)
STOP WAL LOCATION: 1093/29555228 (file 000000030000109300000029)
CHECKPOINT LOCATION: 1092/66005BF8
BACKUP METHOD: streamed
BACKUP FROM: master
START TIME: 2016-06-11 00:04:51 EDT
LABEL: pg_basebackup base backup
STOP TIME: 2016-06-11 12:56:28 EDT
START WAL LOCATION: 1092/66000028 (file 000000030000109200000066)
STOP WAL LOCATION: 1093/29555228 (file 000000030000109300000029)
CHECKPOINT LOCATION: 1092/66005BF8
BACKUP METHOD: streamed
BACKUP FROM: master
START TIME: 2016-06-11 00:04:51 EDT
LABEL: pg_basebackup base backup
STOP TIME: 2016-06-11 12:56:28 EDT
Then when the report database finally starts and begins to read in all the archived WAL files it finally fails with the following log msgs:
2016-06-11 13:01:17 EDT:(): LOG: restored log file "000000030000109300000015" from archive
2016-06-11 13:01:18 EDT:(): LOG: restored log file "000000030000109300000016" from archive
2016-06-11 13:01:18 EDT:(): WARNING: page 32474 of relation base/16430/8122214 is uninitialized
2016-06-11 13:01:18 EDT:(): CONTEXT: xlog redo visible: rel 1663/16430/8122214; blk 32474
2016-06-11 13:01:18 EDT:(): PANIC: WAL contains references to invalid pages
2016-06-11 13:01:18 EDT:(): CONTEXT: xlog redo visible: rel 1663/16430/8122214; blk 32474
2016-06-11 13:01:18 EDT:(): LOG: startup process (PID 17933) was terminated by signal 6: Aborted
2016-06-11 13:01:18 EDT:(): LOG: terminating any other active server processes
2016-06-11 13:01:17 EDT:(): LOG: restored log file "000000030000109300000015" from archive
2016-06-11 13:01:18 EDT:(): LOG: restored log file "000000030000109300000016" from archive
2016-06-11 13:01:18 EDT:(): WARNING: page 32474 of relation base/16430/8122214 is uninitialized
2016-06-11 13:01:18 EDT:(): CONTEXT: xlog redo visible: rel 1663/16430/8122214; blk 32474
2016-06-11 13:01:18 EDT:(): PANIC: WAL contains references to invalid pages
2016-06-11 13:01:18 EDT:(): CONTEXT: xlog redo visible: rel 1663/16430/8122214; blk 32474
2016-06-11 13:01:18 EDT:(): LOG: startup process (PID 17933) was terminated by signal 6: Aborted
2016-06-11 13:01:18 EDT:(): LOG: terminating any other active server processes
While the PANIC is very real problem I suspect it will not occur if the stop_backup() stops hanging, since this does not occur
during the syncs from the rest of the week.
Should I be be able to use those two styles of backup at the same time?
Should I replace the start_backup/stop_backup with separate calls to create a checkpoint and switch transaction log files?
Any guidance is appreciated.
Best,
Alex