Your errors look somewhat similar to a problem I reported last week (no replies thus far):
http://www.postgresql.org/message-id/CAP=oouE5niXgAO_34Q+FGq=tS6EsMHXXRRoByov3zn9pU4249w@xxxxxxxxxxxxxx

Except that in my case, no number of restarts helped.

You didn't say: were you explicitly copying $PGDATA, or using some other mechanism to migrate the data elsewhere? Also, which version of Postgres are you using?

On Fri, Jan 11, 2013 at 6:32 PM, Phil Monroe <phil@xxxxxxxxxxxxxx> wrote:
>
> Hi Everyone,
>
> So we had to fail over and do a full base backup to get our slave database back online, and ran into an interesting scenario. After copying the data directory, setting up the recovery.conf, and starting the slave database, the database crashes while replaying xlogs. However, on starting the database again, it is able to replay xlogs farther than it initially got, but ultimately ends up failing again. After starting the DB a third time, PostgreSQL replays even further and catches up to the master to start streaming replication. Is this common and/or acceptable?
>
> The error messages that cause PostgreSQL to stop replaying xlogs:
>
> ERROR #1:
> 2013-01-12 00:17:17 UTC::@:[61150]: WARNING: page 2387476480 of relation base/16408/18967399 does not exist
> 2013-01-12 00:17:17 UTC::@:[61150]: CONTEXT: xlog redo delete: index 1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
> 2013-01-12 00:17:17 UTC::@:[61150]: PANIC: WAL contains references to invalid pages
> 2013-01-12 00:17:17 UTC::@:[61150]: CONTEXT: xlog redo delete: index 1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
> 2013-01-12 00:17:18 UTC::@:[61121]: LOG: startup process (PID 61150) was terminated by signal 6: Aborted
>
> ERROR #2:
> 2013-01-12 01:02:17 UTC::@:[36831]: WARNING: page 2387607552 of relation base/16408/18967399 does not exist
> 2013-01-12 01:02:17 UTC::@:[36831]: CONTEXT: xlog redo delete: index 1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
> 2013-01-12 01:02:17 UTC::@:[36831]: PANIC: WAL contains references to invalid pages
> 2013-01-12 01:02:17 UTC::@:[36831]: CONTEXT: xlog redo delete: index 1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
> 2013-01-12 01:02:20 UTC::@:[36796]: LOG: startup process (PID 36831) was terminated by signal 6: Aborted
>
> Fortunately, these errors only pertain to indexes, which can be rebuilt. But is this a sign that the data directory on the slave is corrupt?
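If it helps, those CONTEXT lines give you the relfilenodes of the indexes involved (22892842 and 22892841, in the database with OID 16408), so you can at least see exactly which indexes would need rebuilding. A rough sketch of what I would run -- "your_db" and the index name are placeholders, and the lookup has to be done in the database whose OID is 16408:

    # The CONTEXT lines name the indexes by relfilenode, which is not
    # necessarily the same as the OID, so match on relfilenode.
    # First find which database has OID 16408, then look the indexes up there:
    psql -c "SELECT datname FROM pg_database WHERE oid = 16408;"
    psql -d your_db -c "SELECT oid::regclass AS index_name
                          FROM pg_class
                         WHERE relfilenode IN (22892841, 22892842);"

    # Rebuild whatever comes back. REINDEX has to run on the master
    # (the hot standby is read-only); the rebuilt index is WAL-logged
    # and replayed on the standby.
    psql -d your_db -c "REINDEX INDEX name_of_affected_index;"

That at least tells you what you would be rebuilding if you decide the standby's copy can't be trusted.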
>
> Here is a full timeline of the logs highlighting the issues:
>
> 1. Data Directory Copy Finishes.
> 2. Recovery.conf Setup
> 3. Start #1:
>
> 2013-01-12 00:14:23 UTC::@:[61150]: LOG: database system was shut down in recovery at 2013-01-12 00:14:06 UTC
> 2013-01-12 00:14:23 UTC:[local]:[unknown]@[unknown]:[61151]: LOG: incomplete startup packet
> 2013-01-12 00:14:23 UTC:[local]:postgres@postgres:[61152]: FATAL: the database system is starting up
> unpigz: /mnt/db/wals/00000009.history does not exist -- skipping
> 2013-01-12 00:14:23 UTC::@:[61150]: LOG: entering standby mode
> 2013-01-12 00:14:23 UTC::@:[61150]: LOG: restored log file "0000000900008E45000000B8" from archive
> 2013-01-12 00:14:23 UTC::@:[61150]: LOG: restored log file "0000000900008E450000008B" from archive
> 2013-01-12 00:14:23 UTC::@:[61150]: LOG: could not open directory "pg_snapshots": No such file or directory
> 2013-01-12 00:14:23 UTC::@:[61150]: LOG: redo starts at 8E45/8B173180
> 2013-01-12 00:14:23 UTC::@:[61150]: LOG: consistent recovery state reached at 8E45/8B174840
> 2013-01-12 00:14:23 UTC:[local]:postgres@postgres:[61175]: FATAL: the database system is starting up
> 2013-01-12 00:14:23 UTC::@:[61150]: LOG: restored log file "0000000900008E450000008C" from archive
> 2013-01-12 00:14:24 UTC::@:[61150]: LOG: restored log file "0000000900008E450000008D" from archive
>
> *SNIP*
>
> 2013-01-12 00:17:17 UTC::@:[61150]: LOG: restored log file "0000000900008E4800000066" from archive
> 2013-01-12 00:17:17 UTC::@:[61150]: LOG: restored log file "0000000900008E4800000067" from archive
> 2013-01-12 00:17:17 UTC::@:[61150]: WARNING: page 2387476480 of relation base/16408/18967399 does not exist
> 2013-01-12 00:17:17 UTC::@:[61150]: CONTEXT: xlog redo delete: index 1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
> 2013-01-12 00:17:17 UTC::@:[61150]: PANIC: WAL contains references to invalid pages
> 2013-01-12 00:17:17 UTC::@:[61150]: CONTEXT: xlog redo delete: index 1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
> 2013-01-12 00:17:18 UTC::@:[61121]: LOG: startup process (PID 61150) was terminated by signal 6: Aborted
> 2013-01-12 00:17:18 UTC::@:[61121]: LOG: terminating any other active server processes
>
> 4. PostgreSQL shuts down...
> 5. Debugging logs enabled in postgresql.conf.
> 6. Start #2:
>
> 2013-01-12 00:51:48 UTC::@:[36831]: LOG: database system was interrupted while in recovery at log time 2013-01-11 18:05:31 UTC
> 2013-01-12 00:51:48 UTC::@:[36831]: HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
> 2013-01-12 00:51:48 UTC:[local]:[unknown]@[unknown]:[36832]: LOG: incomplete startup packet
> 2013-01-12 00:51:48 UTC:[local]:postgres@postgres:[36833]: FATAL: the database system is starting up
> unpigz: /mnt/db/wals/00000009.history does not exist -- skipping
> 2013-01-12 00:51:48 UTC::@:[36831]: LOG: entering standby mode
> 2013-01-12 00:51:48 UTC::@:[36831]: LOG: restored log file "0000000900008E45000000B8" from archive
> 2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: got WAL segment from archive
> 2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: checkpoint record is at 8E45/B80AF650
> 2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: redo record is at 8E45/8B173180; shutdown FALSE
> 2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: next transaction ID: 0/552803703; next OID: 24427698
> 2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: next MultiXactId: 109027; next MultiXactOffset: 2442921
> 2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: oldest unfrozen transaction ID: 3104202601, in database 16408
> 2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: transaction ID wrap limit is 956718952, limited by database with OID 16408
> 2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: resetting unlogged relations: cleanup 1 init 0
> 2013-01-12 00:51:48 UTC::@:[36831]: LOG: could not open directory "pg_snapshots": No such file or directory
> 2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: initializing for hot standby
> 2013-01-12 00:51:48 UTC::@:[36831]: LOG: restored log file "0000000900008E450000008B" from archive
> 2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: got WAL segment from archive
> 2013-01-12 00:51:48 UTC::@:[36831]: LOG: redo starts at 8E45/8B173180
> 2013-01-12 00:51:49 UTC::@:[36831]: LOG: restored log file "0000000900008E450000008C" from archive
>
> *SNIP*
>
> 2013-01-12 00:54:45 UTC::@:[36831]: LOG: restored log file "0000000900008E4800000062" from archive
> 2013-01-12 00:54:45 UTC::@:[36831]: DEBUG: got WAL segment from archive
> 2013-01-12 00:54:45 UTC::@:[36831]: LOG: restored log file "0000000900008E4800000063" from archive
> 2013-01-12 00:54:45 UTC::@:[36831]: DEBUG: got WAL segment from archive
> 2013-01-12 00:54:45 UTC::@:[36831]: LOG: restored log file "0000000900008E4800000064" from archive
> 2013-01-12 00:54:45 UTC::@:[36831]: DEBUG: got WAL segment from archive
> 2013-01-12 00:54:46 UTC::@:[36831]: LOG: restored log file "0000000900008E4800000065" from archive
> 2013-01-12 00:54:46 UTC::@:[36831]: DEBUG: got WAL segment from archive
> 2013-01-12 00:54:46 UTC::@:[36831]: LOG: restored log file "0000000900008E4800000066" from archive
> 2013-01-12 00:54:46 UTC::@:[36831]: DEBUG: got WAL segment from archive
> 2013-01-12 00:54:46 UTC::@:[36831]: LOG: restored log file "0000000900008E4800000067" from archive
> 2013-01-12 00:54:46 UTC::@:[36831]: DEBUG: got WAL segment from archive
> 2013-01-12 00:54:46 UTC::@:[36831]: LOG: consistent recovery state reached at 8E48/67AC4E28
> 2013-01-12 00:54:46 UTC::@:[36796]: LOG: database system is ready to accept read only connections
> 2013-01-12 00:54:46 UTC::@:[36831]: LOG: restored log file "0000000900008E4800000068" from archive
> 2013-01-12 00:54:46 UTC::@:[36831]: DEBUG: got WAL segment from archive
> 2013-01-12 00:54:47 UTC::@:[36831]: LOG: restored log file "0000000900008E4800000069" from archive
> 2013-01-12 00:54:47 UTC::@:[36831]: DEBUG: got WAL segment from archive
> 2013-01-12 00:54:47 UTC::@:[36831]: LOG: restored log file "0000000900008E480000006A" from archive
> 2013-01-12 00:54:47 UTC::@:[36831]: DEBUG: got WAL segment from archive
> 2013-01-12 00:54:47 UTC::@:[36831]: LOG: restored log file "0000000900008E480000006B" from archive
> 2013-01-12 00:54:47 UTC::@:[36831]: DEBUG: got WAL segment from archive
> 2013-01-12 00:54:47 UTC::@:[36831]: LOG: restored log file "0000000900008E480000006C" from archive
>
> *SNIP*
>
> 2013-01-12 01:02:16 UTC::@:[36831]: LOG: restored log file "0000000900008E4F00000079" from archive
> 2013-01-12 01:02:16 UTC::@:[36831]: DEBUG: got WAL segment from archive
> 2013-01-12 01:02:17 UTC::@:[36831]: LOG: restored log file "0000000900008E4F0000007A" from archive
> 2013-01-12 01:02:17 UTC::@:[36831]: DEBUG: got WAL segment from archive
> 2013-01-12 01:02:17 UTC::@:[36831]: WARNING: page 2387607552 of relation base/16408/18967399 does not exist
> 2013-01-12 01:02:17 UTC::@:[36831]: CONTEXT: xlog redo delete: index 1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
> 2013-01-12 01:02:17 UTC::@:[36831]: PANIC: WAL contains references to invalid pages
> 2013-01-12 01:02:17 UTC::@:[36831]: CONTEXT: xlog redo delete: index 1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
> 2013-01-12 01:02:20 UTC::@:[36796]: LOG: startup process (PID 36831) was terminated by signal 6: Aborted
> 2013-01-12 01:02:20 UTC::@:[36796]: LOG: terminating any other active server processes
>
> 7. PostgreSQL shuts down...
> 8. Start #3:
>
> 2013-01-12 01:04:04 UTC::@:[61742]: LOG: database system was interrupted while in recovery at log time 2013-01-11 19:50:31 UTC
> 2013-01-12 01:04:04 UTC::@:[61742]: HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
> 2013-01-12 01:04:04 UTC:[local]:[unknown]@[unknown]:[61743]: LOG: incomplete startup packet
> 2013-01-12 01:04:04 UTC:[local]:postgres@postgres:[61744]: FATAL: the database system is starting up
> unpigz: /mnt/db/wals/00000009.history does not exist -- skipping
> 2013-01-12 01:04:04 UTC::@:[61742]: LOG: entering standby mode
> 2013-01-12 01:04:04 UTC::@:[61742]: LOG: restored log file "0000000900008E4A00000039" from archive
> 2013-01-12 01:04:04 UTC::@:[61742]: DEBUG: got WAL segment from archive
> 2013-01-12 01:04:04 UTC::@:[61742]: DEBUG: checkpoint record is at 8E4A/39CD4BA0
> 2013-01-12 01:04:04 UTC::@:[61742]: DEBUG: redo record is at 8E4A/19F0D210; shutdown FALSE
> 2013-01-12 01:04:04 UTC::@:[61742]: DEBUG: next transaction ID: 0/552859005; next OID: 24427698
> 2013-01-12 01:04:04 UTC::@:[61742]: DEBUG: next MultiXactId: 109221; next MultiXactOffset: 2443321
> 2013-01-12 01:04:04 UTC::@:[61742]: DEBUG: oldest unfrozen transaction ID: 3104202601, in database 16408
> 2013-01-12 01:04:04 UTC::@:[61742]: DEBUG: transaction ID wrap limit is 956718952, limited by database with OID 16408
> 2013-01-12 01:04:04 UTC::@:[61742]: DEBUG: resetting unlogged relations: cleanup 1 init 0
> 2013-01-12 01:04:04 UTC::@:[61742]: LOG: could not open directory "pg_snapshots": No such file or directory
> 2013-01-12 01:04:04 UTC::@:[61742]: DEBUG: initializing for hot standby
> 2013-01-12 01:04:05 UTC::@:[61742]: LOG: restored log file "0000000900008E4A00000019" from archive
> 2013-01-12 01:04:05 UTC::@:[61742]: DEBUG: got WAL segment from archive
> 2013-01-12 01:04:05 UTC::@:[61742]: LOG: redo starts at 8E4A/19F0D210
> 2013-01-12 01:04:05 UTC::@:[61742]: LOG: restored log file "0000000900008E4A0000001A" from archive
> 2013-01-12 01:04:05 UTC::@:[61742]: DEBUG: got WAL segment from archive
>
> *SNIP*
>
> 2013-01-12 01:09:37 UTC::@:[61742]: LOG: restored log file "0000000900008E4F00000077" from archive
> 2013-01-12 01:09:37 UTC::@:[61742]: DEBUG: got WAL segment from archive
> 2013-01-12 01:09:37 UTC::@:[61742]: LOG: restored log file "0000000900008E4F00000078" from archive
> 2013-01-12 01:09:37 UTC::@:[61742]: DEBUG: got WAL segment from archive
> 2013-01-12 01:09:37 UTC::@:[61742]: LOG: restored log file "0000000900008E4F00000079" from archive
> 2013-01-12 01:09:37 UTC::@:[61742]: DEBUG: got WAL segment from archive
> 2013-01-12 01:09:38 UTC::@:[61742]: LOG: restored log file "0000000900008E4F0000007A" from archive
> 2013-01-12 01:09:38 UTC::@:[61742]: DEBUG: got WAL segment from archive
> 2013-01-12 01:09:38 UTC::@:[61742]: LOG: consistent recovery state reached at 8E4F/7A22BD08
> 2013-01-12 01:09:38 UTC::@:[61729]: LOG: database system is ready to accept read only connections
> 2013-01-12 01:09:38 UTC::@:[61742]: LOG: restored log file "0000000900008E4F0000007B" from archive
> 2013-01-12 01:09:38 UTC::@:[61742]: DEBUG: got WAL segment from archive
> 2013-01-12 01:09:38 UTC::@:[61742]: LOG: restored log file "0000000900008E4F0000007C" from archive
> 2013-01-12 01:09:38 UTC::@:[61742]: DEBUG: got WAL segment from archive
> 2013-01-12 01:09:38 UTC::@:[61742]: LOG: restored log file "0000000900008E4F0000007D" from archive
> 2013-01-12 01:09:38 UTC::@:[61742]: DEBUG: got WAL segment from archive
> 2013-01-12 01:09:39 UTC::@:[61742]: LOG: restored log file "0000000900008E4F0000007E" from archive
>
> *SNIP*
>
> 2013-01-12 01:14:29 UTC::@:[61742]: LOG: restored log file "0000000900008E53000000D9" from archive
> 2013-01-12 01:14:29 UTC::@:[61742]: DEBUG: got WAL segment from archive
> 2013-01-12 01:14:29 UTC::@:[61742]: LOG: restored log file "0000000900008E53000000DA" from archive
> 2013-01-12 01:14:29 UTC::@:[61742]: DEBUG: got WAL segment from archive
> 2013-01-12 01:14:29 UTC::@:[61742]: LOG: restored log file "0000000900008E53000000DB" from archive
> 2013-01-12 01:14:29 UTC::@:[61742]: DEBUG: got WAL segment from archive
> 2013-01-12 01:14:30 UTC::@:[61742]: LOG: restored log file "0000000900008E53000000DC" from archive
> 2013-01-12 01:14:30 UTC::@:[61742]: DEBUG: got WAL segment from archive
> 2013-01-12 01:14:30 UTC::@:[61742]: LOG: restored log file "0000000900008E53000000DD" from archive
> 2013-01-12 01:14:30 UTC::@:[61742]: DEBUG: got WAL segment from archive
> unpigz: /mnt/db/wals/0000000900008E53000000DE does not exist -- skipping
> 2013-01-12 01:14:30 UTC::@:[61742]: LOG: unexpected pageaddr 8E4A/1000000 in log file 36435, segment 222, offset 0
> unpigz: /mnt/db/wals/0000000900008E53000000DE does not exist -- skipping
> 2013-01-12 01:14:30 UTC::@:[19504]: LOG: streaming replication successfully connected to primary
> 2013-01-12 01:18:08 UTC::@:[61772]: DEBUG: checkpoint sync: number=1 file=base/16408/22873432 time=2.538 msec
> 2013-01-12 01:18:08 UTC::@:[61772]: DEBUG: checkpoint sync: number=2 file=base/16408/18967506 time=12.054 msec
> 2013-01-12 01:18:08 UTC::@:[61772]: DEBUG: checkpoint sync: number=3 file=base/16408/18967506_fsm time=0.095 msec
> 2013-01-12 01:18:08 UTC::@:[61772]: DEBUG: checkpoint sync: number=4 file=base/16408/22873244 time=0.144 msec
> 2013-01-12 01:18:08 UTC::@:[61772]: DEBUG: checkpoint sync: number=5 file=base/16408/22892823 time=0.087 msec
>
> 9. Slave DB connected to streaming replication with Master DB and all seems fine.
>
> Any help would be appreciated. Thanks!
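Coming back to my question about the copy mechanism: by "explicitly copying $PGDATA" I mean a low-level file-system base backup along these lines. This is only a sketch -- the host name, paths, and exclusions are placeholders, not what you actually ran:

    # On the master: enter backup mode, copy the data directory to the
    # standby, then leave backup mode.
    psql -c "SELECT pg_start_backup('standby_rebuild', true);"
    rsync -a --delete --exclude=pg_xlog/ --exclude=postmaster.pid \
        "$PGDATA"/ standby-host:/path/to/standby/pgdata/
    psql -c "SELECT pg_stop_backup();"

pg_basebackup (available from 9.1 on) does essentially the same thing in one step. Knowing which route you took, and which PostgreSQL version this is, would make it easier to compare with my case.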
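And for completeness, this is roughly the kind of recovery.conf I would expect on the standby for an archive-restore-plus-streaming setup like the one your logs show. The restore_command and connection string here are guesses for illustration only -- your logs suggest you decompress WAL with unpigz from /mnt/db/wals, so your actual command will differ:

    # Minimal 9.x-style recovery.conf: restore from the WAL archive,
    # then switch to streaming replication. Host/user are placeholders.
    cat > "$PGDATA"/recovery.conf <<'EOF'
    standby_mode = 'on'
    restore_command = 'cp /mnt/db/wals/%f %p'
    primary_conninfo = 'host=master.example.com port=5432 user=replication'
    EOF

If your setup differs from that in any interesting way, it would be useful to know for pinning down what we might have in common.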