Re: Standby stopped working after PANIC: WAL contains references to invalid pages

Dan Kogan <dan@xxxxxxxxxx> · Mon, 24 Jun 2013 09:44:09 -0400

We have backed up $PGDATA, but had to re-initialize the slave.
We also have the WALs from the day this happened.

Thanks,
Dan

-----Original Message-----
From: Lonni J Friedman [mailto:netllama@xxxxxxxxx] 
Sent: Saturday, June 22, 2013 10:09 PM
To: Dan Kogan
Cc: pgsql-general@xxxxxxxxxxxxxx
Subject: Re:  Standby stopped working after PANIC: WAL contains references to invalid pages

Assuming that you still have $PGDATA from the broken instance (such that you can reproduce the crash again), there might be a way to debug it further.  I'd guess that something like bad RAM or storage could cause an index to get corrupted in this fashion, but the fact that you're using AWS makes that less likely.  Someone far more knowledgeable than I will need to provide guidance on how to debug this though.
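
As a possible starting point, here is a minimal Python sketch that pulls the identifiers out of the "xlog redo vacuum" CONTEXT line quoted further down and builds a catalog query for the affected relation. The helper names (parse_vacuum_context, lookup_query) are hypothetical, and reading the three numbers after "rel" as tablespace OID / database OID / relfilenode is an assumption based on how they line up with the pg_tblspc path in the WARNING line. It does not diagnose the corruption; it only makes the numbers easier to work with.

```python
import re

# CONTEXT line from the standby log, with the mail line-wrap removed.
CONTEXT_LINE = ("xlog redo vacuum: rel 16447/16448/39154429; "
                "blk 158134, lastBlockVacuumed 158129")
# Page number reported as uninitialized in the preceding WARNING line.
WARNED_PAGE = 158130

# Assumption: the triple after "rel" is tablespace/database/relfilenode,
# which matches pg_tblspc/16447/PG_9.2_201204301/16448/39154429.
PATTERN = re.compile(
    r"rel (?P<tablespace>\d+)/(?P<database>\d+)/(?P<relfilenode>\d+); "
    r"blk (?P<blk>\d+), lastBlockVacuumed (?P<last_vacuumed>\d+)"
)


def parse_vacuum_context(line):
    """Return the numeric fields of a vacuum redo CONTEXT line as a dict."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a vacuum redo CONTEXT line: %r" % line)
    return {name: int(value) for name, value in match.groupdict().items()}


def lookup_query(relfilenode):
    """Build a pg_class lookup to run in the database the file belongs to."""
    return ("SELECT relname, relkind FROM pg_class "
            "WHERE relfilenode = %d;" % relfilenode)


info = parse_vacuum_context(CONTEXT_LINE)
print(info)
print(lookup_query(info["relfilenode"]))
# Does the uninitialized page sit between lastBlockVacuumed and blk?
print(info["last_vacuumed"] < WARNED_PAGE < info["blk"])
```

The query would need to be run in the database whose OID is 16448. Note that rebuilding the index gives it a new relfilenode, so the lookup may return no rows on a system where the rebuild has already happened.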

On Sat, Jun 22, 2013 at 4:17 PM, Dan Kogan <dan@xxxxxxxxxx> wrote:
> Re-seeding the standby with a full base backup does seem to make the error go away.
> The standby started, caught up and has been working for about 2 hours.
>
> The file in the error message was an index.  We rebuilt it just in case.
> Is there any way to debug the issue at this point?
>
>
>
> -----Original Message-----
> From: Lonni J Friedman [mailto:netllama@xxxxxxxxx]
> Sent: Saturday, June 22, 2013 4:11 PM
> To: Dan Kogan
> Cc: pgsql-general@xxxxxxxxxxxxxx
> Subject: Re:  Standby stopped working after PANIC: WAL contains references to invalid pages
>
> Looks like some kind of data corruption.  Question is whether it came from the master, or was created by the standby.  If you re-seed the standby with a full (base) backup, does the problem go away?
>
> On Sat, Jun 22, 2013 at 12:43 PM, Dan Kogan <dan@xxxxxxxxxx> wrote:
>> Hello,
>>
>>
>>
>> Today our standby instance stopped working with this error in the log:
>>
>>
>>
>> 2013-06-22 16:27:32 UTC [8367]: [247-1] [] WARNING:  page 158130 of relation pg_tblspc/16447/PG_9.2_201204301/16448/39154429 is uninitialized
>> 2013-06-22 16:27:32 UTC [8367]: [248-1] [] CONTEXT:  xlog redo vacuum: rel 16447/16448/39154429; blk 158134, lastBlockVacuumed 158129
>> 2013-06-22 16:27:32 UTC [8367]: [249-1] [] PANIC:  WAL contains references to invalid pages
>> 2013-06-22 16:27:32 UTC [8367]: [250-1] [] CONTEXT:  xlog redo vacuum: rel 16447/16448/39154429; blk 158134, lastBlockVacuumed 158129
>> 2013-06-22 16:27:32 UTC [8366]: [3-1] [] LOG:  startup process (PID 8367) was terminated by signal 6: Aborted
>> 2013-06-22 16:27:32 UTC [8366]: [4-1] [] LOG:  terminating any other active server processes
>>
>>
>>
>> After a restart, the exact same error occurred.
>>
>>
>>
>> We thought we might have hit this bug:
>> http://postgresql.1045698.n5.nabble.com/Completely-broken-replica-after-PANIC-WAL-contains-references-to-invalid-pages-td5750072.html
>>
>> However, there is nothing in our log about sub-transactions, so it
>> does not look like the same problem to us.
>>
>>
>>
>> Any advice on how to further debug this so we can avoid this in the 
>> future is appreciated.
>>
>>
>>
>> Environment:
>>
>>
>>
>> AWS, High I/O instance (hi1.4xlarge), 60GB RAM
>>
>>
>>
>> Software and settings:
>>
>>
>>
>> PostgreSQL 9.2.4 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2, 64-bit
>>
>>
>>
>> archive_command               rsync -a %p slave:/var/lib/postgresql/replication_load/%f
>> archive_mode                  on
>> autovacuum_freeze_max_age     1000000000
>> autovacuum_max_workers        6
>> checkpoint_completion_target  0.9
>> checkpoint_segments           128
>> checkpoint_timeout            30min
>> default_text_search_config    pg_catalog.english
>> hot_standby                   on
>> lc_messages                   en_US.UTF-8
>>
