
Re: bug, bad memory, or bad disk?


 



On Fri, Feb 15, 2013 at 8:08 AM, Amit Kapila <amit.kapila@xxxxxxxxxx> wrote:
> On Friday, February 15, 2013 1:33 AM Ben Chobot wrote:
>
>> 2013-02-13T23:13:18.042875+00:00 pgdb18-vpc postgres[20555]: [76-1] ERROR:  invalid memory alloc request size 1968078400
>> 2013-02-13T23:13:18.956173+00:00 pgdb18-vpc postgres[23880]: [58-1] ERROR:  invalid page header in block 2948 of relation pg_tblspc/16435/PG_9.1_201105231/188417/56951641
>> 2013-02-13T23:13:19.025971+00:00 pgdb18-vpc postgres[25027]: [36-1] ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
>> 2013-02-13T23:13:19.847422+00:00 pgdb18-vpc postgres[28333]: [8-1] ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
>> 2013-02-13T23:13:19.913595+00:00 pgdb18-vpc postgres[28894]: [8-1] ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
>> 2013-02-13T23:13:20.043527+00:00 pgdb18-vpc postgres[20917]: [72-1] ERROR:  invalid memory alloc request size 1968078400
>> 2013-02-13T23:13:21.548259+00:00 pgdb18-vpc postgres[23318]: [54-1] ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
>> 2013-02-13T23:13:28.405529+00:00 pgdb18-vpc postgres[28055]: [12-1] ERROR:  invalid page header in block 38887 of relation pg_tblspc/16435/PG_9.1_201105231/188417/58206627
>> 2013-02-13T23:13:29.199447+00:00 pgdb18-vpc postgres[25513]: [46-1] ERROR:  invalid page header in block 2368 of relation pg_tblspc/16435/PG_9.1_201105231/188417/60418945
>
>> There didn't seem to be much correlation to which files were affected,
>> and this was a critical server, so once we realized a simple reindex
>> wasn't going to solve things, we shut it down and brought up a slave as
>> the new master db.
>
>> While that seemed to fix these issues, we soon noticed problems with
>> missing clog files. The missing clogs were outside the range of the
>> existing clogs, so we tried using dummy clog files. It didn't help, and
>> running pg_check we found that one block of one table was definitely
>> corrupt. Worse, that corruption had spread to all our replicas.
>
> Can you check whether the corrupted block is from one of the relations
> mentioned in your errors? This is just to reconfirm.
>
>> I know this is a little sparse on details, but my questions are:
>
>> 1. What kind of fault should I be looking to fix? Because it spread to
>> all the replicas, both those that stream and those that replicate by
>> replaying wals in the wal archive, I assume it's not a storage issue.
>> (My understanding is that streaming replicas stream their changes from
>> memory, not from wals.)
>
>   Streaming replication streams its changes from the WAL.

Yeah.  This smells like disk corruption to me, but it really could be
anything.  Unfortunately it can spread to the replicas, especially if
you're not timely about taking the master down.  Page checksums (a
proposed feature) are one way of dealing with this problem.
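
To reconfirm which relation the bad block belongs to (per Amit's
suggestion above), the filenode in those error paths (58206627 and
friends) can usually be mapped back to a table or index with a pg_class
lookup, run while connected to the database whose OID appears in the
path (188417 here). A minimal sketch, assuming psycopg2 and a
placeholder connection string:

    # Map a filenode from the error paths back to a relation name.
    # The DSN and filenode below are placeholders; for ordinary tables and
    # indexes pg_class.relfilenode matches the on-disk file name, though a
    # handful of mapped catalogs won't show up this way.
    import psycopg2

    conn = psycopg2.connect("dbname=yourdb")  # placeholder DSN
    cur = conn.cursor()
    cur.execute("SELECT oid::regclass FROM pg_class WHERE relfilenode = %s",
                (58206627,))
    print(cur.fetchall())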

The biggest issue is the missing clog files -- did you have more than
one replica? Were they missing on all of them?
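
On the dummy clog files you tried: a common stopgap is to create each
missing 256kB segment (32 x 8kB pages) filled with 0x55 bytes, which
marks every transaction it covers as committed. That only hides missing
transaction status, it doesn't repair anything, so keep copies of
whatever you overwrite. A rough sketch, with the segment name and data
directory as placeholders:

    # Write a dummy pg_clog segment: 32 pages x 8kB, every byte 0x55
    # (four "committed" statuses per byte). "0123" and the directory are
    # placeholders for whichever segment is actually missing.
    with open("/path/to/datadir/pg_clog/0123", "wb") as f:
        f.write(b"\x55" * 256 * 1024)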

merlin


-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

