Re: pg_dump error... Follow up

Tom Lane <tgl@xxxxxxxxxxxxx> · Wed, 07 Sep 2005 13:06:26 -0400

Adam Witney <awitney@xxxxxxxxxx> writes:
> I think I have found the offending row in measured_bioassay_base... The
> entry in its toast table looks like this

> bugasbase2=# select chunk_id, chunk_seq, length(chunk_data) from
> pg_toast.pg_toast_134401982 where chunk_id = 144391872;
>  chunk_id  | chunk_seq | length
> -----------+-----------+--------
>  144391872 |         0 |   1998
>  144391872 |         1 |   1998
>  144391872 |         2 |   1998
>  144391872 |         3 |   1998
>  144391872 |         4 |   1998
>    7625296 |      3292 |     24
>    7625297 |      3292 |     24
>    7625298 |      3292 |     24
>    7625299 |      3292 |     24
>    7625308 |      3292 |     19

This is pretty wacko, because as far as I can see there is nothing wrong
with the index at all --- in particular, no gap in the chunk id/seq
series here, according to pg_filedump.  (The last two 32-bit words of
each index item are the chunk_id and chunk_seq, and we are looking at
LSB-first data, so c03e9b08 reads as 0x089b3ec0 = 144391872.)

 Item 102 -- Length:   16  Offset: 6544 (0x1990)  Flags: USED
  Block Id: 165390  linp Index: 4  Size: 16
  Has Nulls: 0  Has Varwidths: 0

  1990: 02000e86 04001000 c03e9b08 00000000  .........>......

 Item 103 -- Length:   16  Offset: 6528 (0x1980)  Flags: USED
  Block Id: 165391  linp Index: 1  Size: 16
  Has Nulls: 0  Has Varwidths: 0

  1980: 02000f86 01001000 c03e9b08 01000000  .........>......

 Item 104 -- Length:   16  Offset: 6512 (0x1970)  Flags: USED
  Block Id: 165391  linp Index: 2  Size: 16
  Has Nulls: 0  Has Varwidths: 0

  1970: 02000f86 02001000 c03e9b08 02000000  .........>......

 Item 105 -- Length:   16  Offset: 6496 (0x1960)  Flags: USED
  Block Id: 165391  linp Index: 3  Size: 16
  Has Nulls: 0  Has Varwidths: 0

  1960: 02000f86 03001000 c03e9b08 03000000  .........>......

 Item 106 -- Length:   16  Offset: 6480 (0x1950)  Flags: USED
  Block Id: 165391  linp Index: 4  Size: 16
  Has Nulls: 0  Has Varwidths: 0

  1950: 02000f86 04001000 c03e9b08 04000000  .........>......

 Item 107 -- Length:   16  Offset: 6464 (0x1940)  Flags: USED
  Block Id: 165392  linp Index: 1  Size: 16
  Has Nulls: 0  Has Varwidths: 0

  1940: 02001086 01001000 c03e9b08 05000000  .........>......

 Item 108 -- Length:   16  Offset: 6448 (0x1930)  Flags: USED
  Block Id: 165392  linp Index: 2  Size: 16
  Has Nulls: 0  Has Varwidths: 0

  1930: 02001086 02001000 c03e9b08 06000000  .........>......

 Item 109 -- Length:   16  Offset: 6432 (0x1920)  Flags: USED
  Block Id: 165392  linp Index: 3  Size: 16
  Has Nulls: 0  Has Varwidths: 0

  1920: 02001086 03001000 c03e9b08 07000000  .........>......

 Item 110 -- Length:   16  Offset: 6416 (0x1910)  Flags: USED
  Block Id: 165392  linp Index: 4  Size: 16
  Has Nulls: 0  Has Varwidths: 0

  1910: 02001086 04001000 c03e9b08 08000000  .........>......
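
For anyone who wants to check the byte-swapping by machine rather than
by eye, here is a minimal sketch of the decoding.  It assumes each item
is exactly the 16 bytes shown: an 8-byte index tuple header (block
number high half, low half, line pointer, info word) followed by
chunk_id and chunk_seq, all little-endian.  The names decode_item and
find_gaps are made up for illustration and are not part of
pg_filedump.

```python
import struct

# Assumed layout of one 16-byte item from the dump above, LSB-first:
# block hi, block lo, line pointer, info word, chunk_id (oid), chunk_seq (int4).
ITEM = struct.Struct("<HHHHIi")


def decode_item(hex_words):
    """Decode the four hex words of one dump line (offset and ASCII column removed)."""
    raw = bytes.fromhex(hex_words.replace(" ", ""))
    bi_hi, bi_lo, posid, t_info, chunk_id, chunk_seq = ITEM.unpack(raw)
    return {
        "block": (bi_hi << 16) | bi_lo,   # heap block the entry points at
        "linp": posid,                    # line pointer within that block
        "size": t_info & 0x1FFF,          # low 13 bits of the info word
        "chunk_id": chunk_id,
        "chunk_seq": chunk_seq,
    }


def find_gaps(items):
    """Return (prev_seq, cur_seq) pairs that are not consecutive within one chunk_id."""
    gaps = []
    for prev, cur in zip(items, items[1:]):
        same_value = prev["chunk_id"] == cur["chunk_id"]
        if same_value and cur["chunk_seq"] != prev["chunk_seq"] + 1:
            gaps.append((prev["chunk_seq"], cur["chunk_seq"]))
    return gaps


# Hex words copied from items 102 through 110 above.
DUMP = [
    "02000e86 04001000 c03e9b08 00000000",
    "02000f86 01001000 c03e9b08 01000000",
    "02000f86 02001000 c03e9b08 02000000",
    "02000f86 03001000 c03e9b08 03000000",
    "02000f86 04001000 c03e9b08 04000000",
    "02001086 01001000 c03e9b08 05000000",
    "02001086 02001000 c03e9b08 06000000",
    "02001086 03001000 c03e9b08 07000000",
    "02001086 04001000 c03e9b08 08000000",
]

items = [decode_item(line) for line in DUMP]

for item in items:
    print(item)
print("gaps:", find_gaps(items))
```

The first item decodes to block 165390, line pointer 4, size 16,
chunk_id 144391872, chunk_seq 0, which agrees with the "Block Id",
"linp Index" and "Size" fields pg_filedump prints for Item 102.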

What I think is going on is that you have a corrupted copy of this index
page sitting in shared buffers.  It's not been marked dirty, else it'd
have been written out by a checkpoint and we could see the damage with
pg_filedump.  This implies that the problem would go away by itself if
you simply shut down and restarted the database, since the good on-disk
copy would then be read back in.

However, since you've been so helpful so far ;-) I wonder whether there
is any chance of getting an image of the bad page in memory before you
do that.  We are presumably looking at the results of an unintentional
memory stomp somewhere in the code, and it's possible that a dump of the
damaged page would give a clue how it happened.

The only simple way I can think of for getting that info involves gdb,
but I don't suppose you are using a copy built with --enable-debug ...
anyone have other ideas?

			regards, tom lane