Re: Tracking down log segment corruption

Gordon Shannon <gordo169@xxxxxxxxx> · Sun, 2 May 2010 11:55:26 -0600

On Sun, May 2, 2010 at 11:02 AM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:

Hmm ... AFAICS the only way to get that message when the incoming TID's

offsetNumber is only 2 is for the index page to be completely empty

(not zeroes, else PageAddItem's sanity check would have triggered,

but valid and empty).  What that smells like is a software bug, like

failing to emit a WAL record in a case where it was necessary.  Can you

identify which index this was?  (Look for relfilenode 48778276 in the

database with OID 22362.)  If so, can you give us any hints about

unusual things that might have been done with that index?

Interesting. There is no pg_class entry for 22362.  There is, however, an entry for that filenode. It's an index I created Sat AM, about 6AM.

select oid, * from pg_class where relfilenode=48778276;
-[ RECORD 1 ]--+---------------------------
oid            | 48777488
relname        | cts_20100501_topic_date_nk
relnamespace   | 2200
reltype        | 0

relowner       | 16412
relam          | 403
relfilenode    | 48778276
reltablespace  | 48777166
relpages       | 2476
reltuples      | 58879
reltoastrelid  | 0
reltoastidxid  | 0
relhasindex    | f

relisshared    | f
relistemp      | f
relkind        | i
relnatts       | 2
relchecks      | 0
relhasoids     | f
relhaspkey     | f
relhasrules    | f
relhastriggers | f
relhassubclass | f
relfrozenxid   | 0

relacl         | null
reloptions     | null

Possibly relevant facts:

- The WSB server went active on Friday around 3:30PM
- On Friday evening, I added about 11 tablespaces.  I noted the new files on the WSB, no problems.

- On Sat morning, I created a partitioned table cts_20100501 (inherits from another table) and 4 indexes.
- This morning, I was doing some table maintenance on the master and discovered I had created this table and its indexes in the wrong tablespace.

I wanted the table in ts29, but had it in ts30.  Vice versa for the indexes.  So I moved them. This is from my command history:

alter index cts_20100501_natural_uk set tablespace ts30;
alter index cts_20100501_pkey set tablespace ts30;

alter index cts_20100501_topic_date_nk set tablespace ts30;
alter index cts_20100501_updated_nk set tablespace ts30;
alter table cts_20100501 set tablespace ts29;

These commands worked fine on the master, yet this seems suspiciously relevant.

> Any suggestions?

As far as recovering goes, there's probably not much you can do except

resync the standby from scratch.  But it would be nice to get to the

bottom of the problem, so that we can fix the bug.  Have you got an

archive of this xlog segment and the ones before it, and would you be

willing to let a developer look at them?

Before I received your reply, I had already started the re-sync, and unfortunately already deleted the wal logs in question.
If it happens again, I will certainly keep them, and would be happy to share them.

Regards,
Gordon