Re: Corrupt indexes on slave when using pg_bulkload on master

James Cowell <jcowell@xxxxxxxxxxxxxx> · Thu, 3 Jan 2013 16:12:50 +0000 (GMT)

Hi Jeff (and group)

Was the reproduction information sufficient?

Do I need to submit this officially as a bug or something?

At the moment I'm considering rebuilding my cluster with 9.0 to see if that works and if not then reverting back to 9.1 but loading each DB seperately.  I would really like to understand why a load of 10 sequential rows with pg_bulkload produces a corrupt index on node 2 though, it just doesn't make sense to me.

Thanks and Happy New Year!

James

  From: James Cowell <jcowell@xxxxxxxxxxxxxx>
 To: "pgsql-general@xxxxxxxxxxxxxx" <pgsql-general@xxxxxxxxxxxxxx> 
Cc: Jeff Janes <jeff.janes@xxxxxxxxx> 
 Sent: Wednesday, 19 December 2012, 13:11
 Subject: Re:  Corrupt indexes on slave when using pg_bulkload on master

Actually, scratch that.  The difference in behaviour seems to be on the optimiser which now table scans the 10 row table (which I guess it should always have done really) rather than use the index as it was in 9.1.6.

The same index corruption occurs, so the same reproduction case stands, it just needs a "set enable_seqscan=false" prior to running the selects on the slave.

Cheers,

James

  From: James Cowell <jcowell@xxxxxxxxxxxxxx>
 To: Jeff Janes <jeff.janes@xxxxxxxxx> 
Cc: "pgsql-general@xxxxxxxxxxxxxx" <pgsql-general@xxxxxxxxxxxxxx> 
 Sent: Monday, 17 December 2012, 20:42
 Subject: Re:  Corrupt indexes on slave when using pg_bulkload on master

I saw that 9.1.7 was out and it
 had another bugfix in for WAL playback.

I installed it and ran my reproduction case and it no longer corrupts the index.

I reindexed the database I load into and did a full data load and the indexes still corrupt on the slave.

It does not appear to be related to constraint violation as one table is new rows only.

I will try and put together a new reproduction case this week.

Cheers,

James

  From: James Cowell <jcowell@xxxxxxxxxxxxxx>
 To: Jeff Janes <jeff.janes@xxxxxxxxx> 
Cc: "pgsql-general@xxxxxxxxxxxxxx" <pgsql-general@xxxxxxxxxxxxxx> 
 Sent: Thursday, 13 December 2012, 12:26
 Subject: Re: [GENERAL] Corrupt indexes on slave when using pg_bulkload on master

Hi Jeff,

Thanks again for your reply.

>If there are no constraint
 violations, do you still see the problem?

Yes, I've stripped it down to an empty table with a 10 row load and the pk index on the secondary node still corrupts.

> Were there any older version on which it worked? 

I'm afraid I started on 9.1.5, I upgraded to 9.1.6 when I had the initial problem due to the bugfix in the changelog to do with corrupt indexes on the secondary node but it hasn't resolved the issue.

> Can
 you post a minimal schema and control file to reproduce the problem?

I've attached a text file with details for table, load config file etc, is that everything you would need?

Cheers,

James

  From: Jeff Janes <jeff.janes@xxxxxxxxx>
 To: James Cowell
 <jcowell@xxxxxxxxxxxxxx> 
Cc: "pgsql-general@xxxxxxxxxxxxxx" <pgsql-general@xxxxxxxxxxxxxx> 
 Sent: Monday, 10 December 2012, 16:53
 Subject: Re:  Corrupt indexes on slave when using pg_bulkload on master

On Wed, Dec 5, 2012 at 5:17 AM, James Cowell <jcowell@xxxxxxxxxxxxxx> wrote:
> I'm using pg_bulkload to load large amounts of CSV data into a postgres
> database hourly.
>
> This database is replicated to a second node.
>
> Whenever a bulk load happens the indexes on the updated tables on the
> secondary node corrupt and are unusable until a reindex is run on the
> primary node.  I get the
 error below on node
 2:
>
> ERROR: index "tablename" contains unexpected zero page at block 0
> SQL state: XX002
> Hint: Please REINDEX it.
>
> I'm assuming that this is because of the way pg_bulkload builds the index on
> the primary, and possibly has something to do with the way pg_bulkload
> overwrites rows in the event of a constraint violation,

If there are no constraint violations, do you still see the problem?

> but at the same time
> if something works on the primary shouldn't the replicated node be able to
> process the WAL log?
>
> I've tried this on 9.1.6 and 9.1.5 on RHEL 6.3 with pg_bulkload build
> 3.1.1-1.pg91.rhel6 and it happens every time.

Were there any older version on which it worked?  Can you post a
minimal schema and control file to reproduce the problem?

Cheers,

Jeff