I seem to have solved my problem, out of frustration I downloaded the source to see what I could work out (although it's a good while since I did C) and I found these comments in nbtsort-9.1.c:
* Formerly the index pages being built were kept in shared buffers, but
* that is of no value (since other backends have no interest in them yet)
* and it created locking problems for CHECKPOINT, because the upper-level
* pages were held exclusive-locked for long periods. Now we just build
* the pages in local memory and smgrwrite or smgrextend them as we finish
* them. They will need to be re-read into shared buffers on first use after
* the build finishes.
*
* Since the index will never be used unless it is completely built,
* from a crash-recovery point of view there is no need to WAL-log the
* steps of the build. After completing the index build, we can just sync
* the whole file to disk using smgrimmedsync() before exiting this module.
* This can be seen to be sufficient for crash recovery by considering that
* it's effectively equivalent to what would happen if a CHECKPOINT occurred
* just after the index build. However, it is clearly not sufficient if the
* DBA is using the WAL log for PITR or replication purposes, since another
* machine would not be able to reconstruct the index from WAL. Therefore,
* we log the completed index pages to WAL if and only if WAL archiving is
* active.
* that is of no value (since other backends have no interest in them yet)
* and it created locking problems for CHECKPOINT, because the upper-level
* pages were held exclusive-locked for long periods. Now we just build
* the pages in local memory and smgrwrite or smgrextend them as we finish
* them. They will need to be re-read into shared buffers on first use after
* the build finishes.
*
* Since the index will never be used unless it is completely built,
* from a crash-recovery point of view there is no need to WAL-log the
* steps of the build. After completing the index build, we can just sync
* the whole file to disk using smgrimmedsync() before exiting this module.
* This can be seen to be sufficient for crash recovery by considering that
* it's effectively equivalent to what would happen if a CHECKPOINT occurred
* just after the index build. However, it is clearly not sufficient if the
* DBA is using the WAL log for PITR or replication purposes, since another
* machine would not be able to reconstruct the index from WAL. Therefore,
* we log the completed index pages to WAL if and only if WAL archiving is
* active.
I enabled archive mode (which I didn't care about before as the database
only holds 36 hours of data) and the indexes seem to replicate over fine. I suppose the problem here is lack of documentation, but at least the code is well commented :)
It looks like pg_bulkload works just fine with replication so long as it's set up right.
Cheers,
James
From: James Cowell <jcowell@xxxxxxxxxxxxxx>
To: James Cowell <jcowell@xxxxxxxxxxxxxx>; "pgsql-general@xxxxxxxxxxxxxx" <pgsql-general@xxxxxxxxxxxxxx>
Cc: Jeff Janes <jeff.janes@xxxxxxxxx>
Sent: Thursday, 3 January 2013, 16:12
Subject: Re: Corrupt indexes on slave when using pg_bulkload on master
To: James Cowell <jcowell@xxxxxxxxxxxxxx>; "pgsql-general@xxxxxxxxxxxxxx" <pgsql-general@xxxxxxxxxxxxxx>
Cc: Jeff Janes <jeff.janes@xxxxxxxxx>
Sent: Thursday, 3 January 2013, 16:12
Subject: Re: Corrupt indexes on slave when using pg_bulkload on master
Hi Jeff (and group)
Was the reproduction information sufficient?
Do I need to submit this officially as a bug or something?
At the moment I'm considering rebuilding my
cluster with 9.0 to see if that works and if not then reverting back to 9.1 but loading each DB seperately. I would really like to understand why a load of 10 sequential rows with pg_bulkload produces a corrupt index on node 2 though, it just doesn't make sense to me.
Thanks and Happy New Year!
James
From: James Cowell <jcowell@xxxxxxxxxxxxxx>
To: "pgsql-general@xxxxxxxxxxxxxx" <pgsql-general@xxxxxxxxxxxxxx>
Cc: Jeff Janes <jeff.janes@xxxxxxxxx>
Sent: Wednesday, 19 December 2012, 13:11
Subject: Re: Corrupt indexes on slave when using pg_bulkload on master
To: "pgsql-general@xxxxxxxxxxxxxx" <pgsql-general@xxxxxxxxxxxxxx>
Cc: Jeff Janes <jeff.janes@xxxxxxxxx>
Sent: Wednesday, 19 December 2012, 13:11
Subject: Re: Corrupt indexes on slave when using pg_bulkload on master
Actually, scratch that. The difference in behaviour seems to be on the optimiser which now table scans the 10 row table (which I guess it should always have done really) rather than use the index as it was in 9.1.6.
The same index corruption occurs, so the same reproduction case
stands, it just needs a "set enable_seqscan=false" prior to running the selects on the slave.
Cheers,
James
From: James Cowell <jcowell@xxxxxxxxxxxxxx>
To: Jeff Janes <jeff.janes@xxxxxxxxx>
Cc: "pgsql-general@xxxxxxxxxxxxxx" <pgsql-general@xxxxxxxxxxxxxx>
Sent: Monday, 17 December 2012, 20:42
Subject: Re: [GENERAL] Corrupt indexes on slave when using pg_bulkload on master
To: Jeff Janes <jeff.janes@xxxxxxxxx>
Cc: "pgsql-general@xxxxxxxxxxxxxx" <pgsql-general@xxxxxxxxxxxxxx>
Sent: Monday, 17 December 2012, 20:42
Subject: Re: [GENERAL] Corrupt indexes on slave when using pg_bulkload on master
I saw that 9.1.7 was out and it
had another bugfix in for WAL playback.
I installed it and ran my reproduction case and it no longer corrupts the index.
I reindexed the database I load into and did a full data load and the indexes still corrupt on the slave.
It does not appear to be related to constraint violation as one table is new rows only.
I will try and put together a new reproduction case this week.
Cheers,
James
From: James Cowell <jcowell@xxxxxxxxxxxxxx>
To: Jeff Janes <jeff.janes@xxxxxxxxx>
Cc: "pgsql-general@xxxxxxxxxxxxxx" <pgsql-general@xxxxxxxxxxxxxx>
Sent: Thursday, 13 December 2012, 12:26
Subject: Re: [GENERAL] Corrupt indexes on slave when using pg_bulkload on master
To: Jeff Janes <jeff.janes@xxxxxxxxx>
Cc: "pgsql-general@xxxxxxxxxxxxxx" <pgsql-general@xxxxxxxxxxxxxx>
Sent: Thursday, 13 December 2012, 12:26
Subject: Re: [GENERAL] Corrupt indexes on slave when using pg_bulkload on master
Hi Jeff,
Thanks again for your reply.
>If there are no
constraint
violations, do you still see the problem?
Yes, I've stripped it down to an empty table with a 10 row load and the pk index on the secondary node still corrupts.
> Were there any older version on which it worked?
I'm afraid I started on 9.1.5, I upgraded to 9.1.6 when I had the initial problem due to the bugfix in the changelog to do with corrupt indexes on the secondary node but it hasn't resolved the issue.
> Can
you post a minimal schema and control file to reproduce the problem?
I've attached a text file with details for table, load config file etc, is that everything you would need?
Cheers,
James
From: Jeff Janes <jeff.janes@xxxxxxxxx>
To: James Cowell <jcowell@xxxxxxxxxxxxxx>
Cc: "pgsql-general@xxxxxxxxxxxxxx" <pgsql-general@xxxxxxxxxxxxxx>
Sent: Monday, 10 December 2012, 16:53
Subject: Re: Corrupt indexes on slave when using pg_bulkload on master
To: James Cowell <jcowell@xxxxxxxxxxxxxx>
Cc: "pgsql-general@xxxxxxxxxxxxxx" <pgsql-general@xxxxxxxxxxxxxx>
Sent: Monday, 10 December 2012, 16:53
Subject: Re: Corrupt indexes on slave when using pg_bulkload on master
On Wed, Dec 5, 2012 at 5:17 AM, James Cowell <jcowell@xxxxxxxxxxxxxx> wrote:
> I'm using pg_bulkload to load large amounts of CSV data into a postgres
> database hourly.
>
> This database is replicated to a second node.
>
> Whenever a bulk load happens the indexes on the updated tables on the
> secondary node corrupt and are unusable until a reindex is run on the
> primary node. I get the error below on node 2:
>
> ERROR: index "tablename" contains unexpected zero page at block 0
> SQL state: XX002
> Hint: Please REINDEX it.
>
> I'm assuming that this is because of the way pg_bulkload builds the index on
> the primary, and possibly has something to do with the way pg_bulkload
> overwrites rows in the event of a constraint violation,
If there are no constraint violations, do you still see the problem?
> but at the same time
> if something works on the primary shouldn't the replicated node be able to
> process the WAL log?
>
> I've tried this on 9.1.6 and 9.1.5 on RHEL 6.3 with pg_bulkload build
> 3.1.1-1.pg91.rhel6 and it happens every time.
Were there any older version on which it worked? Can you post a
minimal schema and control file to reproduce the problem?
Cheers,
Jeff