I have turned on log_checkpoints, and re-run the command. Checkpoints are being written every 220 to 360 seconds. About 80% are "time" and 20% "xlog". Here are a representative sample:
10/12/09 5:12:59 PM org.postgresql.postgres[445] LOG: checkpoint starting: time
10/12/09 5:16:41 PM org.postgresql.postgres[445] LOG: checkpoint complete: wrote 133795 buffers (25.5%); 0 transaction log file(s) added, 0 removed, 98 recycled; write=112.281 s, sync=108.809 s, total=221.166 s
10/12/09 5:17:32 PM org.postgresql.postgres[445] LOG: checkpoint starting: xlog
10/12/09 5:23:45 PM org.postgresql.postgres[445] LOG: checkpoint complete: wrote 292601 buffers (55.8%); 0 transaction log file(s) added, 0 removed, 76 recycled; write=162.919 s, sync=209.277 s, total=372.229 s
10/12/09 5:23:45 PM org.postgresql.postgres[445] LOG: checkpoint starting: time
10/12/09 5:27:58 PM org.postgresql.postgres[445] LOG: checkpoint complete: wrote 113330 buffers (21.6%); 0 transaction log file(s) added, 0 removed, 128 recycled; write=165.252 s, sync=87.588 s, total=253.285 s
10/12/09 5:28:45 PM org.postgresql.postgres[445] LOG: checkpoint starting: time
10/12/09 5:34:11 PM org.postgresql.postgres[445] LOG: checkpoint complete: wrote 206947 buffers (39.5%); 0 transaction log file(s) added, 0 removed, 65 recycled; write=152.953 s, sync=172.978 s, total=326.460 s
The above was on a RAID 0 array of Intel X-25M drives. Spec'd speed is 500MB/s read, 140MB/s write, 7000 random writes/s. Observed speed (outside postgres, not a benchmark) is 420MB/s read, 140MB/s write, 4000 random writes/s.
I have now temporarily moved the entire database (and postgres data directories) to a single 7200rpm SATA drive (no RAID, no special drivers - just an ordinary WD Caviar Black 1TB). Unfortunately the situation appears very similar (and even slower):
11/12/09 12:26:59 PM org.postgresql.postgres[1652] LOG: checkpoint starting: time
11/12/09 12:33:43 PM org.postgresql.postgres[1652] LOG: checkpoint complete: wrote 196843 buffers (37.5%); 0 transaction log file(s) added, 0 removed, 54 recycled; write=201.260 s, sync=202.874 s, total=404.224 s
11/12/09 12:33:43 PM org.postgresql.postgres[1652] LOG: checkpoint starting: time
11/12/09 12:40:15 PM org.postgresql.postgres[1652] LOG: checkpoint complete: wrote 150293 buffers (28.7%); 0 transaction log file(s) added, 0 removed, 81 recycled; write=198.739 s, sync=192.861 s, total=391.682 s
11/12/09 12:40:15 PM org.postgresql.postgres[1652] LOG: checkpoint starting: time
11/12/09 12:46:07 PM org.postgresql.postgres[1652] LOG: checkpoint complete: wrote 136752 buffers (26.1%); 0 transaction log file(s) added, 0 removed, 61 recycled; write=170.165 s, sync=181.921 s, total=352.199 s
11/12/09 12:46:07 PM org.postgresql.postgres[1652] LOG: checkpoint starting: time
11/12/09 12:51:51 PM org.postgresql.postgres[1652] LOG: checkpoint complete: wrote 146884 buffers (28.0%); 0 transaction log file(s) added, 0 removed, 57 recycled; write=187.262 s, sync=156.595 s, total=343.916 s
Some further observations:
1) The situation is a bit different during the first 15 minutes of running the huge query. There is more CPU and disk activity (a lot of reading and writing - what I would call a busy disk). But after a while (hours), it settles down to a pattern of almost zero read I/O. Write I/O is a brief burst (<2 seconds, 2000 writes/sec, 30-50 MB/s) followed by about 30 seconds of 20-60 writes/sec (500KB/s). CPU is negligible.
2) The HDD temperature is very low - identical to a neighbor that is completely idle. I can't hear any disc rattling.
3) During these ~30 second activity pauses, access to the HDD from outside of postgres stalls.
4) DTrace is reporting fsync delays of 12 to 72 seconds (30 seconds typical) from calls made by postgres. About 40% of calls return in < 10 microseconds, 50% return in 30 - 250 ms, 3% are in the seconds, and 2% are in the tens of seconds.
5) Although the checkpoint log reports say 190s of writing and 190s of sync, the activity pattern reported by Activity Monitor is 95% near idle, and 5% activity.
Stephen