Alan,

Looks like Postgres gets sensible scan rate scaling as the filesystem speed
increases, as shown below.  I'll drop my 120MB/s observation - perhaps CPUs
got faster since I last tested this.

The scaling looks like 64% of the I/O subsystem speed is available to the
executor - so as the I/O subsystem increases in scan rate, so does Postgres'
executor scan speed.

So that leaves the question - why not more than 64% of the I/O scan rate?
And why is it a flat 64% as the I/O subsystem increases in speed from
333-400MB/s?

- Luke

================= Results ===================

Unless noted otherwise all results posted are for block device readahead set
to 16M using "blockdev --setra=16384 <block_device>".  All are using the
2.6.9-11 Centos 4.1 kernel.

For those who don't have lmdd, here is a comparison of two results on an
ext2 filesystem:

============================================================================
[root@modena1 dbfast1]# time bash -c "(dd if=/dev/zero of=/dbfast1/bigfile bs=8k count=800000 && sync)"
800000+0 records in
800000+0 records out

real    0m33.057s
user    0m0.116s
sys     0m13.577s

[root@modena1 dbfast1]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k count=800000 sync=1
6553.6000 MB in 31.2957 secs, 209.4092 MB/sec

real    0m33.032s
user    0m0.087s
sys     0m13.129s
============================================================================

So lmdd with sync=1 is equivalent to a sync after a dd.

I use 2x memory with dd for the *READ* performance testing, but let's make
sure things are synced on both write and read for this set of comparisons.
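
For anyone without lmdd who wants a comparable MB/sec figure out of plain dd,
the arithmetic is just block count times block size over elapsed time.  A
minimal sketch follows; the function name is made up for illustration, and it
assumes 1 MB = 10^6 bytes, which is what lmdd's "6553.6000 MB" total for
800000 8k blocks implies.

```python
BLOCK_BYTES = 8 * 1024  # bs=8k, as in the dd and lmdd commands above


def throughput_mb_per_s(blocks, seconds, block_bytes=BLOCK_BYTES):
    """Throughput in MB/s, with 1 MB = 10**6 bytes (assumed to match lmdd)."""
    return blocks * block_bytes / 1e6 / seconds


# dd run: wall-clock time from "time", which includes the trailing sync.
dd_rate = throughput_mb_per_s(800000, 33.057)

# lmdd run: the elapsed time lmdd itself reported.
lmdd_rate = throughput_mb_per_s(800000, 31.2957)

print(f"dd (wall clock): {dd_rate:.1f} MB/s")
print(f"lmdd (reported): {lmdd_rate:.1f} MB/s")
```

The two rates differ only because lmdd reports its own elapsed time, while
the dd figure uses the wall-clock time from "time"; the wall-clock times of
the two runs are nearly identical.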

First, let's test ext2 versus "ext3, data=ordered", versus xfs:

============================================================================
16GB write, then read
============================================================================
-----------------------
ext2:
-----------------------
[root@modena1 dbfast1]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k count=2000000 sync=1
16384.0000 MB in 144.2670 secs, 113.5672 MB/sec

[root@modena1 dbfast1]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k count=2000000 sync=1
16384.0000 MB in 49.3766 secs, 331.8170 MB/sec

-----------------------
ext3, data=ordered:
-----------------------
[root@modena1 ~]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k count=2000000 sync=1
16384.0000 MB in 137.1607 secs, 119.4511 MB/sec

[root@modena1 ~]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k count=2000000 sync=1
16384.0000 MB in 48.7398 secs, 336.1527 MB/sec

-----------------------
xfs:
-----------------------
[root@modena1 ~]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k count=2000000 sync=1
16384.0000 MB in 52.6141 secs, 311.3994 MB/sec

[root@modena1 ~]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k count=2000000 sync=1
16384.0000 MB in 40.2807 secs, 406.7453 MB/sec
============================================================================

I'm liking xfs!  Something about the way files are laid out, as Alan
suggested, seems to dramatically improve write performance, and perhaps
consequently the read also improves.  There doesn't seem to be a difference
between ext3 and ext2, as expected.

Now on to the Postgres 8 tests.  We'll do a 16GB table size to ensure that
we aren't reading from the read cache.  I'll write this file through
Postgres COPY to be sure that the file layout is as Postgres creates it.
The alternative would be to use COPY once, then tar/untar onto different
filesystems, but that may not duplicate the real world results.

These tests will use Bizgres 0_8_1, which is an augmented 8.0.3.
None of the augmentations act to improve the executor I/O though, so for
these purposes it should be the same as 8.0.3.

============================================================================
26GB of DBT-3 data from the lineitem table
============================================================================
llonergan=# select relpages from pg_class where relname='lineitem';
 relpages
----------
  3159138
(1 row)

3159138*8192/1000000
25879 Million Bytes, or 25.9GB

-----------------------
xfs:
-----------------------
llonergan=# \timing
Timing is on.
llonergan=# select count(1) from lineitem;
   count
-----------
 119994608
(1 row)

Time: 394908.501 ms
llonergan=# select count(1) from lineitem;
   count
-----------
 119994608
(1 row)

Time: 99425.223 ms
llonergan=# select count(1) from lineitem;
   count
-----------
 119994608
(1 row)

Time: 99187.205 ms

-----------------------
ext2:
-----------------------
llonergan=# select relpages from pg_class where relname='lineitem';
 relpages
----------
  3159138
(1 row)

llonergan=# \timing
Timing is on.
llonergan=# select count(1) from lineitem;
   count
-----------
 119994608
(1 row)

Time: 395286.475 ms
llonergan=# select count(1) from lineitem;
   count
-----------
 119994608
(1 row)

Time: 195756.381 ms
llonergan=# select count(1) from lineitem;
   count
-----------
 119994608
(1 row)

Time: 122822.090 ms

============================================================================
Analysis of Postgres 8.0.3 results
============================================================================
                                      ext2      xfs

Write Speed (MB/s)                    114       311
Read Speed (MB/s)                     332       407
Postgres Seq Scan Speed (MB/s)        212       263
Scan % of lmdd Read Speed             63.9%     64.6%

Well - looks like we get linear scaling with disk/file subsystem speedup.

- Luke
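
To reproduce the scan-rate arithmetic in the table, the sketch below derives
the sequential scan speed from relpages and the fastest (third) count(1)
timing on each filesystem.  The function name is hypothetical; the 8192-byte
page size and the 10^6-byte MB are taken from the "3159138*8192/1000000"
calculation above.

```python
PAGE_BYTES = 8192    # Postgres page size used in the relpages calculation
RELPAGES = 3159138   # relpages for lineitem, from pg_class


def scan_rate_mb_per_s(relpages, elapsed_ms, page_bytes=PAGE_BYTES):
    """Sequential scan rate in MB/s (1 MB = 10**6 bytes)."""
    return relpages * page_bytes / 1e6 / (elapsed_ms / 1000.0)


# Third count(1) run on each filesystem, in milliseconds.
xfs_scan = scan_rate_mb_per_s(RELPAGES, 99187.205)
ext2_scan = scan_rate_mb_per_s(RELPAGES, 122822.090)

# lmdd 16GB read rates measured earlier, in MB/s.
xfs_read = 406.7453
ext2_read = 331.8170

print(f"xfs:  {xfs_scan:.1f} MB/s, {100 * xfs_scan / xfs_read:.1f}% of lmdd read")
print(f"ext2: {ext2_scan:.1f} MB/s, {100 * ext2_scan / ext2_read:.1f}% of lmdd read")
```

Using the exact 25879 MB table size, this gives roughly 261 MB/s on xfs and
211 MB/s on ext2, slightly under the 263 and 212 in the table, which look as
though they were computed from the rounded 26GB figure.  Either way the ratio
to the lmdd read rate stays at about 64% on both filesystems.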