I had the opportunity to do more testing on another new server to see whether the kernel's I/O scheduling makes any difference. Conclusion: On a battery-backed RAID 10 system, the kernel's I/O scheduling algorithm has no significant effect. This makes sense, since a battery-backed cache will supersede any I/O rescheduling that the kernel tries to do.
Hardware:
  Dell 2950
  8 CPU (Intel 2GHz Xeon)
  8 GB memory
  Dell Perc 6i with battery-backed cache
  RAID 10 of 8x 146GB SAS 10K 2.5" disks

Software:
  Linux 2.6.24, 64-bit
  XFS file system
  Postgres 8.3.0
    max_connections = 1000
    shared_buffers = 2000MB
    work_mem = 256MB
    max_fsm_pages = 1000000
    max_fsm_relations = 5000
    synchronous_commit = off
    wal_buffers = 256kB
    checkpoint_segments = 30
    effective_cache_size = 4GB
Each test was run 5 times:
  drop database test
  create database test
  pgbench -i -s 20 -U test
  pgbench -c 10 -t 50000 -v -U test
The I/O scheduler was changed on-the-fly using (for example) "echo cfq >/sys/block/sda/queue/scheduler".
Autovacuum was turned off during the test.
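For anyone who wants to repeat this, the procedure above can be sketched as a small script. This is only a sketch, not the exact script I ran: the device name (sda), the use of dropdb/createdb in place of the SQL statements, and the DRY_RUN, DEVICE and RUNS variables are assumptions made for illustration. With DRY_RUN=1 (the default here) it only prints the commands; set DRY_RUN=0 and run as root to actually execute them.

```shell
#!/bin/sh
# Sketch of the benchmark loop described above.
# Assumptions (not from the original test): device is sda, dropdb/createdb
# are on PATH, and the "test" user is allowed to drop and create databases.
DEVICE=${DEVICE:-sda}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, 0 = execute them
RUNS=${RUNS:-5}

# Print a command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Switch the I/O scheduler for the device on the fly (needs root).
set_scheduler() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ echo $1 > /sys/block/$DEVICE/queue/scheduler"
    else
        echo "$1" > "/sys/block/$DEVICE/queue/scheduler"
    fi
}

# For each scheduler: rebuild the database and run pgbench RUNS times.
bench_all() {
    for sched in cfq noop deadline anticipatory; do
        set_scheduler "$sched"
        i=1
        while [ "$i" -le "$RUNS" ]; do
            run dropdb -U test test
            run createdb -U test test
            run pgbench -i -s 20 -U test
            run pgbench -c 10 -t 50000 -v -U test
            i=$((i + 1))
        done
    done
}

bench_all
```

After switching, "cat /sys/block/sda/queue/scheduler" lists the available schedulers with the active one in square brackets, which is a quick way to confirm the change took effect.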
Here are the results. The numbers are those reported as "tps = xxxx (including connections establishing)" (which were almost identical to the "excluding..." tps number).
I/O Sched      AVG  Test1  Test2  Test3  Test4  Test5
------------  ----  -----  -----  -----  -----  -----
cfq           3355   3646   3207   3132   3204   3584
noop          3163   2901   3190   3293   3124   3308
deadline      3547   3923   3722   3351   3484   3254
anticipatory  3384   3453   3916   2944   3451   3156
As you can see, the averages are very close -- closer than the "noise" between runs. As far as I can tell, there is no significant advantage, or even any significant difference, between the various I/O scheduler algorithms.
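To put a number on the "noise" claim, here is a rough way to compare the spread within each scheduler against the spread between the averages. It is a minimal sketch using awk; the summarize function name is just something made up for this example, and the data lines are copied from the results above.

```shell
#!/bin/sh
# Summarize pgbench tps per scheduler: mean, min, max and range (max - min).
# Input format: scheduler name followed by the tps of each run.
summarize() {
    awk '{
        sum = 0; min = $2; max = $2
        for (i = 2; i <= NF; i++) {
            sum += $i
            if ($i < min) min = $i
            if ($i > max) max = $i
        }
        printf "%-12s mean=%.0f min=%d max=%d range=%d\n", $1, sum / (NF - 1), min, max, max - min
    }'
}

# Data copied from the results table.
summarize <<'EOF'
cfq 3646 3207 3132 3204 3584
noop 2901 3190 3293 3124 3308
deadline 3923 3722 3351 3484 3254
anticipatory 3453 3916 2944 3451 3156
EOF
```

The run-to-run range within a single scheduler is between 407 tps (noop) and 972 tps (anticipatory), while the four averages differ by at most 384 tps (3163 for noop vs. 3547 for deadline). With only five runs each, that is not enough to separate the schedulers.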
(It also reinforces what the pgbench man page says: Short runs aren't useful. Even these two-minute runs have a lot of variability. Before I turned off autovacuum, the variability was more like 50% between runs.)
Craig