Hi,

we (Q-Leap Networks) are in the process of setting up a high-speed
storage cluster and are having trouble getting proper performance. Our
test system has two dual-core CPUs and two dual-channel UW SCSI
controllers connected to two external RAID boxes; as the speed test we
run iozone with 16 GB of data on a Lustre (ldiskfs) filesystem (the
command lines are sketched in the P.S. at the end of this mail). The
RAID boxes internally run RAID6 and are split into two partitions, one
mapped to each SCSI port. Read-ahead is set to 32768.

sdb: system controller 1 -> box 1, controller 1
sdc: system controller 1 -> box 2, controller 1
sdd: system controller 2 -> box 1, controller 2
sde: system controller 2 -> box 2, controller 2

All figures below are in KB/s (iozone's default unit).

Plain disks: sdb1 sdc1 sdd1 sde1
--------------------------------
            write   rewrite  read    reread
1 Thread :  225204  269084   288718  288219
2 Threads:  401154  414525   441005  440564
3 Threads:  515818  528943   598863  599455
4 Threads:  587184  638971   737094  730850

raid1 [sdb1 sde1] [sdc1 sdd1] chunk=8192
----------------------------------------
            write   rewrite  read    reread
1 Thread :  179262  271810   293111  293593
2 Threads:  326260  345276   496189  498250
4 Threads:  333085  308820   686983  679123
8 Threads:  348458  277097   643260  673025

raid10 f2 [sdb1 sdc1 sdd1 sde1] chunk=8192
------------------------------------------
            write   rewrite  read    reread
1 Thread :  215560  323921   466460  436195
2 Threads:  288001  304094   611157  586583
4 Threads:  336072  298115   639925  662107
8 Threads:  243053  183969   665743  638512

As you can see, adding a RAID1 or RAID10 layer already costs a certain
amount of performance, but all within reason. Now comes the real
problem:

raid5 [sdb1 sdc1 sdd1 sde1] chunk=64, stripe_cache_size=32768
-------------------------------------------------------------
            write   rewrite  read    reread
1 Thread :  178540  176061   384928  384653
2 Threads:  218113  214308   379950  376312
4 Threads:  225560  160209   359628  359170
8 Threads:  232252  165669   261981  274043

Performance here is totally limited by pdflush (>80% CPU during
writes), with md0_raid5 eating a substantial percentage too.

raid5 [sdb1 sdc1 sdd1 sde1] chunk=8192, stripe_cache_size=32768
---------------------------------------------------------------
            write   rewrite  read    reread
1 Thread :  171138  185105   424504  428974
2 Threads:  165225  141431   553976  545088
4 Threads:  178189  110153   582999  581266
8 Threads:  177892   99679   568720  594580

This is even stranger. Now pdflush uses less CPU (10-70%), but
md0_raid5 is pegged at >95% CPU during writes.

Three questions:

1) pdflush is limited to one thread per filesystem, which is a
   bottleneck for our usage. Can anything be done there?

2) Why is read performance so lousy with the small chunk size?

3) Why does raid5 use so much more CPU time on writes with the larger
   chunk size? The amount of data to checksum is the same (the write
   speed is roughly the same), but the CPU time used goes way up.
   According to vmstat there are no read-modify-write cycles in there,
   just plain continuous writes.

Regards,
Goswin
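
P.S. For anyone wanting to reproduce: the md arrays above were created
along these lines. This is a reconstruction from the labels in the
tables, not our literal command history; device names as in the mapping
at the top, md device name /dev/md0 assumed throughout.

    # raid10 with far-2 layout, 8192 KB chunk (the "raid10 f2" case)
    mdadm --create /dev/md0 --level=10 --layout=f2 --chunk=8192 \
          --raid-devices=4 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1

    # raid5, 64 KB chunk (first raid5 case; --chunk=8192 for the second)
    mdadm --create /dev/md0 --level=5 --chunk=64 \
          --raid-devices=4 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1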
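
The tuning mentioned above was applied roughly like this (assuming the
read-ahead value of 32768 refers to blockdev --setra, i.e. 512-byte
sectors, and stripe_cache_size to the md sysfs attribute, i.e. stripe
cache entries per device):

    # stripe cache for the raid5 runs (entries, not bytes)
    echo 32768 > /sys/block/md0/md/stripe_cache_size

    # read-ahead on the block device (in 512-byte sectors)
    blockdev --setra 32768 /dev/md0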
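
And the iozone runs were of this form (a sketch, not the exact
invocation: the record size and the /mnt/lustre mount point are
assumptions; the per-thread file size is chosen so the total is 16 GB,
e.g. 4 threads x 4 GB, -i 0/-i 1 give the write/rewrite and read/reread
columns):

    iozone -t 4 -s 4g -r 1024k -i 0 -i 1 \
           -F /mnt/lustre/f1 /mnt/lustre/f2 /mnt/lustre/f3 /mnt/lustre/f4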