On Thu, Jun 28, 2007 at 10:24:54AM +0200, Peter Rabbitson wrote:
> Interesting, I came up with the same results (1M chunk being superior)
> with a completely different raid set with XFS on top:
>
> mdadm --create \
>       --level=10 \
>       --chunk=1024 \
>       --raid-devices=4 \
>       --layout=f3 \
>       ...
>
> Could it be attributed to XFS itself?

Sort of..

/dev/md4:
        Version : 00.90.03
     Raid Level : raid5
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 4
 Active Devices : 4
Working Devices : 4
         Layout : left-symmetric
     Chunk Size : 256K

This means there are 3x 256k per stripe for user data.  I had to
carefully tune the XFS bsize/sunit/swidth to match that:

meta-data=/dev/DataDisk/lvol0  isize=256    agcount=32, agsize=7325824 blks
         =                     sectsz=512   attr=1
data     =                     bsize=4096   blocks=234426368, imaxpct=25
         =                     sunit=64     swidth=192 blks, unwritten=1
...

That is, 4k * 64 = 256k, and 64 * 3 = 192.

With that, bulk writing on the filesystem runs without needing to read
back blocks from disk to calculate the RAID5 parity, which is what
happens when the filesystem's idea of a block does not align with the
RAID5 stripe.

I do have LVM in between the MD-RAID5 and XFS, so I also aligned the
LVM to that 3 * 256k.

Doing this alignment boosted write performance by nearly a factor of 2
over mkfs.xfs with default parameters.

With a very wide RAID5, like in the original question... I would find
it very surprising if aligning the upper layers to the MD-RAID
geometry were not important there as well.

Very small contiguous writes do not make good use of the disk
mechanics (seek time, rotational delay), so something on the order of
128k-1024k will speed things up -- presuming that when you write, you
write many MB at a time.  Database transactions are a lot smaller, and
are indeed harmed by such large megachunk-oriented IO layouts.

RAID levels 0 and 1 (and 10) have no need to read back parts of the
stripe when an incoming write alters only a subset of it.

A DB application on top of the filesystem would benefit if we had a
way for it to ask about these alignment boundaries, so it could read
a whole alignment block even though it writes out only a subset of it.
(The theory being that those same blocks would then also exist in the
memory cache and be available for write-back parity calculation.)

> Peter

	/Matti Aarnio
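
P.S.  For reference, a sketch of passing the alignment to mkfs.xfs
explicitly instead of relying on it probing the geometry.  The numbers
assume the 4-disk RAID5 with 256k chunks above (3 data disks per
stripe); the LV path is the one from the xfs_info output:

  # su = MD chunk size, sw = number of data-bearing disks (4 disks - 1 parity)
  mkfs.xfs -d su=256k,sw=3 /dev/DataDisk/lvol0

  # or, equivalently, in 512-byte sectors: 256k = 512 sectors, 3 * 512 = 1536
  mkfs.xfs -d sunit=512,swidth=1536 /dev/DataDisk/lvol0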
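
P.P.S.  Aligning the LVM layer could look roughly like the following
on newer LVM2 (--dataalignment does not exist in older releases, and
this is only a sketch, not exactly what I did; 768k is the full data
stripe, 3 * 256k):

  # start the PV data area on a full-stripe boundary
  pvcreate --dataalignment 768k /dev/md4
  vgcreate DataDisk /dev/md4
  lvcreate -l 100%FREE -n lvol0 DataDisk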
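
P.P.P.S.  There is no generic "ask the filesystem about alignment" API,
but the pieces are at least visible from user space.  A sketch, with the
mount point and md device as assumptions (the sysfs attributes should be
present on reasonably recent kernels):

  # XFS's view: sunit/swidth, in filesystem blocks (bsize)
  xfs_info /data | grep -E 'sunit|swidth'

  # MD's view: chunk size in bytes, disk count, and RAID level
  cat /sys/block/md4/md/chunk_size
  cat /sys/block/md4/md/raid_disks
  cat /sys/block/md4/md/level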