Re: O_DIRECT to md raid 6 is slow

Chris Murphy <lists@xxxxxxxxxxxxxxxxx> · Sun, 19 Aug 2012 11:02:59 -0600

On Aug 18, 2012, at 9:17 PM, Stan Hoeppner wrote:
>> 
> 
> Yes, as in the case of XFS journal alignment, where the maximum stripe
> unit (chunk) size is 256KB and the recommended size is 32KB.  This is a
> 100% metadata workload, making full stripe writes difficult even with a
> small stripe unit (chunk).  Large chunks simply make it much worse.  And
> every modern filesystem uses a journal…

I agree that a bigger chunk size is not inherently better. I suspect 512K was chosen as the default because it's good enough for most people's storage loads, which aren't spectacularly heavy (in either data or metadata). But all the mdadm documentation I can find drives home the point that to get the best performance, you have to test.
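Before benchmarking, it can help to see how quickly the full-stripe width grows with chunk size, since that width is what a write must cover to avoid RMW. A minimal sketch, assuming a simple model where RAID5 spends one chunk per stripe on parity and RAID6 spends two; `full_stripe_bytes` is a hypothetical helper, not an mdadm function:

```python
# Hypothetical helper (not part of mdadm): data bytes in one full stripe
# of an md parity array, assuming RAID5 uses 1 parity chunk per stripe
# and RAID6 uses 2.

PARITY_DISKS = {"raid5": 1, "raid6": 2}


def full_stripe_bytes(level: str, disks: int, chunk_kib: int) -> int:
    """Return chunk size times the number of data disks, in bytes."""
    data_disks = disks - PARITY_DISKS[level]
    if data_disks < 1:
        raise ValueError("not enough disks for this RAID level")
    return chunk_kib * 1024 * data_disks


if __name__ == "__main__":
    # Compare a few candidate chunk sizes for a 6-disk RAID6.
    for chunk in (32, 64, 128, 256, 512):
        width = full_stripe_bytes("raid6", 6, chunk)
        print(f"chunk {chunk:>3} KiB -> full stripe {width // 1024:>4} KiB")
```

With 6 disks in RAID6 there are 4 data disks, so the 512K default means a write has to cover 2 MiB, aligned, to avoid touching old data or parity.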

One small quibble, however: the three newest filesystems, ZFS, btrfs, and ReFS, don't use journals.

> 
>> Overall, I think 512Kb is quite a good chunk size, even for a raid5
>> array.
> 
> I emphatically disagree.  For the vast majority of workloads, with a
> 512KB chunk RAID5/6, nearly every write will trigger RMW, and RMW is
> what kills parity array performance.  And RMW is *far* more costly than
> sending smaller vs larger IOs to the drives.

I thought that default seemed a bit high, but I'll bet you dollars to donuts that the vast majority of workloads running parity RAID with default settings are 4+ MB files like music and video. If you have a really busy mail server with lots of tiny files, though, you've got a pretty strong case that a 512K chunk across maybe 6 disks is going to cause a lot of unnecessary RMW, and a smaller chunk size will help a lot.
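To make the mail-server vs. media-file contrast concrete, here is a rough sketch that counts how many stripes a single write covers fully and how many it only partly covers. It's a simplified model under stated assumptions: any partly covered stripe is treated as needing RMW, and it ignores md's stripe cache (which can merge adjacent buffered writes, though not O_DIRECT ones issued as-is) and md's choice between read-modify-write and reconstruct-write. `classify_write` and `WriteCost` are hypothetical names:

```python
from dataclasses import dataclass


@dataclass
class WriteCost:
    full_stripes: int     # parity computed from new data alone
    partial_stripes: int  # assumed to need old data/parity read first (RMW)


def classify_write(offset: int, length: int, stripe_bytes: int) -> WriteCost:
    """Hypothetical model: split one write into full and partial stripes."""
    if length <= 0:
        return WriteCost(0, 0)
    end = offset + length
    first = offset // stripe_bytes
    last = (end - 1) // stripe_bytes
    touched = last - first + 1
    head_partial = offset % stripe_bytes != 0  # starts mid-stripe
    tail_partial = end % stripe_bytes != 0     # ends mid-stripe
    if touched == 1:
        partial = 1 if (head_partial or tail_partial) else 0
    else:
        partial = int(head_partial) + int(tail_partial)
    return WriteCost(touched - partial, partial)


if __name__ == "__main__":
    # 512 KiB chunk, 6-disk RAID6 -> 4 data disks -> 2 MiB full stripe.
    stripe = 512 * 1024 * 4
    print(classify_write(0, 4096, stripe))             # small mail file
    print(classify_write(0, 4 * 1024 * 1024, stripe))  # aligned 4 MiB media file
```

Under this model the 4 KiB write prints `WriteCost(full_stripes=0, partial_stripes=1)`, while the aligned 4 MiB write prints `WriteCost(full_stripes=2, partial_stripes=0)`. Shrinking the chunk shrinks the stripe, so more writes in a given workload land in the full-stripe column.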

Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html