While I agree fully with the BAARF.com arguments, I sometimes have to deal with legacy systems with wide RAID6 sets (for example 16 drives, quite revolting) which have op-journaled filesystems on them like XFS or JFS (sometimes block-journaled ext[34], but I am not that interested in those here). Sometimes (but fortunately not that recently) I have had to deal with small-file filesystems set up on wide-stripe RAID6 by morons who don't understand the difference between a database and a filesystem (and I have strong doubts that RAID6 is remotely appropriate for databases). So I'd like to figure out how much effort I should invest in undoing cases of the above, that is how bad they are likely to be and how badly they degrade over time (usually very badly).

First, a couple of questions purely about RAID, but indirectly relevant to op-journaled filesystems:

* Can Linux MD do "abbreviated" read-modify-write RAID6 updates as it does for RAID5? That is, where not the whole stripe is read in, modified and written, but just the block to be updated and the parity blocks.

* When reading or writing part of a RAID[456] stripe, for example something smaller than a sector, what is the minimum unit of transfer with Linux MD? The full stripe, the chunk containing the sector, or just the sector containing the bytes to be written or updated (and potentially the parity sectors)? I would expect reads to always fetch just the sector, but I am not so sure about writes.

* What about popular HW RAID host adapters (e.g. LSI, Adaptec, Areca, 3ware)? Where is the documentation, if any, on how they behave in these cases?

Regardless, op-journaled filesystem designs like JFS and XFS write small records (well below the stripe set size, and usually well below the chunk size) to the journal when they queue operations, even if, depending on design and options, they may sometimes "batch" the journal updates (potentially weakening safety semantics). They also do small writes when they dequeue operations from the journal to the actual metadata records involved.

How bad can this be when the journal is, say, internal to a filesystem that is held on a wide-stripe RAID6 set? I suspect very, very bad, with apocalyptic read-modify-write storms eating IOPS. I suspect that this happens a lot with SSDs too, where the role of the stripe set size is played by the erase block size (often in the hundreds of KBytes, and even more expensive).

Where are there studies, or even just impressions and anecdotes, of how bad this is? Are there instrumentation tools in JFS or XFS that would allow me to watch/inspect what is happening with the journal? And for Linux MD, to see the rates of stripe r-m-w cases? Two rough sketches of what I have in mind are appended below.
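To make the scale of the worry concrete, here is a back-of-the-envelope sketch in Python. The numbers (16 drives, 64 KiB chunks, a 4 KiB journal write) are made up but typical of the setups above, and whether MD really handles parity per whole chunk or per some smaller unit is exactly what the questions above ask, so model A should be read as a pessimistic bound, not a claim about what MD actually does:

# Back-of-the-envelope I/O amplification for one small journal write
# landing inside a wide RAID6 stripe. Geometry is assumed, not measured.

DRIVES = 16                # drives in the set
PARITY = 2                 # RAID6: P and Q
CHUNK  = 64 * 1024         # bytes per chunk (assumed)
WRITE  = 4 * 1024          # one small journal write (assumed)

data_drives  = DRIVES - PARITY
stripe_width = data_drives * CHUNK

# Model A: "whole stripe" update -- read all data chunks to recompute
# P and Q, write back the updated chunk plus both parity chunks.
a_read  = data_drives * CHUNK
a_write = CHUNK + PARITY * CHUNK

# Model B: "abbreviated" RAID5-style rmw -- read old data + old P + old Q
# for just the sectors touched, write the same three back.
b_read  = WRITE * (1 + PARITY)
b_write = WRITE * (1 + PARITY)

def kib(n): return n // 1024

print(f"stripe width          : {kib(stripe_width)} KiB for a {kib(WRITE)} KiB write")
print(f"model A (whole stripe): read {kib(a_read)} KiB, write {kib(a_write)} KiB,"
      f" ~{(a_read + a_write) / WRITE:.0f}x amplification")
print(f"model B (abbrev. rmw) : read {kib(b_read)} KiB, write {kib(b_write)} KiB,"
      f" ~{(b_read + b_write) / WRITE:.0f}x amplification")

Even if model A overstates what MD actually does, the gap between roughly 270x and 6x is the difference I am trying to quantify; for SSDs, substitute the erase block size for the stripe width and the same arithmetic applies.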
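On the instrumentation side, this is the sort of crude polling I have in mind: sample the XFS log counters from /proc/fs/xfs/stat next to the MD stripe cache activity in sysfs, and eyeball the deltas against iostat on the member devices. A minimal sketch, assuming those interfaces are present (field positions are from my own boxes and may differ); as far as I can tell there is no direct counter of rmw-versus-reconstruct decisions, which is partly why I am asking:

#!/usr/bin/env python3
# Crude poller: watch XFS log traffic and MD stripe-cache activity side
# by side. Paths and field layout are assumptions from my own machines.

import sys, time

XFS_STAT = "/proc/fs/xfs/stat"            # global XFS counters (all mounts)
MD_DEV   = sys.argv[1] if len(sys.argv) > 1 else "md0"
STRIPE_ACTIVE = f"/sys/block/{MD_DEV}/md/stripe_cache_active"

def xfs_log_counters():
    # The "log" line is: log <writes> <blocks> <noiclogs> <force> <force_sleep>
    # (blocks are, I believe, 512-byte units -- treat as approximate).
    with open(XFS_STAT) as f:
        for line in f:
            if line.startswith("log "):
                fields = line.split()
                return int(fields[1]), int(fields[2])
    return 0, 0

def stripe_cache_active():
    try:
        with open(STRIPE_ACTIVE) as f:
            return int(f.read().strip())
    except OSError:
        return -1          # not a raid456 array, or the path differs

prev_w, prev_b = xfs_log_counters()
while True:
    time.sleep(1)
    w, b = xfs_log_counters()
    print(f"log writes/s {w - prev_w:6d}  log 512B-blocks/s {b - prev_b:8d}"
          f"  stripe_cache_active {stripe_cache_active():6d}")
    prev_w, prev_b = w, b

Running this next to "iostat -x 1" on the md device and its members should show whether the member devices are doing far more reading and writing than the log counters say the filesystem is pushing, which would be the amplification I am worried about.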