[ ... ]

>> As to this, in theory even having split the files among 4
>> AGs, the upload from system RAM to host adapter RAM and then
>> to disk could happen by writing first all the dirty blocks
>> for one AG, then a long seek to the next AG, and so on, and
>> the additional cost of 3 long seeks would be negligible.

> Yes, that’s exactly what I had in mind, and what prompted me
> to write this post. It would be about 10 times as fast.

Ahhh yes, but let's go back to this and summarize some of my
previous observations:

* If the scheduling order were by AG and the hardware were
  parallel, the available parallelism would not be exploited
  (and fragmentation might be worse), as if there were only a
  single AG. And XFS lets you configure the number of AGs in
  part for that reason.

* Your storage layer does not seem to deliver parallel
  operations: as the ~100MB/s overall 'ext4' speed and the seek
  graphs show, in effect your 4+2 RAID6 performs in this case
  as if it were a single drive with a single arm.

* Even with the actual scheduling at the Linux level being by
  interleaving AGs in XFS, your host adapter with a BBWC should
  be able to reorder the writes, in 256MiB lots, ignoring Linux
  level barriers and ordering, but it seems that this is not
  happening.

So the major things to look into seem to me:

* Ensure that your RAID set can deliver the parallelism at
  which XFS is targeted, with the bulk transfer rates that it
  can do.

* Otherwise figure out ways to ensure that the IO transactions
  generated by XFS are not in interleaved-AG order.

* Otherwise figure out ways to get the XFS IO ordering
  rearranged at the storage layer into spacewise order.

Summarizing some of the things to try, some of them rather
tentative, because you have a rather peculiar corner case:

* Change the flusher to write out incrementally instead of just
  at 'sync' time, e.g. every 1-2 seconds.
  In some similar cases this makes things a lot better, as
  large 'uploads' to the storage layer from the page cache can
  cause damaging latencies. But the success of this may depend
  on having a properly parallel storage layer, at least for
  XFS.

* Use a different RAID setup: if the RAID set is used only for
  reproducible data, a RAID0, else a RAID10, or even a RAID5
  with a small chunk size.

* Check the elevator and cache policy on the P400, if they are
  settable. Too bad many RAID host adapters have (euphemism)
  hideous fw (many older 3ware models come to mind) with some
  undocumented (euphemism) peculiarities as to scheduling.

* Tweak 'queue/nr_requests' and 'device/queue_depth'. Probably
  they should be big (hundreds/thousands), but various settings
  should be tried, as fw sometimes is so weird.

* Given that it is now established that your host adapter has a
  BBWC, consider switching the Linux elevator to 'noop', so as
  to leave IO scheduling to the host adapter fw and reduce
  issue latency. 'queue/nr_requests' may perhaps be set to a
  very low number here, but my guess is that it shouldn't
  matter.

* Alternatively, if the host adapter fw insists on not
  reordering IO from the Linux level, use Linux elevator
  settings that behave similarly to 'anticipatory'.

It may help to use Bonnie (Garloff's 1.4 version with
'-o_direct') to get a rough feel for the filetree speed
profile; for example I tend to use these options:

  Bonnie -y -u -o_direct -s 2000 -v 2 -d "$DIR"

Ultimately even 'ext4' does not seem the right filesystem for
this workload, because all these "legacy" filesystems are
targeted at situations where data is much bigger than memory,
and you are trying to fit them into a very specific corner case
where the opposite is true.
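To make the flusher and queue suggestions above concrete, here is a tentative sketch of the knobs involved; the device name 'sda' and the particular values are assumptions to be adjusted (and experimented with) for your P400 setup, and all of this needs root:

```shell
#!/bin/sh
# Tentative tuning sketch; 'sda' is a placeholder device name.
DEV=sda

# Flush dirty pages incrementally rather than in one huge burst
# at 'sync' time: wake the flusher every 1s, and start background
# writeback early, at ~1% of memory dirty.
sysctl -w vm.dirty_writeback_centisecs=100
sysctl -w vm.dirty_background_ratio=1
sysctl -w vm.dirty_ratio=5

# With a BBWC host adapter, leave IO ordering to the adapter fw.
echo noop > /sys/block/$DEV/queue/scheduler

# Queue sizing: try both large and small values, as fw behaviour
# varies; these are just starting points.
echo 512 > /sys/block/$DEV/queue/nr_requests
echo 64  > /sys/block/$DEV/device/queue_depth
```

The sysctl changes take effect immediately but do not survive a reboot; persist them in '/etc/sysctl.conf' once a good combination is found.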
Making my fantasy run wild, my guess is that your workload is
not 'tar x' but release building, where sources and objects fit
entirely in memory, and you are only concerned with persisting
the sources because you want to do several builds from that set
of sources without re-tar-x-ing them, and ideally you would
like to reduce build times by building several objects in
parallel.

BTW, your corner case then has another property: disk writes
greatly exceed disk reads, because you would write the sources
only once and then read them from cache every time thereafter
while the system is up.

I doubt also that you would want to persist the generated
objects themselves, but only the generated final "package"
containing them, which might suggest building the objects in a
'tmpfs', unless you want them persisted (a bit) to make builds
restartable.

If that's the case, and you cannot fix the storage layer to be
more suitable for 'ext4' or XFS, consider using NILFS2, or even
'ext2' (with a long flusher interval perhaps).

Note: or "cheat" and do your builds to a flash SSD, as their fw
implements a COW/logging allocation strategy, and they have
nicer seek times :-).

> That’s what bothers me so much.

And in case you did not get this before, I have a long-standing
pet peeve about abusing filesystems for small-file IO, or other
ways of going against the grain of what is plausible, which I
call the "syntactic approach" (every syntactically valid system
configuration is assumed to work equally well...).

Some technical postscripts:

* It seems that most if not all RAID6 implementations don't do
  shortened RMWs, where only the updated blocks and the PQ
  blocks are involved, but always do full-stripe RMW. Even with
  a BBWC in the host adapter this is one major reason to avoid
  RAID6 in favor of at least RAID5, for your setup in
  particular. But hey, RAID6 setups are all syntactically
  valid!
:-)

* The 'ext3' on-disk layout and allocation policies seem to
  deliver very good compact locality on bulk writeouts and on
  relatively fresh filetrees, but then locality can degrade
  apocalyptically over time, like seven times:

    http://www.sabi.co.uk/blog/anno05-3rd.html#050913

  I suspect that the same applies to 'ext4', even if perhaps a
  bit less. You have tried to "age" the filetree a bit, but I
  suspect you did not succeed enough, as the graphed Linux-level
  seek patterns with 'ext4' show a mostly-linear write.

* Hopefully your storage layer does not use DM/LVM...

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs