[ ... ]

>> As to this, in theory even having split the files among 4
>> AGs, the upload from system RAM to host adapter RAM and then
>> to disk could happen by writing first all the dirty blocks
>> for one AG, then a long seek to the next AG, and so on, and
>> the additional cost of 3 long seeks would be negligible.

> Yes, that’s exactly what I had in mind, and what prompted me
> to write this post. It would be about 10 times as fast.

Ahhh yes, but let's go back to this and summarize some of my
previous observations:

* If the scheduling order were by AG and the hardware were
  parallel, the available parallelism would not be exploited
  (and fragmentation might be worse), as if there were only a
  single AG. And XFS lets you configure the number of AGs in
  part for that reason.

* Your storage layer does not seem to deliver parallel
  operations: as the ~100MB/s overall 'ext4' speed and the seek
  graphs show, in effect your 4+2 RAID6 performs in this case
  as if it were a single drive with a single arm.

* Even with the actual scheduling at the Linux level being by
  interleaving AGs in XFS, your host adapter with a BBWC should
  be able to reorder the writes, in 256MiB lots, ignoring Linux
  level barriers and ordering, but it seems that this is not
  happening.

So the major things to look into seem to me:

* Ensure that your RAID set can deliver the parallelism at
  which XFS is targeted, with the bulk transfer rates that it
  can do.

* Otherwise figure out ways to ensure that the IO transactions
  generated by XFS are not in interleaved-AG order.

* Otherwise figure out ways to get the XFS IO ordering
  rearranged at the storage layer into spacewise order.

Summarizing some of the things to try, some of them rather
tentative, because you have a rather peculiar corner case:

* Change the flusher to write out incrementally instead of just
  at 'sync' time, e.g. every 1-2 seconds.
  In some similar cases this makes things a lot better, as
  large 'uploads' to the storage layer from the page cache can
  cause damaging latencies. But the success of this may depend
  on having a properly parallel storage layer, at least for
  XFS.

* Use a different RAID setup: if the RAID set is used only for
  reproducible data, a RAID0, else a RAID10, or even a RAID5
  with a small chunk size.

* Check the elevator and cache policy on the P400, if they are
  settable. Too bad many RAID host adapters have (euphemism)
  hideous fw (many older 3ware models come to mind) with some
  undocumented (euphemism) peculiarities as to scheduling.

* Tweak 'queue/nr_requests' and 'device/queue_depth'. Probably
  they should be big (hundreds/thousands), but various settings
  should be tried, as fw sometimes is so weird.

* Given that it is now established that your host adapter has a
  BBWC, consider switching the Linux elevator to 'noop', so as
  to leave IO scheduling to the host adapter fw and reduce
  issue latency. 'queue/nr_requests' may perhaps be set to a
  very low number here, but my guess is that it shouldn't
  matter.

* Alternatively, if the host adapter fw insists on not
  reordering IO from the Linux level, use Linux elevator
  settings that behave similarly to 'anticipatory'.

It may help to use Bonnie (Garloff's 1.4 version with
'-o_direct') to get a rough feel for the filetree speed
profile; for example I tend to use these options:

  Bonnie -y -u -o_direct -s 2000 -v 2 -d "$DIR"

Ultimately even 'ext4' does not seem the right filesystem for
this workload, because all these "legacy" filesystems are
targeted at situations where data is much bigger than memory,
and you are trying to fit them into a very specific corner case
where the opposite is true.
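To make the flusher and queue suggestions above concrete, here is a tentative sketch of the knobs involved; the device name 'sda' and the particular values are assumptions to be adjusted (and experimented with) for your P400 setup, and all of this needs root:

```shell
#!/bin/sh
# Tentative tuning sketch; 'sda' is a placeholder device name.
DEV=sda

# Flush dirty pages incrementally rather than in one huge burst
# at 'sync' time: wake the flusher every 1s, and start background
# writeback early, at ~1% of memory dirty.
sysctl -w vm.dirty_writeback_centisecs=100
sysctl -w vm.dirty_background_ratio=1
sysctl -w vm.dirty_ratio=5

# With a BBWC host adapter, leave IO ordering to the adapter fw.
echo noop > /sys/block/$DEV/queue/scheduler

# Queue sizing: try both large and small values, as fw behaviour
# varies; these are just starting points.
echo 512 > /sys/block/$DEV/queue/nr_requests
echo 64  > /sys/block/$DEV/device/queue_depth
```

The sysctl changes take effect immediately but do not survive a reboot; persist them in '/etc/sysctl.conf' once a good combination is found.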
Making my fantasy run wild, my guess is that your workload is
not 'tar x' but release building, where sources and objects fit
entirely in memory, and you are only concerned with persisting
the sources because you want to do several builds from that set
of sources without re-tar-x-ing them, and ideally you would
like to reduce build times by building several objects in
parallel.

BTW, your corner case then has another property: disk writes
greatly exceed disk reads, because you would write the sources
only once and then read them from cache every time thereafter
while the system is up.

I doubt also that you would want to persist the generated
objects themselves, but only the generated final "package"
containing them, which might suggest building the objects in a
'tmpfs', unless you want them persisted (a bit) to make builds
restartable.

If that's the case, and you cannot fix the storage layer to be
more suitable for 'ext4' or XFS, consider using NILFS2, or even
'ext2' (with a long flusher interval perhaps).

Note: or "cheat" and do your builds to a flash SSD, as their fw
implements a COW/logging allocation strategy, and they have
nicer seek times :-).

> That’s what bothers me so much.

And in case you did not get this before, I have a long-standing
pet peeve about abusing filesystems for small-file IO, or other
ways of going against the grain of what is plausible, which I
call the "syntactic approach" (every syntactically valid system
configuration is assumed to work equally well...).

Some technical postscripts:

* It seems that most if not all RAID6 implementations don't do
  shortened RMWs, where only the updated blocks and the PQ
  blocks are involved, but always do full-stripe RMW. Even with
  a BBWC in the host adapter this is one major reason to avoid
  RAID6 in favor of at least RAID5, for your setup in
  particular. But hey, RAID6 setups are all syntactically
  valid!
:-)

* The 'ext3' on-disk layout and allocation policies seem to
  deliver very good compact locality on bulk writeouts and on
  relatively fresh filetrees, but then locality can degrade
  apocalyptically over time, like seven times:

    http://www.sabi.co.uk/blog/anno05-3rd.html#050913

  I suspect that the same applies to 'ext4', even if perhaps a
  bit less. You have tried to "age" the filetree a bit, but I
  suspect you did not succeed enough, as the graphed Linux-level
  seek patterns with 'ext4' show a mostly-linear write.

* Hopefully your storage layer does not use DM/LVM...

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs