On Fri, Jun 03, 2011 at 03:28:54PM -0700, Phil Karn wrote:
> On 6/2/11 7:54 PM, Dave Chinner wrote:
>
> > There are definitely cases where it helps for preventing
> > fragmentation, but as a sweeping generalisation it is very, very
> > wrong.
>
> Well, if I ever see that in practice I'll change my procedures.
>
> > Do you do that for temporary object files when you build <program X>
> > from source?
>
> No, that would involve patching gcc to use fallocate(). I could be
> wrong -- I don't know much about gcc internals -- but I think most
> temp files go on /tmp, which is not xfs. As I clearly said, I patched
> only a few file copy programs like rsync that I use to create
> long-lived files. I can't see why the upstream maintainers of those
> programs shouldn't accept patches to incorporate fallocate(), as long
> as care is taken to avoid calling the POSIX version and no other harm
> is done on file systems or OSes that don't support it.

They are trying, but, well, the file corruption problems seen on
2.6.38/.39 kernels as a result of them using fiemap/fallocate don't
inspire me with confidence....

> > away by simply increasing the XFS inode size at mkfs time? And that
> > there is almost no performance penalty for doing this? Instead, it
> > seems you found a hammer named fallocate() and proceeded to treat
> > every tool you have like a nail. :)
>
> You do realize that I started experimenting with attributes well
> *after* I had built XFS on a 6 TB (net) RAID5 that took over a week of
> solid copying to load to 50%? I had noticed the inode size parameter
> to mkfs.xfs, but I wasn't about to buy four more disks, mkfs a whole
> new file system with bigger inodes and copy all my data (again) just
> to waste more space on largely empty inodes and, more importantly,
> require many more disk seeks and reads to walk through them all.
>
> The default xfs inode is 256 bytes. That means a single 4KiB block
> read fetches 16 inodes at once. Making each inode 512 bytes means
> reading only 8 inodes in each 4KiB block. That's arithmetic.

XFS does not do inode IO like that, so your logic is flawed.

Firstly, inodes are read and written in clusters of 8k, and contiguous
inode clusters are merged during IO by the elevator. Metadata blocks
are heavily sorted before being issued for writeback, so we get
excellent large IO patterns even for metadata IO. Under heavy file
create workloads, I'm seeing XFS consistently write metadata to disk
in 320k IOs - the maximum IO size my storage subsystem will allow.

e.g. a couple of instructive graphs from Chris Mason for a parallel
file create workload:

http://oss.oracle.com/~mason/seekwatcher/fs_mark/xfs.png
http://oss.oracle.com/~mason/seekwatcher/fs_mark/xfs.ogg

The fact that ~5000 IOPS is being sustained with only 30-100 seeks/s
indicates that the elevator is merging roughly 50-100 individual IOs
into each physical IO. This will happen regardless of inode size, so
inode/metadata writeback under these workloads tends to be limited by
bandwidth, not IOPS....

Reads might be a bit more random, but because inodes are allocated in
larger chunks (64 inodes at a time) and there is temporal locality due
to sequential allocation by apps like rsync, reads also typically land
in localised areas and hit track caches or RAID controller readahead
windows.
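As an aside for anyone on the list who hasn't seen the pattern Phil is
describing above: it is simply the Linux-native fallocate(2) call, with
the tool falling back to plain writes when the filesystem or kernel
doesn't support it. A rough, untested sketch of that shape - not the
actual rsync patch - would look something like this:

/*
 * Sketch of the "preallocate if we can, otherwise just write" pattern
 * being discussed.  Illustrative only - not the actual rsync patch.
 */
#define _GNU_SOURCE
#include <fcntl.h>      /* fallocate(), open() */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Try to preallocate 'len' bytes at the start of fd.  Returns 0 on
 * success, -1 if preallocation isn't available; the caller just does
 * ordinary buffered writes in that case.
 */
static int try_prealloc(int fd, off_t len)
{
	if (fallocate(fd, 0, 0, len) == 0)
		return 0;

	/*
	 * EOPNOTSUPP/ENOSYS: the filesystem or kernel has no fallocate
	 * support.  Deliberately do not fall back to posix_fallocate()
	 * here - see below.
	 */
	if (errno == EOPNOTSUPP || errno == ENOSYS)
		return -1;

	perror("fallocate");
	return -1;
}

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <file> <bytes>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_WRONLY | O_CREAT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	if (try_prealloc(fd, (off_t)atoll(argv[2])) == 0)
		printf("preallocated %s bytes\n", argv[2]);
	else
		printf("no preallocation; fall back to plain writes\n");
	return 0;
}

The reason for avoiding posix_fallocate() is that glibc emulates it on
filesystems without real preallocation support by writing into every
block of the requested range, which is exactly the "other harm" Phil
wants to avoid.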
> And I'd still have no guarantee of keeping my attributes in the inodes
> without some limit on the size of the extent list.

Going from 256 -> 512 byte inodes gives you 256 bytes more space for
attributes and extents, which in your case would be entirely for data
extents. In that space you can fit another 16 extent records, which is
more than enough for 99.9% of normal files.

> > Changing a single mkfs parameter is far less work than maintaining
> > your own forks of multiple tools....
>
> See above. I've since built a new RAID1 array with bigger and faster
> drives and am abandoning RAID5, but I still see no reason to waste
> disk space and seeks on larger data structures that are mostly empty
> space.

Well, if you think that inodes are taking too much space, then I guess
you'd be really concerned about the amount of space that directories
consume and how badly they get fragmented ;)

> > Until aging has degraded your filesystem till free space is
> > sufficiently fragmented that you can't allocate large extents any
> > more. Then you are completely screwed. :/
>
> Once again, it is very difficult to see how keeping my long-lived
> files contiguous causes free space to become more fragmented, not
> less. Help me out here; it's highly counter-intuitive, and more
> importantly I haven't seen that problem, at least not yet.

Initial allocations are done via the "allocate near" algorithm. It
starts by finding the largest freespace extent that will hold the
allocation via a -size- match, i.e. it looks for a match on the size
you are asking for. If there isn't a free space extent large enough,
it falls back to searching for a large enough extent near to where you
are asking, with an increasing search radius.

Once a free space extent is found, it is then trimmed for alignment to
stripe unit/stripe width. This generally leaves small, isolated chunks
of free space behind, because allocations are typically not stripe
unit/width length. Hence you end up with lots of little holes around.

Subsequent sequential allocations use an exact block allocation target
to try to extend the contiguous allocation each file already has. For
large files, this tends to keep the files contiguous, or at least made
up of multiple large extents rather than lots of small ones.

Then unrelated metadata allocations - inodes, btree blocks, directory
blocks or attributes - tend to fill those little holes. If there
aren't little holes (or you aren't using alignment), they simply sit
between data extents. When you then free the allocated data space,
you've still got that unrelated metadata lying around, and the free
space is now somewhat fragmented. This pattern gets worse as the
filesystem ages.

Delayed allocation reduces the impact of this problem because it
reduces the amount of on-disk metadata modification that occurs during
normal operation. It also allows things like directory and inode
extent allocation during creates (e.g. untarring) to avoid
interleaving with data allocations, so directory and inode extents
tend to cluster, stay contiguous, and not fill holes between data
extents. This means you are less likely to get sparse metadata blocks
fragmenting free space, metadata read and write IO is more likely to
be clustered effectively (better IO performance), and so on. IOWs,
there are many reasons why delayed allocation reduces the effects of
filesystem aging compared to up-front preallocation....
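As another aside, if anyone wants to see how their own files have
actually been laid out, xfs_bmap -v will dump the extent map for a
file, and the generic FIEMAP ioctl (the interface filefrag uses) will
give you an extent count on any filesystem that supports it. A rough,
untested sketch, with error handling kept to a minimum:

/*
 * Count the on-disk extents of a file via FS_IOC_FIEMAP.  Rough,
 * untested sketch - filefrag does the same job with more detail.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>           /* FS_IOC_FIEMAP */
#include <linux/fiemap.h>       /* struct fiemap, FIEMAP_FLAG_SYNC */

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	struct fiemap fm;
	memset(&fm, 0, sizeof(fm));
	fm.fm_start = 0;
	fm.fm_length = ~0ULL;           /* map the whole file */
	fm.fm_flags = FIEMAP_FLAG_SYNC; /* flush delalloc first */
	fm.fm_extent_count = 0;         /* 0 = just count, don't copy records */

	if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0) {
		perror("FS_IOC_FIEMAP");
		return 1;
	}

	printf("%s: %u extents\n", argv[1], fm.fm_mapped_extents);
	return 0;
}

The FIEMAP_FLAG_SYNC flag matters here: without it, data still sitting
in delayed allocation is reported as delalloc extents with no physical
location yet, rather than as real on-disk extents.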
> I have a few extremely large files (many GB) that cannot be allocated
> a contiguous area. That's probably because of xfs's strategy of
> scattering files around disk to allow room for growth, which fragments
> the free space.

I doubt it. An extent can be at most 8GB on a 4kB block size
filesystem, so that's why you see multiple extents for large files -
i.e. they require multiple allocations....

> You seem to take personal offense to my use of fallocate(), which is
> hardly my intention.

Nothing personal at all.

> Did you perhaps write the xfs preallocation code
> that I'm bypassing?

No. People much smarter than me designed and wrote all this stuff.

What I'm commenting on is your implication (sweeping generalisation)
that preallocation should be used everywhere because it seems to work
for you. I don't like to let such statements stand unchallenged,
especially when there are very good reasons why they are likely to be
wrong.

I don't do this for my benefit - and I don't really care whether you
benefit from it or not - but there are a lot of XFS users on this list
who might be wondering "why isn't that done by default?". Those people
learn a lot from someone explaining why what one person finds
beneficial for their use case might be considered harmful to everyone
else...

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx