On Fri, Jun 03, 2011 at 03:28:54PM -0700, Phil Karn wrote:
> On 6/2/11 7:54 PM, Dave Chinner wrote:
>
> > There are definitely cases where it helps for preventing
> > fragmentation, but as a sweeping generalisation it is very, very
> > wrong.
>
> Well, if I ever see that in practice I'll change my procedures.
>
> > Do you do that for temporary object files when you build <program X>
> > from source?
>
> No, that would involve patching gcc to use fallocate(). I could be
> wrong -- I don't know much about gcc internals -- but I think most
> temp files go on /tmp, which is not xfs. As I clearly said, I patched
> only a few file copy programs like rsync that I use to create
> long-lived files. I can't see why the upstream maintainers of those
> programs shouldn't accept patches to incorporate fallocate(), as long
> as care is taken to avoid calling the POSIX version and no other harm
> is done on file systems or OSes that don't support it.

They are trying, but, well, the file corruption problems seen on
2.6.38/.39 kernels as a result of them using fiemap/fallocate don't
inspire me with confidence....

> > away by simply increasing the XFS inode size at mkfs time? And that
> > there is almost no performance penalty for doing this? Instead, it
> > seems you found a hammer named fallocate() and proceeded to treat
> > every tool you have like a nail. :)
>
> You do realize that I started experimenting with attributes well
> *after* I had built XFS on a 6 TB (net) RAID5 that took over a week of
> solid copying to load to 50%? I had noticed the inode size parameter
> to mkfs.xfs, but I wasn't about to buy four more disks, mkfs a whole
> new file system with bigger inodes and copy all my data (again) just
> to waste more space on largely empty inodes and, more importantly,
> require many more disk seeks and reads to walk through them all.
>
> The default xfs inode is 256 bytes. That means a single 4KiB block
> read fetches 16 inodes at once. Making each inode 512 bytes means
> reading only 8 inodes in each 4KiB block. That's arithmetic.

XFS does not do inode IO like that, so your logic is flawed.

Firstly, inodes are read and written in clusters of 8k, and contiguous
inode clusters are merged during IO by the elevator. Metadata blocks
are heavily sorted before being issued for writeback, so we get
excellent large IO patterns even for metadata IO. Under heavy file
create workloads, I'm seeing XFS consistently write metadata to disk
in 320k IOs - the maximum IO size my storage subsystem will allow.

e.g. a couple of instructive graphs from Chris Mason for a parallel
file create workload:

http://oss.oracle.com/~mason/seekwatcher/fs_mark/xfs.png
http://oss.oracle.com/~mason/seekwatcher/fs_mark/xfs.ogg

The fact that ~5000 IOPS is being sustained with only 30-100 seeks/s
indicates that the elevator is merging roughly 50-100 individual IOs
into each physical IO. This will happen regardless of inode size, so
inode/metadata writeback under these workloads tends to be limited by
bandwidth, not IOPS....

Reads might be a bit more random, but because inodes are allocated in
larger chunks (64 inodes at a time) and there is temporal locality due
to sequential allocation by apps like rsync, reads also typically land
in localised areas and hit track caches or RAID controller readahead
windows.
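As an aside for anyone on the list who hasn't seen the pattern Phil is
describing above: it is simply the Linux-native fallocate(2) call, with
the tool falling back to plain writes when the filesystem or kernel
doesn't support it. A rough, untested sketch of that shape - not the
actual rsync patch - would look something like this:

/*
 * Sketch of the "preallocate if we can, otherwise just write" pattern
 * being discussed.  Illustrative only - not the actual rsync patch.
 */
#define _GNU_SOURCE
#include <fcntl.h>      /* fallocate(), open() */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Try to preallocate 'len' bytes at the start of fd.  Returns 0 on
 * success, -1 if preallocation isn't available; the caller just does
 * ordinary buffered writes in that case.
 */
static int try_prealloc(int fd, off_t len)
{
	if (fallocate(fd, 0, 0, len) == 0)
		return 0;

	/*
	 * EOPNOTSUPP/ENOSYS: the filesystem or kernel has no fallocate
	 * support.  Deliberately do not fall back to posix_fallocate()
	 * here - see below.
	 */
	if (errno == EOPNOTSUPP || errno == ENOSYS)
		return -1;

	perror("fallocate");
	return -1;
}

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <file> <bytes>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_WRONLY | O_CREAT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	if (try_prealloc(fd, (off_t)atoll(argv[2])) == 0)
		printf("preallocated %s bytes\n", argv[2]);
	else
		printf("no preallocation; fall back to plain writes\n");
	return 0;
}

The reason for avoiding posix_fallocate() is that glibc emulates it on
filesystems without real preallocation support by writing into every
block of the requested range, which is exactly the "other harm" Phil
wants to avoid.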
> And I'd still have no guarantee of keeping my attributes in the inodes
> without some limit on the size of the extent list.

Going from 256 -> 512 byte inodes gives you 256 bytes more space for
attributes and extents, which in your case would be entirely for data
extents. In that space you can fit another 16 extent records, which is
more than enough for 99.9% of normal files.

> > Changing a single mkfs parameter is far less work than maintaining
> > your own forks of multiple tools....
>
> See above. I've since built a new RAID1 array with bigger and faster
> drives and am abandoning RAID5, but I still see no reason to waste
> disk space and seeks on larger data structures that are mostly empty
> space.

Well, if you think that inodes are taking too much space, then I guess
you'd be really concerned about the amount of space that directories
consume and how badly they get fragmented ;)

> > Until aging has degraded your filesystem till free space is
> > sufficiently fragmented that you can't allocate large extents any
> > more. Then you are completely screwed. :/
>
> Once again, it is very difficult to see how keeping my long-lived
> files contiguous causes free space to become more fragmented, not
> less. Help me out here; it's highly counter-intuitive, and more
> importantly I haven't seen that problem, at least not yet.

Initial allocations are done via the "allocate near" algorithm. It
starts by finding the largest freespace extent that will hold the
allocation via a -size- match, i.e. it looks for a match on the size
you are asking for. If there isn't a free space extent large enough,
it falls back to searching for a large enough extent near to where you
are asking, with an increasing search radius.

Once a free space extent is found, it is then trimmed for alignment to
stripe unit/stripe width. This generally leaves small, isolated chunks
of free space behind, because allocations are typically not stripe
unit/width length. Hence you end up with lots of little holes around.

Subsequent sequential allocations use an exact block allocation target
to try to extend the contiguous allocation each file already has. For
large files, this tends to keep the files contiguous, or at least made
up of multiple large extents rather than lots of small ones.

Then unrelated metadata allocations - inodes, btree blocks, directory
blocks or attributes - tend to fill those little holes. If there
aren't little holes (or you aren't using alignment), they simply sit
between data extents. When you then free the allocated data space,
you've still got that unrelated metadata lying around, and the free
space is now somewhat fragmented. This pattern gets worse as the
filesystem ages.

Delayed allocation reduces the impact of this problem because it
reduces the amount of on-disk metadata modification that occurs during
normal operation. It also allows things like directory and inode
extent allocation during creates (e.g. untarring) to avoid
interleaving with data allocations, so directory and inode extents
tend to cluster, stay contiguous, and not fill holes between data
extents. This means you are less likely to get sparse metadata blocks
fragmenting free space, metadata read and write IO is more likely to
be clustered effectively (better IO performance), and so on. IOWs,
there are many reasons why delayed allocation reduces the effects of
filesystem aging compared to up-front preallocation....
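As another aside, if anyone wants to see how their own files have
actually been laid out, xfs_bmap -v will dump the extent map for a
file, and the generic FIEMAP ioctl (the interface filefrag uses) will
give you an extent count on any filesystem that supports it. A rough,
untested sketch, with error handling kept to a minimum:

/*
 * Count the on-disk extents of a file via FS_IOC_FIEMAP.  Rough,
 * untested sketch - filefrag does the same job with more detail.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>           /* FS_IOC_FIEMAP */
#include <linux/fiemap.h>       /* struct fiemap, FIEMAP_FLAG_SYNC */

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	struct fiemap fm;
	memset(&fm, 0, sizeof(fm));
	fm.fm_start = 0;
	fm.fm_length = ~0ULL;           /* map the whole file */
	fm.fm_flags = FIEMAP_FLAG_SYNC; /* flush delalloc first */
	fm.fm_extent_count = 0;         /* 0 = just count, don't copy records */

	if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0) {
		perror("FS_IOC_FIEMAP");
		return 1;
	}

	printf("%s: %u extents\n", argv[1], fm.fm_mapped_extents);
	return 0;
}

The FIEMAP_FLAG_SYNC flag matters here: without it, data still sitting
in delayed allocation is reported as delalloc extents with no physical
location yet, rather than as real on-disk extents.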
> I have a few extremely large files (many GB) that cannot be allocated
> a contiguous area. That's probably because of xfs's strategy of
> scattering files around disk to allow room for growth, which fragments
> the free space.

I doubt it. An extent can be at most 8GB on a 4kB block size
filesystem, so that's why you see multiple extents for large files -
i.e. they require multiple allocations....

> You seem to take personal offense to my use of fallocate(), which is
> hardly my intention.

Nothing personal at all.

> Did you perhaps write the xfs preallocation code
> that I'm bypassing?

No. People much smarter than me designed and wrote all this stuff.

What I'm commenting on is your implication (sweeping generalisation)
that preallocation should be used everywhere because it seems to work
for you. I don't like to let such statements stand unchallenged,
especially when there are very good reasons why they are likely to be
wrong.

I don't do this for my benefit - and I don't really care whether you
benefit from it or not - but there are a lot of XFS users on this list
who might be wondering "why isn't that done by default?". Those people
learn a lot from someone explaining why what one person finds
beneficial for their use case might be considered harmful to everyone
else...

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx