On Mon, 2010-10-04 at 21:13 +1100, Dave Chinner wrote:
> When multiple concurrent streaming writes land in the same AG,
> allocation of extents interleaves between inodes and causes
> excessive fragmentation of the files being written.  Instead of
> getting maximally sized extents, we get writeback-range-sized
> extents interleaved on disk.  That is, for four files A, B, C and
> D, we'll end up with extents like:
>
> +---+---+---+---+---+---+---+---+---+---+---+---+
>  A1  B1  C1  D1  A2  B2  C2  A3  D2  C3  B3  D3  .....
>
> instead of:
>
> +-----------+-----------+-----------+-----------+
>       A           B           C           D
>
> It is well known that using the allocsize mount option makes the
> allocator behaviour much better and more likely to result in the
> second layout above than the first, but that doesn't work in all
> situations (e.g. writes from the NFS server).  I think that we
> should not be relying on manual configuration to solve this
> problem.

. . . (deleting some of your demonstration detail)

> The same results occur for tests running 16 and 64 sequential
> writers into the same AG - extents of 8GB in all files, so this is
> a major improvement in default behaviour and effectively means we
> do not need the allocsize mount option anymore.
>
> Worth noting is that the extents still interleave between files -
> that problem still exists - but the size of the extents now means
> that sequential read and write rates are not going to be affected
> by excessive seeks between extents within each file.

Just curious--do we have any current and meaningful information
about the trade-off between the size of an extent and seek time?
Obviously maximizing the extent size maximizes the bang (data read)
for the buck (seek cost), but can we quantify that with current
storage device specs?  (This is really a theoretical aside--see the
back-of-the-envelope sketch below.)

> Given this demonstrably improves allocation patterns, the only
> question that remains in my mind is exactly what algorithm to use
> to scale the preallocation.  The current patch records the last
> prealloc size and increases the next one from that.  While that
> provides good results, it will cause problems when interacting
> with truncation.  It also means that a file may have a substantial
> amount of preallocation beyond EOF - maybe several times the size
> of the file.

I honestly haven't looked into this yet, but can you expand on the
truncation problems you mention?  Is it that the preallocated blocks
should be dropped and the scaling algorithm reset when a truncation
occurs, or something like that?

> However, the current algorithm does work well when writing lots of
> relatively small files (e.g. up to a few tens of megabytes), as
> increasing the preallocation size fast reduces the chances of
> interleaving small allocations.

One thing that I keep wondering about as I think about this is what
the effect is as the file system (or AG) gets full, and what level
of "full" is enough to make any adverse effects of a change like
this start to show up.

The other thing is, what sort of workloads are reasonable things to
use to gauge the effect?  NFS is perhaps common, but it's unique in
how it closes files all the time.  What happens when there's a more
"normal" (non-NFS) workload?  For AGs with enough free space I
suppose it's a win overall.
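Coming back to the extent size versus seek time question: for what
it's worth, below is the naive model I have in mind.  It is only a
sketch--the 8ms average seek and 150MB/s streaming rate are assumed
round numbers for a current SATA drive, not measurements, and it
charges exactly one seek per extent:

#include <stdio.h>

/*
 * Toy model of the extent size vs. seek time trade-off: what
 * fraction of the device's streaming rate do we keep if reading
 * each extent costs one average seek?
 */
int main(void)
{
	double seek_ms = 8.0;		/* assumed average seek time */
	double stream_mbps = 150.0;	/* assumed sequential rate, MB/s */
	double extent_mb;

	for (extent_mb = 1; extent_mb <= 8192; extent_mb *= 4) {
		double xfer_ms = extent_mb / stream_mbps * 1000.0;
		double efficiency = xfer_ms / (xfer_ms + seek_ms);

		printf("%7.0fMB extents: %5.1f%% of streaming rate\n",
		       extent_mb, 100.0 * efficiency);
	}
	return 0;
}

By that model a 4MB extent already retains about 77% of the
streaming rate, and anything past a few hundred megabytes is within
a couple percent of the peak--which is why I call this theoretical.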
> I've been thinking that basing the preallocation size on the
> current file size - say preallocate half the size of the file - is
> a better option once file sizes start to grow large (more than a
> few tens of megabytes), so maybe a combination of the two is a
> better idea (increase exponentially up to default^2 (4MB
> prealloc), then take min(max(i_size / 2, default^2), XFS_MAXEXTLEN)
> as the prealloc size so that we don't do excessive amounts of
> preallocation)?

I think basing it on the file size is a good idea; it scales the
(initial) preallocation size to the specific file.  This assumes
that files tend to grow by amounts comparable to their size rather
than suddenly and dramatically changing.  That seems reasonable, but
I have nothing empirical to back up that assumption.  Similarly, the
assumption that once a file starts to grow you should rapidly
increase the EOF preallocation goal seems sensible--certainly for
the hindsight case of a stream of appends--but I have no proof that
a normal use case wouldn't trigger this algorithm when it might be
better not to.  (I've sketched my reading of the combined heuristic
at the end of this message.)

> --
>
> We need to make the same write patterns result in equivalent
> allocation patterns even when they come through the NFS server.
> Right now the NFS server uses a file descriptor for each write
> that comes across the wire.  This means that the ->release
> function is called after every write, and that means XFS will be
> truncating away the speculative preallocation it did during the
> write.  Hence we get interleaving files and fragmentation.

It could be useful to base the behavior on actual knowledge that a
file system is being exported by NFS.  But it may well be that other
applications (like shell scripts that loop and append to the same
file repeatedly) would benefit as well.

> To avoid this problem, detect when the ->release function is being
> called repeatedly on an inode that has delayed allocation
> outstanding.  If this happens twice in a row, then don't truncate
> the speculative allocation away.  This ensures that the
> speculative preallocation is preserved until the delalloc blocks
> are converted to real extents during writeback.

. . .

I have a few other comments in my reviews of your two patches.

. . .

> Comments welcome.

You got some...

					-Alex
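To be sure we're talking about the same algorithm, here is the
combined heuristic above as I read it, in illustrative C only.  The
names and the byte units are mine: PREALLOC_DEFAULT is a made-up
stand-in for the default preallocation size, the 4MB figure is your
"default^2", and the 8GB cap is just XFS_MAXEXTLEN expressed in
bytes assuming 4k filesystem blocks:

/* All names here are invented for illustration, not the XFS code. */
#define PREALLOC_DEFAULT	(64ULL * 1024)		/* assumed */
#define PREALLOC_RAMP_CAP	(4ULL * 1024 * 1024)	/* "default^2" */
#define MAX_EXTENT_BYTES	(8ULL * 1024 * 1024 * 1024)

static unsigned long long
prealloc_size(unsigned long long isize, unsigned long long last_prealloc)
{
	unsigned long long want;

	if (last_prealloc < PREALLOC_RAMP_CAP) {
		/* Exponential phase: double whatever we did last time. */
		want = last_prealloc ? last_prealloc * 2 : PREALLOC_DEFAULT;
		if (want > PREALLOC_RAMP_CAP)
			want = PREALLOC_RAMP_CAP;
		return want;
	}

	/* Size-based phase: min(max(i_size/2, default^2), MAXEXTLEN). */
	want = isize / 2;
	if (want < PREALLOC_RAMP_CAP)
		want = PREALLOC_RAMP_CAP;
	if (want > MAX_EXTENT_BYTES)
		want = MAX_EXTENT_BYTES;
	return want;
}

If that reading is right, a file takes a handful of writes to ramp
up to the 4MB floor before the size-based term kicks in, which seems
like it would keep the small-file case you mention well behaved.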
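Likewise, here's the ->release behaviour as I understand your
description--again a made-up sketch, not the patch itself; the
structure, flag, and helper below are invented for illustration:

#include <stdbool.h>

/* Illustrative stand-ins only; none of this is the actual XFS code. */
struct ex_inode {
	unsigned int	flags;
	bool		has_delalloc;	/* delalloc blocks outstanding? */
};
#define EX_DELALLOC_AT_LAST_RELEASE	0x1

/* Stand-in for trimming speculative preallocation beyond EOF. */
static void ex_trim_eof_prealloc(struct ex_inode *ip) { }

/*
 * Called from ->release.  The first close that finds delalloc
 * outstanding trims as before but remembers that it did; a second
 * close in a row in the same state looks like the NFS
 * open-write-close pattern, so the preallocation is left alone
 * until writeback converts the delalloc blocks to real extents.
 */
static void ex_release(struct ex_inode *ip)
{
	if (ip->has_delalloc) {
		if (ip->flags & EX_DELALLOC_AT_LAST_RELEASE)
			return;		/* keep the speculative prealloc */
		ip->flags |= EX_DELALLOC_AT_LAST_RELEASE;
	} else {
		ip->flags &= ~EX_DELALLOC_AT_LAST_RELEASE;
	}
	ex_trim_eof_prealloc(ip);
}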