When multiple concurrent streaming writes land in the same AG, allocation of extents interleaves between inodes and causes excessive fragmentation of the files being written. Instead of getting maximally sized extents, we get writeback-range-sized extents interleaved on disk. That is, for four files A, B, C and D, we end up with extents like:

+---+---+---+---+---+---+---+---+---+---+---+---+
  A1  B1  C1  D1  A2  B2  C2  A3  D2  C3  B3  D3  .....

instead of:

+-----------+-----------+-----------+-----------+
      A           B           C           D

It is well known that using the allocsize mount option makes the allocator behaviour much better and more likely to result in the second layout above than the first, but that doesn't work in all situations (e.g. writes from the NFS server). I think we should not be relying on manual configuration to solve this problem.

To demonstrate, writing 4 x 64GB files in parallel (16TB volume, inode64 so all files land in the same AG, 700MB/s write speed):

$ for i in `seq 0 1 3`; do
> dd if=/dev/zero of=/mnt/scratch/test.$i bs=64k count=1048576 &
> done
....

results in:

$ for i in `seq 0 1 3`; do
> sudo xfs_bmap -vvp /mnt/scratch/test.$i | grep ": \[" | wc -l
> done
777
196
804
784
$

This shows an average extent size of roughly 80MB on three of the files, and 320MB on the other. The level of fragmentation varies throughout the files, and varies greatly from run to run.

To demonstrate allocsize=1g:

$ for i in `seq 0 1 3`; do
> sudo xfs_bmap -vvp /mnt/scratch/test.$i | grep ": \[" | wc -l
> done
64
64
64
64
$

Which is 64 x 1GB extents per file, as we would expect. However, we can do better than that. With this dynamic speculative preallocation patch:

$ for i in `seq 0 1 3`; do
> sudo xfs_bmap -vvp /mnt/scratch/test.$i | grep ": \[" | wc -l
> done
9
9
9
9
$

Which gives maximally sized 8GB extents (i.e. perfect):

$ sudo xfs_bmap -vv /mnt/scratch/test.0
/mnt/scratch/test.0:
 EXT: FILE-OFFSET             BLOCK-RANGE            AG AG-OFFSET                 TOTAL
   0: [0..16777207]:          96..16777303            0 (96..16777303)         16777208
   1: [16777208..33554295]:   91344616..108121703     0 (91344616..108121703)  16777088
   2: [33554296..50331383]:   158452968..175230055    0 (158452968..175230055) 16777088
   3: [50331384..67108471]:   225561320..242338407    0 (225561320..242338407) 16777088
   4: [67108472..83885559]:   292669672..309446759    0 (292669672..309446759) 16777088
   5: [83885560..100662647]:  359778024..376555111    0 (359778024..376555111) 16777088
   6: [100662648..117439735]: 426886376..443663463    0 (426886376..443663463) 16777088
   7: [117439736..134216823]: 510771816..527548903    0 (510771816..527548903) 16777088
   8: [134216824..134217727]: 594657256..594658159    0 (594657256..594658159)      904
$

The same results occur for tests running 16 and 64 sequential writers into the same AG - extents of 8GB in all files - so this is a major improvement in default behaviour, and it effectively means we do not need the allocsize mount option anymore. Worth noting is that the extents still interleave between files - that problem still exists - but the size of the extents now means that sequential read and write rates are not going to be affected by excessive seeks between extents within each file.

Given this demonstrably improves allocation patterns, the only question that remains in my mind is exactly what algorithm to use to scale the preallocation. The current patch records the last prealloc size and increases the next one from that. While that provides good results, it will cause problems when interacting with truncation. It also means that a file may have a substantial amount of preallocation beyond EOF - maybe several times the size of the file. However, the current algorithm does work well when writing lots of relatively small files (e.g. up to a few tens of megabytes), as increasing the preallocation size quickly reduces the chances of interleaving small allocations.

I've been thinking that basing the preallocation size on the current file size - say preallocating half the size of the file - is a better option once file sizes start to grow large (more than a few tens of megabytes). So maybe a combination of the two is a better idea: increase exponentially up to default^2 (4MB of prealloc), then take min(max(i_size / 2, default^2), XFS_MAXEXTLEN) as the prealloc size, so that we don't do excessive amounts of preallocation. Something like the sketch below.
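To make that combined heuristic concrete, here is a minimal userspace sketch of the two-phase sizing. This is illustrative only, not the patch itself: the function and constant names are made up, DEFAULT_PREALLOC_BLKS is an assumed default chosen so that default^2 matches the 4MB figure above (with 4KB filesystem blocks), and MAX_EXTLEN is the XFS_MAXEXTLEN value of 2^21 - 1 blocks.

#include <stdint.h>

#define DEFAULT_PREALLOC_BLKS	32ULL	   /* assumed default: 128KB in 4KB blocks, so default^2 = 4MB */
#define MAX_EXTLEN		2097151ULL /* XFS_MAXEXTLEN: 2^21 - 1 blocks, ~8GB at 4KB */

static uint64_t min64(uint64_t a, uint64_t b) { return a < b ? a : b; }
static uint64_t max64(uint64_t a, uint64_t b) { return a > b ? a : b; }

/*
 * Return the next speculative preallocation size in filesystem
 * blocks, given the previous prealloc size and the current file
 * size in blocks.
 */
static uint64_t next_prealloc(uint64_t last_prealloc, uint64_t isize_blks)
{
	uint64_t cap = DEFAULT_PREALLOC_BLKS * DEFAULT_PREALLOC_BLKS;

	/* Phase 1: double the last prealloc size until we reach default^2. */
	if (last_prealloc < cap)
		return min64(last_prealloc ? last_prealloc * 2
					   : DEFAULT_PREALLOC_BLKS, cap);

	/* Phase 2: min(max(i_size / 2, default^2), XFS_MAXEXTLEN). */
	return min64(max64(isize_blks / 2, cap), MAX_EXTLEN);
}

The exponential phase keeps small files from accumulating much prealloc beyond EOF, while the i_size/2 term bounds the preallocation by the file's own size and XFS_MAXEXTLEN caps it at a single maximally sized extent.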
--

We need to make the same write patterns result in equivalent allocation patterns even when they come through the NFS server. Right now the NFS server uses a file descriptor for each write that comes across the wire. This means that the ->release function is called after every write, and that means XFS will be truncating away the speculative preallocation it did during the write. Hence we get interleaved files and fragmentation all over again.

To avoid this problem, detect when the ->release function is being called repeatedly on an inode that has delayed allocation outstanding. If this happens twice in a row, then don't truncate the speculative preallocation away. This ensures that the speculative preallocation is preserved until the delalloc blocks are converted to real extents during writeback. (A rough sketch of the detection logic is appended at the end of this mail.)

The result of this is that concurrent files written via NFS will tend to have a small first extent (due to the speculative prealloc being truncated once), followed by 4-8GB extents that interleave identically to the local dd examples above. I have tested this for 4, 16 and 64 concurrent writers from multiple NFS clients. The result for 2 clients each writing 16 x 16GB files (32 all up):

$ for i in `seq 0 1 31`; do
> sudo xfs_bmap -vv /mnt/scratch/test.$i | grep ": \[" | wc -l
> done | uniq -c
      1 2
     31 3

Mostly a combination of 4GB and 8GB extents, instead of severe fragmentation. The typical layout was:

/mnt/scratch/test.1:
 EXT: FILE-OFFSET            BLOCK-RANGE            AG AG-OFFSET                TOTAL
   0: [0..8388607]:          225562280..233950887    0 (225562280..233950887)  8388608
   1: [8388608..25165815]:   410111608..426888815    0 (410111608..426888815) 16777208
   2: [25165816..33554431]:  896648152..905036767    0 (896648152..905036767)  8388616

These results are using NFSv3, and the per-file write rate is only ~3MB/s, so the dynamic preallocation can be seen to work at both high and low per-file write throughput.

Comments welcome.
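As promised above, a minimal sketch of the ->release detection follows. Again, this is illustrative only, not the actual patch: the struct, flag and helper names (inode_state, saw_release_with_delalloc, truncate_prealloc_past_eof) are made up for the example, and the real implementation would use a per-inode flag bit and the existing EOF-truncation path.

#include <stdbool.h>

/* Per-inode state, trimmed down to what the heuristic needs. */
struct inode_state {
	bool saw_release_with_delalloc;	/* set by a prior ->release call */
	bool has_delalloc;		/* delayed allocation outstanding? */
};

/* Stand-in for trimming speculative preallocation beyond EOF. */
static void truncate_prealloc_past_eof(struct inode_state *ip) { /* ... */ }

/* Called from ->release, i.e. on every file descriptor close. */
static void release_heuristic(struct inode_state *ip)
{
	if (ip->has_delalloc) {
		/*
		 * Second close in a row with delalloc still outstanding
		 * (e.g. the NFS server's open-write-close pattern):
		 * preserve the speculative preallocation.
		 */
		if (ip->saw_release_with_delalloc)
			return;
		/* First such close: truncate as before, but remember it. */
		ip->saw_release_with_delalloc = true;
		truncate_prealloc_past_eof(ip);
	} else {
		/* Writeback converted the delalloc blocks; reset state. */
		ip->saw_release_with_delalloc = false;
	}
}

The flag is only cleared once there is no delalloc left, which is why the preallocation survives the open-write-close loop until writeback converts the blocks, and why the first extent of each NFS-written file is small while the rest reach 4-8GB.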