On Tue, Jan 22, 2013 at 02:04:12PM -0500, Brian Foster wrote:
> On 01/21/2013 07:53 AM, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@xxxxxxxxxx>
> >
> > This is an RFC that follows up from a conversation Eric and I had on
> > IRC. The idea is to prevent EOF speculative preallocation from
> > triggering larger allocations on IO patterns of
> > truncate-to-zero-seek-write-seek-write-.... which results in
> > non-sparse files for large files. This, unfortunately, is the way cp
> > behaves when copying sparse files, and it results in sub-optimal
> > destination file layouts.
> >
> > What this code does is look at the current extent over the
> > new EOF location, and if it is a hole it turns off preallocation
> > altogether. To stop the next write from doing a large prealloc, it
> > takes the size of subsequent preallocations from the current size of
> > the existing EOF extent. IOWs, if you leave a hole in the file, it
> > resets preallocation behaviour to the same as if it was a zero size
> > file.
> >
> > I haven't fully tested this, so I'm not sure if it works exactly
> > like I think it should, but I wanted to get this out there to get
> > more eyes on it...
> >
>
> On a quick test, I didn't quite get the behavior documented below. Is it
> possible your test file had the initial extent preallocated from an xfs
> module with the current preallocation scheme?

No, I didn't run the test on an unmodified kernel. It is possible
that I didn't remove it or truncate it between identical tests or
tests with different offsets, though.

<reruns test on a freshly mkfs'd fs>

I get the same result as what I posted. Note that I am using a CRC
enabled kernel and filesystem here, and it's 17TB in size, but that
shouldn't affect the preallocation algorithm...

$ sudo mkfs.xfs -f -l size=131072b,sunit=8 -m crc=1 /dev/vdc
meta-data=/dev/vdc               isize=512    agcount=17, agsize=268435455 blks
         =                       sectsz=512   attr=2, projid32bit=0
         =                       crc=1
data     =                       bsize=4096   blocks=4563402735, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=131072, version=2
         =                       sectsz=512   sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
$ sudo mount -o nobarrier,logbsize=256k /dev/vdc /mnt/scratch
$ sudo xfs_io -f -c "pwrite 0 31m" -c "pwrite 33m 1m" -c "pwrite 128m 1m" -c "fiemap -v" /mnt/scratch/blah
wrote 32505856/32505856 bytes at offset 0
31 MiB, 7936 ops; 0.0000 sec (1.036 GiB/sec and 271501.8816 ops/sec)
wrote 1048576/1048576 bytes at offset 34603008
1 MiB, 256 ops; 0.0000 sec (738.007 MiB/sec and 188929.8893 ops/sec)
wrote 1048576/1048576 bytes at offset 134217728
1 MiB, 256 ops; 0.0000 sec (55.772 MiB/sec and 14277.7468 ops/sec)
/mnt/scratch/blah:
 EXT: FILE-OFFSET         BLOCK-RANGE        TOTAL FLAGS
   0: [0..65535]:         128..65663         65536   0x0
   1: [65536..67583]:     hole                2048
   2: [67584..133119]:    67712..133247      65536   0x0
   3: [133120..262143]:   hole              129024
   4: [262144..393215]:   262272..393343    131072   0x1
$

>
> What I see is that sequential writes to a file disable preallocation
> completely (so the first extent in the test below is 31m instead of
> 32m). Digging a bit further, it seemed to be due to start_fsb always
> being a hole. I hacked that a bit to read the extent of the block
> immediately previous to the write offset (instead of the inode size), e.g.:
>
> 	start_fsb = XFS_B_TO_FSBT(mp, offset);
> 	if (start_fsb)
> 		start_fsb--;
>
> ... and I seem to get expected behavior, at least in the simple xfs_io test.

I'll have a look at it if I get time before LCA, otherwise it will
be a couple of weeks before I get back to it.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs
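
As an aside on the start_fsb change quoted above, here is a minimal userspace sketch of the arithmetic involved. It is not the kernel code: b_to_fsbt() and prev_fsb() are hypothetical names used only for illustration, and b_to_fsbt() assumes that XFS_B_TO_FSBT() amounts to a truncating shift of the byte offset by the filesystem block-size log2 (12 for the bsize=4096 filesystem in the test).

```c
#include <stdint.h>

/*
 * Illustration only, not kernel code. b_to_fsbt() is a hypothetical
 * stand-in for XFS_B_TO_FSBT(): convert a byte offset to a filesystem
 * block number, truncating (rounding down).
 */
static inline uint64_t b_to_fsbt(uint64_t offset, unsigned int blocklog)
{
	return offset >> blocklog;
}

/*
 * Block looked up under the quoted change: the block immediately before
 * the write offset. The check keeps a write at offset 0 on block 0
 * instead of wrapping around.
 */
static inline uint64_t prev_fsb(uint64_t offset, unsigned int blocklog)
{
	uint64_t start_fsb = b_to_fsbt(offset, blocklog);

	if (start_fsb)
		start_fsb--;
	return start_fsb;
}
```

With 4096-byte blocks, a sequential write that continues at byte 32505856 (the end of the 31 MiB write in the test) maps to block 7936, the first block past the written data, whereas prev_fsb() returns 7935, the last written block. That appears consistent with the observation that an unadjusted start_fsb always lands on a hole for sequential writes.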