On Thu, Nov 21, 2013 at 08:31:38AM -0500, Martin Boutin wrote:
> $ uname -a
> Linux haswell1 3.10.10 #1 SMP PREEMPT Wed Oct 2 11:22:22 CEST 2013
> i686 GNU/Linux

Oh, it's a 32 bit system. Things you don't know from the obfuscating
codenames everyone uses these days...

> $ mkfs.xfs -s size=4096 -f -l size=32m /dev/md0
> $ mount -t xfs /dev/md0 /tmp/diskmnt/
> $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
> 1000+0 records in
> 1000+0 records out
> 1048576000 bytes (1.0 GB) copied, 28.0304 s, 37.4 MB/s
....
> $ cat /proc/mounts
> (...)
> /dev/md0 /tmp/diskmnt xfs
> rw,relatime,attr2,inode64,sunit=1024,swidth=2048,noquota 0 0

sunit/swidth is 512k/1MB

> # same layout for other disks
> $ fdisk -c -u /dev/sda
....
>    Device Boot      Start         End      Blocks   Id  System
> /dev/sda1            2048    20565247    10281600   83  Linux

Aligned to 1 MB.

> /dev/sda2        20565248  1953525167   966479960   83  Linux

And that isn't aligned to 1MB. 20565248 / 2048 = 10041.625. It is
aligned to 4k, though, so there shouldn't be any hardware RMW
cycles.

> $ xfs_info /dev/md0
> meta-data=/dev/md0               isize=256    agcount=32, agsize=15101312 blks
>          =                       sectsz=4096  attr=2
> data     =                       bsize=4096   blocks=483239168, imaxpct=5
>          =                       sunit=12

sunit/swidth of 512k/1MB, so it matches the MD device.

> $ xfs_bmap -vvp /tmp/diskmnt/filewr.zero
> /tmp/diskmnt/filewr.zero:
>  EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET           TOTAL FLAGS
>    0: [0..2047999]:    2049056..4097055    0 (2049056..4097055) 2048000 01111
>  FLAG Values:
>     010000 Unwritten preallocated extent
>     001000 Doesn't begin on stripe unit
>     000100 Doesn't end on stripe unit
>     000010 Doesn't begin on stripe width
>     000001 Doesn't end on stripe width
> # this does not look good, does it?

Yup, looks broken.

/me digs through git.

Yup, commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") broke
the code that sets stripe unit alignment for the initial allocation
way back in 3.2.
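
The alignment arithmetic above can be rechecked with a few lines of
Python. This is only a sketch: the helper names are made up for
illustration, and it assumes that both the fdisk start sectors and the
sunit/swidth values in /proc/mounts are counted in 512-byte units.

```python
# Assumption: fdisk -u start sectors and the XFS sunit/swidth mount
# options are both expressed in 512-byte units.
SECTOR_BYTES = 512
KIB = 1024
MIB = 1024 * KIB


def sectors_to_bytes(sectors):
    """Convert a count of 512-byte sectors to bytes."""
    return sectors * SECTOR_BYTES


def is_aligned(start_sector, boundary_bytes):
    """True if the byte offset of start_sector is a multiple of boundary_bytes."""
    return sectors_to_bytes(start_sector) % boundary_bytes == 0


# Partition start sectors taken from the fdisk output quoted above.
for name, start in (("sda1", 2048), ("sda2", 20565248)):
    print(name,
          "1MiB-aligned:", is_aligned(start, MIB),
          "4KiB-aligned:", is_aligned(start, 4 * KIB))

# sunit=1024,swidth=2048 from /proc/mounts, converted to KiB.
print("sunit KiB:", sectors_to_bytes(1024) // KIB,
      "swidth KiB:", sectors_to_bytes(2048) // KIB)
```

With the numbers from this thread, sda1 passes both checks, sda2 passes
only the 4k check, and the mount options come out as 512k/1MB.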

[ Hmmm, that would explain the very occasional failure that
generic/223 throws out (maybe once a month I see it fail). ]

Which means MD is doing RMW cycles for its parity calculations, and
that's where performance is going south.

Current code:

$ xfs_io -fd -c "truncate 0" -c "falloc 0 1g" -c "bmap -vvp" -c "pwrite 0 1g -b 1280k" testfile
testfile:
 EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET          TOTAL FLAGS
   0: [0..2097151]:    1056..2098207       0 (1056..2098207)   2097152 11111
 FLAG Values:
    010000 Unwritten preallocated extent
    001000 Doesn't begin on stripe unit
    000100 Doesn't end on stripe unit
    000010 Doesn't begin on stripe width
    000001 Doesn't end on stripe width
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 1024 ops; 0:00:02.00 (343.815 MiB/sec and 268.6054 ops/sec)
$

Which indicates that even if we take direct IO based allocation out
of the picture, the allocation does not get aligned properly. This
is on a 3.5TB 12 SAS disk MD RAID6 with sunit=64k,swidth=640k.

With a fixed kernel:

$ xfs_io -fd -c "truncate 0" -c "falloc 0 1g" -c "bmap -vvp" -c "pwrite 0 1g -b 1280k" testfile
testfile:
 EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET            TOTAL FLAGS
   0: [0..2097151]:    6293504..8390655    0 (6293504..8390655)  2097152 10000
 FLAG Values:
    010000 Unwritten preallocated extent
    001000 Doesn't begin on stripe unit
    000100 Doesn't end on stripe unit
    000010 Doesn't begin on stripe width
    000001 Doesn't end on stripe width
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 820 ops; 0:00:02.00 (415.192 MiB/sec and 332.4779 ops/sec)
$

It's clear we have completely stripe width aligned allocation and
it's 25% faster.
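
The FLAGS column in the bmap output above can be decoded mechanically
against the "FLAG Values" legend. A rough sketch follows; decode_flags
is a hypothetical helper, not part of xfs_io, and it assumes the column
is an octal rendering of the mask values listed in the legend.

```python
# Mask values copied from the "FLAG Values" legend printed by bmap -vvp.
FLAG_MEANINGS = [
    (0o10000, "Unwritten preallocated extent"),
    (0o01000, "Doesn't begin on stripe unit"),
    (0o00100, "Doesn't end on stripe unit"),
    (0o00010, "Doesn't begin on stripe width"),
    (0o00001, "Doesn't end on stripe width"),
]


def decode_flags(field):
    """Return the legend entries that are set in a FLAGS value such as '01111'."""
    # Assumption: the FLAGS column is the mask printed in octal.
    value = int(field, 8)
    return [text for mask, text in FLAG_MEANINGS if value & mask]


# FLAGS values from the outputs above: the original report, the current
# code with falloc, and the fixed kernel with falloc.
for field in ("01111", "11111", "10000"):
    print(field, "->", decode_flags(field))
```

Read this way, 10000 from the fixed kernel means the extent is still
unwritten preallocation but none of the four misalignment flags is set.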

Take fallocate out of the picture so the direct IO does the
allocation:

$ xfs_io -fd -c "truncate 0" -c "pwrite 0 1g -b 1280k" -c "bmap -vvp" testfile
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 820 ops; 0:00:02.00 (368.241 MiB/sec and 294.8807 ops/sec)
testfile:
 EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET            TOTAL FLAGS
   0: [0..2097151]:    2099200..4196351    0 (2099200..4196351)  2097152 00000
 FLAG Values:
    010000 Unwritten preallocated extent
    001000 Doesn't begin on stripe unit
    000100 Doesn't end on stripe unit
    000010 Doesn't begin on stripe width
    000001 Doesn't end on stripe width

It's slower than with preallocation (no surprise - no allocation
overhead per write(2) call after preallocation is done) but the
allocation is still correctly aligned.

The patch below should fix the unaligned allocation problem you are
seeing, but because XFS defaults to stripe unit alignment for large
allocations, you might still see RMW cycles when it aligns to a
stripe unit that is not the first in an MD stripe. I'll have a quick
look at fixing that behaviour when the swalloc mount option is
specified....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

xfs: align initial file allocations correctly.

From: Dave Chinner <dchinner@xxxxxxxxxx>

The function xfs_bmap_isaeof() is used to indicate that an
allocation is occurring at or past the end of file, and as such
should be aligned to the underlying storage geometry if possible.

Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the
behaviour of this function for empty files - it turned off
allocation alignment for this case accidentally. Hence large initial
allocations from direct IO are not getting correctly aligned to the
underlying geometry, and that is causing write performance to drop in
alignment sensitive configurations.

Fix it by considering allocation into empty files as requiring
aligned allocation again.

Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
---
 fs/xfs/xfs_bmap.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c
index 3ef11b2..8401f11 100644
--- a/fs/xfs/xfs_bmap.c
+++ b/fs/xfs/xfs_bmap.c
@@ -1635,7 +1635,7 @@ xfs_bmap_last_extent(
  * blocks at the end of the file which do not start at the previous data block,
  * we will try to align the new blocks at stripe unit boundaries.
  *
- * Returns 0 in bma->aeof if the file (fork) is empty as any new write will be
+ * Returns 1 in bma->aeof if the file (fork) is empty as any new write will be
  * at, or past the EOF.
  */
 STATIC int
@@ -1650,9 +1650,14 @@ xfs_bmap_isaeof(
 	bma->aeof = 0;
 	error = xfs_bmap_last_extent(NULL, bma->ip, whichfork, &rec,
 				     &is_empty);
-	if (error || is_empty)
+	if (error)
 		return error;
 
+	if (is_empty) {
+		bma->aeof = 1;
+		return 0;
+	}
+
 	/*
 	 * Check if we are allocation or past the last extent, or at least into
 	 * the last delayed allocated extent.