On Tue, Apr 8, 2014 at 1:31 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Mon, Apr 07, 2014 at 11:42:02PM -0400, Keyur Govande wrote:
>> On Mon, Apr 7, 2014 at 9:50 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>> > [cc the XFS mailing list <xfs@xxxxxxxxxxx>]
>> >
>> > On Mon, Apr 07, 2014 at 06:53:46PM -0400, Keyur Govande wrote:
>> >> Hello,
>> >>
>> >> I'm currently investigating a MySQL performance degradation on XFS due
>> >> to file fragmentation.
>> >>
>> >> The box has a 16 drive RAID 10 array with a 1GB battery backed cache,
>> >> running on a 12 core machine.
>> >>
>> >> xfs_info shows:
>> >> meta-data=/dev/sda4            isize=256    agcount=24, agsize=24024992 blks
>> >>          =                     sectsz=512   attr=2, projid32bit=0
>> >> data     =                     bsize=4096   blocks=576599552, imaxpct=5
>> >>          =                     sunit=16     swidth=512 blks
>> >> naming   =version 2            bsize=4096   ascii-ci=0
>> >> log      =internal             bsize=4096   blocks=281552, version=2
>> >>          =                     sectsz=512   sunit=16 blks, lazy-count=1
>> >> realtime =none                 extsz=4096   blocks=0, rtextents=0
>> >>
>> >> The kernel version is 3.14.0-1.el6.elrepo.x86_64 and the XFS
>> >> partition is mounted with rw,noatime,allocsize=128m,inode64,swalloc.
>> >> The partition is 2TB in size and 40% full to simulate production.
>> >>
>> >> Here's a test program that appends 512KB like MySQL does (write and
>> >> then fsync). To exacerbate the issue, it loops a bunch of times:
>> >> https://gist.github.com/keyurdg/961c19175b81c73fdaa3
>> >>
>> >> When run, this creates ~9500 extents, most of length 1024.
>> >
>> > 1024 of what? Most likely it is 1024 basic blocks, which is 512KB,
>> > the size of your writes.
>>
>> Yeah, 1024 basic blocks of 512 bytes each.
>>
>> >
>> > Could you post the output of the xfs_bmap commands you are using to
>> > get this information?
>>
>> I'm getting the extent information via xfs_bmap -v <file name>. Here's
>> a sample: https://gist.github.com/keyurdg/291b2a429f03c9a649ad
>
> Yup, looks like fragmented free space, so it's only finding islands
> of 512kb of freespace near to the inode to allocate out of.
>
> Can you post the output of /proc/mounts so I can check what
> allocator behaviour is being used?
>
>> >> cat'ing the
>> >> file to /dev/null after dropping the caches reads at an average of 75
>> >> MBps, way less than the hardware is capable of.
>> >
>> > What you are doing is "open-seekend-write-fsync-close". You haven't
>> > told the filesystem you are doing append writes (O_APPEND, or the
>> > append inode flag) so it can't optimise for them.
>>
>> I tried this; adding O_APPEND to the open() in the pathological
>> pwrite.c makes no difference to the extent allocation and hence the
>> read performance.
>
> Yeah, I had a look at what XFS does, and in the close path it doesn't
> know that the FD was O_APPEND because that state is available to the
> ->release path.
>
>> > You are also cleaning the file before closing it, so you are
>> > defeating the current heuristics that XFS uses to determine whether
>> > to remove speculative preallocation on close() - if the inode is
>> > dirty at close(), then it won't be removed. Hence speculative
>> > preallocation does nothing for your IO pattern (i.e. the allocsize
>> > mount option is completely useless). Remove the fsync and you'll
>> > see your fragmentation problem go away completely.
>>
>> I agree, but the MySQL data files (*.ibd) on our production cluster
>> are appended to in bursts and they have thousands of tiny (512KB)
>> extents. Getting rid of fsync is not possible given the use case.
>
> Sure - just demonstrating that it's the fsync that is causing the
> problems, i.e. it's application driven behaviour that the filesystem
> can't easily detect and optimise...
>
>> Arguably, MySQL does not close the files, but it writes out
>> infrequently enough that I couldn't make a good and small test case
>> for it. But the output of xfs_bmap is exactly the same as that of
>> pwrite.c
>
> Once you've fragmented free space, the only way to defrag it is to
> remove whatever is using the space between the small freespace
> extents. Usually the condition occurs when you intermix long lived
> files with short lived files - removing the short lived files
> results in fragmented free space that cannot be made contiguous
> until both the short lived and long lived data has been removed.
>
> If you want an idea of whether you've fragmented free space, use
> the xfs_db freesp command. To see what each AG looks like
> (change it to iterate all the AGs in your fs):
>
> $ for i in 0 1 2 3; do echo "*** AG $i:" ; sudo xfs_db -c "freesp -a $i -s" /dev/vda; done
> *** AG 0:
>    from      to extents  blocks    pct
>       1       1     129     129   0.02
>       2       3     119     283   0.05
>       4       7     125     641   0.11
>       8      15      93     944   0.16
>      16      31      64    1368   0.23
>      32      63      53    2300   0.39
>      64     127      21    1942   0.33
>     128     255      16    3145   0.53
>     256     511       6    1678   0.28
>     512    1023       1     680   0.11
>   16384   32767       1   23032   3.87
>  524288 1048576       1  558825  93.93
> total free extents 629
> total free blocks 594967
> average free extent size 945.893
> *** AG 1:
>    from      to extents  blocks    pct
>       1       1     123     123   0.01
>       2       3     125     305   0.04
>       4       7      79     418   0.05
> ......
>
> And that will tell us what state your filesystem is in w.r.t.
> freespace fragmentation...
>
>> >> When I add a posix_fallocate before calling pwrite() as shown here
>> >> https://gist.github.com/keyurdg/eb504864d27ebfe7b40a the file
>> >> fragments an order of magnitude less (~30 extents), and cat'ing to
>> >> /dev/null proceeds at ~1GBps.
>> >
>> > That should make no difference on XFS as you are only preallocating
>> > the 512KB region beyond EOF that you are about to write into, and
>> > hence both delayed allocation and preallocation have the same
>> > allocation target (the current EOF block). Hence in both cases the
>> > allocation patterns should be identical if the freespace extents they
>> > are being allocated out of are identical.
>> >
>> > Did you remove the previous test files and sync the filesystem
>> > between test runs so that the available freespace was identical for
>> > the different test runs? If you didn't, then the filesystem allocated
>> > the files out of different free space extents and hence you'll get
>> > different allocation patterns...
>>
>> I do clear everything and sync the FS before every run, and this is
>> reproducible across multiple machines in our cluster.
>
> Which indicates that you've probably already completely fragmented
> free space in the filesystems.
>
>> I've re-run the
>> programs at least a 1000 times now, and every time I get the same
>> results. For some reason even the tiny 512KB fallocate() seems to be
>> triggering some form of extent "merging" and placement.
>
> Both methods of allocation should be doing the same thing - they use
> exactly the same algorithm to select the next extent to allocate.
> Can you tell me:
>
> a) the inode number of each of the target files that show
>    different output
> b) the xfs_bmap output of the different files.
>
>> > Alternatively, set an extent size hint on the log files to define
>> > the minimum sized allocation (e.g.
>> > 32MB) and this will limit
>> > fragmentation without you having to modify the MySQL code at all...
>> >
>>
>> I tried enabling extsize to 32MB, but it seems to make no difference.
>> [kgovande@host]# xfs_io -c "extsize" /var/lib/mysql/xfs/plain_pwrite.werr
>> [33554432] /var/lib/mysql/xfs/plain_pwrite.werr
>> [kgovande@host]# xfs_bmap -v /var/lib/mysql/xfs/*.werr | wc -l
>> 20001
>> [kgovande@host]# sync; echo 3 > /proc/sys/vm/drop_caches; pv plain_pwrite.werr > /dev/null
>> 9.77GB 0:02:41 [61.7MB/s] [========================================>] 100%
>
> Ah, extent size hints are not being considered in
> xfs_can_free_eofblocks(). I suspect they should be, and that would
> fix the problem.
>
> Can you add this to xfs_can_free_eofblocks() in your kernel and see
> what happens?
>
>
> 	/* prealloc/delalloc exists only on regular files */
> 	if (!S_ISREG(ip->i_d.di_mode))
> 		return false;
>
> +	if (xfs_get_extsz_hint(ip))
> +		return false;
> +
> 	/*
> 	 * Zero sized files with no cached pages and delalloc blocks will not
> 	 * have speculative prealloc/delalloc blocks to remove.
> 	 */
>
> If that solves the problem, then I suspect that we might need to
> modify this code to take into account the allocsize mount option as
> well...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx

Hey Dave,

I spent some more time figuring out the MySQL write semantics: it doesn't
open/close files often, and my initial test script was incorrect. MySQL uses
O_DIRECT and appends to the file. I modified my test binary to take this into
account (a rough sketch of its write loop is included further below):
https://gist.github.com/keyurdg/54e0613e27dbe7946035

I've been testing on the 3.10 kernel. The setup is an empty 2 TB XFS partition.

[root@dbtest09 linux-3.10.37]# xfs_info /dev/sda4
meta-data=/dev/sda4              isize=256    agcount=24, agsize=24024992 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=576599552, imaxpct=5
         =                       sunit=16     swidth=512 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=281552, version=2
         =                       sectsz=512   sunit=16 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

[root@dbtest09 linux-3.10.37]# cat /proc/mounts
rootfs / rootfs rw 0 0
proc /proc proc rw,relatime 0 0
sysfs /sys sysfs rw,relatime 0 0
devtmpfs /dev devtmpfs rw,relatime,size=49573060k,nr_inodes=12393265,mode=755 0 0
devpts /dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /dev/shm tmpfs rw,relatime 0 0
/dev/sda2 / ext3 rw,noatime,errors=continue,user_xattr,acl,barrier=1,data=ordered 0 0
/dev/sda1 /boot ext3 rw,relatime,errors=continue,user_xattr,acl,barrier=1,data=ordered 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
/dev/sda4 /var/lib/mysql xfs rw,noatime,swalloc,attr2,inode64,logbsize=64k,sunit=128,swidth=4096,noquota 0 0

[root@dbtest09 linux-3.10.37]# xfs_io -c "extsize" /var/lib/mysql/xfs/
[0] /var/lib/mysql/xfs/

Here's what the first 3 AGs look like:
https://gist.github.com/keyurdg/82b955fb96b003930e4f

After a run of the dpwrite program, here's what the bmap looks like:
https://gist.github.com/keyurdg/11196897

The files have interleaved nicely with each other, mostly in
XFS_IEXT_BUFSZ-sized extents. The average read speed is 724 MBps. After
defragmenting the file down to a single extent, the speed improves by 30%
to 1.09 GBps.

I noticed that XFS chooses the AG based on the parent directory's AG, and
only moves on to the sequentially next AG if there's no space available.
A small patch that chooses the AG randomly fixes the fragmentation issue
very nicely.
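
For reference, here is a minimal sketch of the kind of O_DIRECT append-and-fsync
loop being measured; the linked dpwrite gist is the authoritative version, and
the file name, 512KB chunk size, alignment, and iteration count below are
placeholder assumptions for illustration only:

/*
 * Illustrative sketch only (not the linked gist): append fixed-size chunks
 * with O_DIRECT and fsync after each write. File name, chunk size and
 * iteration count are assumptions chosen for the example.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK	(512 * 1024)	/* 512KB per append, like the MySQL pattern */
#define ITERS	2048		/* ~1GB total */

int main(void)
{
	void *buf;
	off_t off = 0;
	int fd, i;

	/* O_DIRECT needs the buffer, offset and length aligned to the sector size */
	if (posix_memalign(&buf, 4096, CHUNK))
		return 1;
	memset(buf, 'a', CHUNK);

	fd = open("dpwrite.test", O_CREAT | O_WRONLY | O_DIRECT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	for (i = 0; i < ITERS; i++) {
		if (pwrite(fd, buf, CHUNK, off) != CHUNK) {
			perror("pwrite");
			return 1;
		}
		fsync(fd);	/* durability after every append, as MySQL does */
		off += CHUNK;
	}

	close(fd);
	free(buf);
	return 0;
}

The pieces that matter for reproducing the behaviour discussed above are the
O_DIRECT open flag, the aligned buffer it requires, and the fsync after every
small append.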
All of the MySQL data files are in a single directory, and we see exactly this
in production: the parent inode's AG fills up first, then the sequentially
next one, and so on.

diff --git a/fs/xfs/xfs_ialloc.c b/fs/xfs/xfs_ialloc.c
index c8f5ae1..7841509 100644
--- a/fs/xfs/xfs_ialloc.c
+++ b/fs/xfs/xfs_ialloc.c
@@ -517,7 +517,7 @@ xfs_ialloc_ag_select(
 	 * to mean that blocks must be allocated for them,
 	 * if none are currently free.
 	 */
-	agno = pagno;
+	agno = ((xfs_agnumber_t) prandom_u32()) % agcount;
 	flags = XFS_ALLOC_FLAG_TRYLOCK;
 	for (;;) {
 		pag = xfs_perag_get(mp, agno);

I couldn't find guidance on how many allocation groups to use for a 2 TB
partition. This random selection won't scale to many hundreds of concurrently
written files, but for a few heavily written-to files it works nicely.

I noticed that for buffered (non-O_DIRECT) writes with an fsync after every
write, XFS cleverly keeps doubling the speculative allocation size as the file
grows. The "extsize" option seems a bit too static to me, because the size of
the tables we use varies widely and large new tables come and go. Could the
same doubling logic be applied to O_DIRECT writes as well?

I tried out this extremely rough patch based on the delayed write code; if you
think the approach is reasonable I can try to make it more acceptable. It
provides very nice performance indeed; for a 2GB file, here's what the bmap
looks like: https://gist.github.com/keyurdg/ac6ed8536f864c8fffc8

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 8f8aaee..2682f53 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -118,6 +118,16 @@ xfs_alert_fsblock_zero(
 	return EFSCORRUPTED;
 }
 
+STATIC int
+xfs_iomap_eof_want_preallocate(
+	xfs_mount_t	*mp,
+	xfs_inode_t	*ip,
+	xfs_off_t	offset,
+	size_t		count,
+	xfs_bmbt_irec_t	*imap,
+	int		nimaps,
+	int		*prealloc);
+
 int
 xfs_iomap_write_direct(
 	xfs_inode_t	*ip,
@@ -152,7 +162,32 @@ xfs_iomap_write_direct(
 	offset_fsb = XFS_B_TO_FSBT(mp, offset);
 	last_fsb = XFS_B_TO_FSB(mp, ((xfs_ufsize_t)(offset + count)));
 	if ((offset + count) > XFS_ISIZE(ip)) {
-		error = xfs_iomap_eof_align_last_fsb(mp, ip, extsz, &last_fsb);
+		xfs_extlen_t new_extsz = extsz;
+
+		if (!extsz) {
+			int prealloc;
+			xfs_bmbt_irec_t prealloc_imap[XFS_WRITE_IMAPS];
+
+			error = xfs_iomap_eof_want_preallocate(mp, ip, offset, count,
+						prealloc_imap, XFS_WRITE_IMAPS, &prealloc);
+
+			if (prealloc) {
+				xfs_fileoff_t temp_start_fsb;
+				int temp_imaps = 1;
+
+				temp_start_fsb = XFS_B_TO_FSB(mp, offset);
+				if (temp_start_fsb)
+					temp_start_fsb--;
+
+				error = xfs_bmapi_read(ip, temp_start_fsb, 1, prealloc_imap, &temp_imaps, XFS_BMAPI_ENTIRE);
+				if (error)
+					return XFS_ERROR(error);
+
+				new_extsz = prealloc_imap[0].br_blockcount << 1;
+			}
+		}
+
+		error = xfs_iomap_eof_align_last_fsb(mp, ip, new_extsz, &last_fsb);
 		if (error)
 			return XFS_ERROR(error);
 	} else {

Cheers,
Keyur.
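
As a footnote to the extent size hint discussion above, here is a minimal
sketch of how an application could set a per-file hint itself, roughly the
programmatic equivalent of xfs_io -c "extsize 32m" <file>. It assumes the
xfsprogs development headers are installed; the 32MB value and the file
handling are placeholders, and the hint generally has to be set before the
file has data blocks allocated:

/*
 * Illustrative sketch: set a per-file extent size hint via the XFS ioctls,
 * roughly what xfs_io -c "extsize 32m" <file> does. Assumes the xfsprogs
 * headers; the 32MB value and file name handling are placeholders.
 */
#include <xfs/xfs.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <stdio.h>

static int set_extsize_hint(int fd, unsigned int bytes)
{
	struct fsxattr fsx;

	if (ioctl(fd, XFS_IOC_FSGETXATTR, &fsx) < 0) {
		perror("XFS_IOC_FSGETXATTR");
		return -1;
	}
	fsx.fsx_xflags |= XFS_XFLAG_EXTSIZE;	/* honour fsx_extsize on this file */
	fsx.fsx_extsize = bytes;		/* hint in bytes, a multiple of the block size */
	if (ioctl(fd, XFS_IOC_FSSETXATTR, &fsx) < 0) {
		perror("XFS_IOC_FSSETXATTR");
		return -1;
	}
	return 0;
}

int main(int argc, char **argv)
{
	int fd;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	return set_extsize_hint(fd, 32 * 1024 * 1024) ? 1 : 0;
}

Setting the hint with XFS_XFLAG_EXTSZINHERIT on the parent directory instead
makes newly created files inherit it, which may suit a one-directory-per-database
layout better than touching each file.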