Re: XFS fragmentation on file append

On Mon, Apr 07, 2014 at 11:42:02PM -0400, Keyur Govande wrote:
> On Mon, Apr 7, 2014 at 9:50 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > [cc the XFS mailing list <xfs@xxxxxxxxxxx>]
> >
> > On Mon, Apr 07, 2014 at 06:53:46PM -0400, Keyur Govande wrote:
> >> Hello,
> >>
> >> I'm currently investigating a MySQL performance degradation on XFS due
> >> to file fragmentation.
> >>
> >> The box has a 16 drive RAID 10 array with a 1GB battery backed cache
> >> running on a 12 core box.
> >>
> >> xfs_info shows:
> >> meta-data=/dev/sda4              isize=256    agcount=24, agsize=24024992 blks
> >>          =                       sectsz=512   attr=2, projid32bit=0
> >> data     =                       bsize=4096   blocks=576599552, imaxpct=5
> >>          =                       sunit=16     swidth=512 blks
> >> naming   =version 2              bsize=4096   ascii-ci=0
> >> log      =internal               bsize=4096   blocks=281552, version=2
> >>          =                       sectsz=512   sunit=16 blks, lazy-count=1
> >> realtime =none                   extsz=4096   blocks=0, rtextents=0
> >>
> >> The kernel version is: 3.14.0-1.el6.elrepo.x86_64 and the XFS
> >> partition is mounted with: rw,noatime,allocsize=128m,inode64,swalloc.
> >> The partition is 2TB in size and 40% full to simulate production.
> >>
> >> Here's a test program that appends 512KB like MySQL does (write and
> >> then fsync). To exacerbate the issue, it loops a bunch of times:
> >> https://gist.github.com/keyurdg/961c19175b81c73fdaa3
> >>
> >> When run, this creates ~9500 extents most of length 1024.
> >
> > 1024 of what? Most likely it is 1024 basic blocks, which is 512KB,
> > the size of your writes.
> 
> Yeah, 1024 basic blocks of 512 bytes each.
> 
> >
> > Could you post the output of the xfs_bmap commands you are using to
> > get this information?
> 
> I'm getting the extent information via xfs_bmap -v <file name>. Here's
> a sample: https://gist.github.com/keyurdg/291b2a429f03c9a649ad

Yup, that looks like fragmented free space, so the allocator is only
finding islands of 512KB of free space near the inode to allocate out of.

Can you post the output of /proc/mounts so I can check which
allocator behaviour is being used?

> >> cat'ing the
> >> file to /dev/null after dropping the caches reads at an average of 75
> >> MBps, way less than the hardware is capable of.
> >
> > What you are doing is "open-seekend-write-fsync-close".  You haven't
> > told the filesystem you are doing append writes (O_APPEND, or the
> > append inode flag) so it can't optimise for them.
> 
> I tried this; adding O_APPEND to the open() in the pathological
> pwrite.c makes no difference to the extent allocation and hence the
> read performance.

Yeah, I had a look at what XFS does and in the close path it doesn't
know that the FD was O_APPEND because that state isn't taken into
account in the ->release path.
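
(As an aside: the "append inode flag" mentioned earlier is the
persistent per-inode flag that chattr +a sets, not the per-fd
O_APPEND. A minimal sketch of setting it from C is below; note that
it requires CAP_LINUX_IMMUTABLE and makes the file append-only for
every opener, so it may not be practical for MySQL data files. The
helper name is made up for illustration.)

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <unistd.h>

/* Set the persistent append-only inode flag (chattr +a equivalent).
 * Needs CAP_LINUX_IMMUTABLE; sketch only, not a drop-in fix. */
static int set_append_flag(const char *path)
{
	int fd, flags;

	fd = open(path, O_RDONLY);
	if (fd < 0) {
		perror("open");
		return -1;
	}
	if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) {
		perror("FS_IOC_GETFLAGS");
		close(fd);
		return -1;
	}
	flags |= FS_APPEND_FL;
	if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0) {
		perror("FS_IOC_SETFLAGS");
		close(fd);
		return -1;
	}
	close(fd);
	return 0;
}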

> > You are also cleaning the file before closing it, so you are
> > defeating the current heuristics that XFS uses to determine whether
> > to remove speculative preallocation on close() - if the inode is
> > dirty at close(), then it won't be removed. Hence speculative
> > preallocation does nothing for your IO pattern (i.e. the allocsize
> > mount option is completely useless). Remove the fsync and you'll
> > see your fragmentation problem go away completely.
> 
> I agree, but the MySQL data files (*.ibd) on our production cluster
> are appended to in bursts and they have thousands of tiny (512KB)
> extents. Getting rid of fsync is not possible given the use case.

Sure - just demonstrating that it's the fsync that is causing the
problems. i.e. it's application driven behaviour that the filesystem
can't easily detect and optimise...
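
(For anyone reading the archive without opening the gist: the IO
pattern under discussion boils down to roughly the loop below. This
is a sketch reconstructed from the description above - 512KB appends,
each followed by an fsync - so the file name, loop count and error
handling are placeholders, not the actual gist.)

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (512 * 1024)		/* 512KB per append, as described above */

int main(void)
{
	char *buf = malloc(CHUNK);
	int fd = open("testfile", O_WRONLY | O_CREAT, 0644);
	int i;

	if (!buf || fd < 0) {
		perror("setup");
		return 1;
	}
	memset(buf, 'a', CHUNK);

	for (i = 0; i < 20000; i++) {	/* loop count is arbitrary */
		/* seek to EOF, append 512KB, then force allocation with fsync */
		if (lseek(fd, 0, SEEK_END) < 0 ||
		    write(fd, buf, CHUNK) != CHUNK ||
		    fsync(fd) < 0) {
			perror("write/fsync");
			return 1;
		}
	}
	close(fd);
	free(buf);
	return 0;
}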

> Admittedly, MySQL does not close the files, but it writes out
> infrequently enough that I couldn't make a good and small test case
> for it. The output of xfs_bmap is exactly the same as that of
> pwrite.c, though.

Once you've fragmented free space, the only way to defrag it is to
remove whatever is using the space between the small freespace
extents. Usually the condition occurs when you intermix long lived
files with short lived files - removing the short lived files
results in fragmented free space that cannot be made contiguous
until both the short lived and long lived data has been removed.

If you want an idea of whether you've fragmented free space, use
the xfs_db "freesp" command. To see what each AG looks like
(change the loop to iterate over all the AGs in your fs):

$ for i in 0 1 2 3; do echo "*** AG $i:" ; sudo xfs_db -c "freesp -a $i -s" /dev/vda; done
*** AG 0:
   from      to extents  blocks    pct
      1       1     129     129   0.02
      2       3     119     283   0.05
      4       7     125     641   0.11
      8      15      93     944   0.16
     16      31      64    1368   0.23
     32      63      53    2300   0.39
     64     127      21    1942   0.33
    128     255      16    3145   0.53
    256     511       6    1678   0.28
    512    1023       1     680   0.11
  16384   32767       1   23032   3.87
 524288 1048576       1  558825  93.93
total free extents 629
total free blocks 594967
average free extent size 945.893
*** AG 1:
   from      to extents  blocks    pct
      1       1     123     123   0.01
      2       3     125     305   0.04
      4       7      79     418   0.05
......

And that will tell us what state your filesystem is in w.r.t.
freespace fragmentation...

> >> When I add a posix_fallocate before calling pwrite() as shown here
> >> https://gist.github.com/keyurdg/eb504864d27ebfe7b40a the file
> >> fragments an order of magnitude less (~30 extents), and cat'ing to
> >> /dev/null proceeds at ~1GBps.
> >
> > That should make no difference on XFS as you are only preallocating
> > the 512KB region beyond EOF that you are about to write into and
> > hence both delayed allocation and preallocation have the same
> > allocation target (the current EOF block). Hence in both cases the
> > allocation patterns should be identical if the freespace extent they
> > are being allocated out of are identical.
> >
> > Did you remove the previous test files and sync the filesystem
> > between test runs so that the available freespace was identical for
> > the different test runs? If you didn't then the filesystem allocated
> > the files out of different free space extents and hence you'll get
> > different allocation patterns...
> 
> I do clear everything and sync the FS before every run, and this is
> reproducible across multiple machines in our cluster.

Which indicates that you've probably already completely fragmented
free space in the filesystems.

> I've re-run the
> programs at least a 1000 times now, and every time get the same
> results. For some reason even the tiny 512KB fallocate() seems to be
> triggering some form of extent "merging" and placement.

Both methods of allocation should be doing the same thing - they use
exactly the same algorithm to select the next extent to allocate.
Can you tell me:

	a) the inode number of each of the target files that show
	different output
	b) the xfs_bmap output of the different files.

> > Alternatively, set an extent size hint on the log files to define
> > the minimum sized allocation (e.g. 32MB) and this will limit
> > fragmentation without you having to modify the MySQL code at all...
> >
> 
> I tried setting extsize to 32MB, but it seems to make no difference.
> [kgovande@host]# xfs_io -c "extsize" /var/lib/mysql/xfs/plain_pwrite.werr
> [33554432] /var/lib/mysql/xfs/plain_pwrite.werr
> [kgovande@host]# xfs_bmap -v /var/lib/mysql/xfs/*.werr  | wc -l
> 20001
> [kgovande@host]# sync; echo 3 > /proc/sys/vm/drop_caches; pv plain_pwrite.werr > /dev/null
> 9.77GB 0:02:41 [61.7MB/s] [========================================>] 100%

Ah, extent size hints are not being considered in
xfs_can_free_eofblocks(). I suspect they should be, and that would
fix the problem.

Can you add this to xfs_can_free_eofblocks() in your kernel and see
what happens?


 	/* prealloc/delalloc exists only on regular files */
 	if (!S_ISREG(ip->i_d.di_mode))
 		return false;
 
+	if (xfs_get_extsz_hint(ip))
+		return false;
+
 	/*
 	 * Zero sized files with no cached pages and delalloc blocks will not
 	 * have speculative prealloc/delalloc blocks to remove.
 	 */

If that solves the problem, then I suspect that we might need to
modify this code to take into account the allocsize mount option as
well...
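
(Separate from the kernel-side fix: the extent size hint that was set
above with xfs_io can also be set programmatically at file-creation
time via the XFS_IOC_FSSETXATTR ioctl. A minimal sketch, assuming the
xfsprogs development headers are installed; the helper name is made
up, and the hint generally has to be set while the file is still
empty, since changing it once extents have been allocated is
rejected.)

#include <sys/ioctl.h>
#include <xfs/xfs.h>	/* struct fsxattr, XFS_IOC_FSGETXATTR/FSSETXATTR, XFS_XFLAG_EXTSIZE */

/* Programmatic equivalent of "xfs_io -c 'extsize 32m' <file>".
 * Sketch only: set the hint on a freshly created, still-empty file. */
static int set_extsize_hint(int fd, unsigned int extsize_bytes)
{
	struct fsxattr fsx;

	if (ioctl(fd, XFS_IOC_FSGETXATTR, &fsx) < 0)
		return -1;
	fsx.fsx_xflags |= XFS_XFLAG_EXTSIZE;
	fsx.fsx_extsize = extsize_bytes;	/* e.g. 32 * 1024 * 1024; multiple of fs block size */
	return ioctl(fd, XFS_IOC_FSSETXATTR, &fsx);
}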

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs



