Re: Making discard/fstrim reliable

On Wed, Apr 02, 2014 at 02:18:40PM -0400, Jeff Moyer wrote:
> "Richard W.M. Jones" <rjones@xxxxxxxxxx> writes:
> 
> Hi, Richard,
> 
> > virt-sparsify is a tool for trimming free space in virtual disk
> > images.  The new implementation uses vfs/kernel/qemu discard support.
> > Essentially it does:
> >
> 
> Presumably there's a "start guest" step here that's missing?

Yup, it starts up a small appliance to do these operations.

> >   for each filesystem:
> >     mount -o discard $fs /mnt
> 
> What is $fs?  Do you pass in a list of devices?

Yes and no.  We examine the partitions, logical volumes and so on in
order to get a list of mountable filesystems, and then the list is
iterated over in this loop.  The precise code for finding the
filesystems is here:

https://github.com/libguestfs/libguestfs/blob/master/src/listfs.c#L45

^ That code runs on the host side.  It issues calls to the appliance,
which are executed by code spread across several files here:

https://github.com/libguestfs/libguestfs/tree/master/daemon

> Also, you don't need to mount with -o discard in order to use fstrim.
> In fact, I'd recommend against doing that.
> 
> >     sync
> 
> Interesting.  Have you seen mount dirty inodes or something?

The sync is not actually material to this problem.  I included it for
completeness because it is an effective workaround for a different
unreliability case: if you delete some files just before doing the
fstrim, ext4 can be slow enough to release their blocks that the space
from the removed files is not returned to the host.  The relevant code
is:

https://github.com/libguestfs/libguestfs/blob/master/daemon/fstrim.c#L53

> >     fstrim /mnt
> >     umount /mnt
> >   sync
> >   # qemu is killed after sync returns
> >
> > Although typing these commands by hand works fine, when you run them
> > from a program the fstrim doesn't happen all the way down the stack
> > reliably.  Mostly it works, but sometimes it only trims some space
> > from the host file.
> 
> What is in the stack?  Are you using qcow2 images, plain files, device
> mapper, anything else?

In the test case it is recent kernel -> virtio-scsi -> qemu -> raw
format local file stored on host filesystem (ext4 on the test machine).

> Which file systems are you testing, and are they
> used in the host, the guest or both?

ext4 guest and host in this case.

> How are you checking for success?

We measure the space allocated to the file (stat.st_blocks) on the
host during the test.

There are various thresholds which count as success (see test script
linked below).  When it fails, it discards hardly any blocks, although
it does discard some.
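
As a rough illustration of this kind of check (not the code from the
test script), the sketch below reads the allocated size from st_blocks
and compares it before and after a trim.  The function names and the
50% threshold are hypothetical, and it assumes Linux, where st_blocks
counts 512-byte units:

```python
import os
import tempfile


def allocated_bytes(path):
    """Bytes actually allocated to the file on the host filesystem."""
    # On Linux st_blocks is counted in 512-byte units, whatever the
    # filesystem block size is.
    return os.stat(path).st_blocks * 512


def trim_succeeded(before, after, min_fraction_freed=0.5):
    """Hypothetical success rule: at least min_fraction_freed of the
    space allocated before the trim must have been released."""
    if before == 0:
        return True  # nothing was allocated, so nothing could be freed
    return (before - after) >= min_fraction_freed * before


if __name__ == "__main__":
    # Demo on a scratch file; a real test would stat the disk image.
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"x" * (1 << 20))
        f.flush()
        os.fsync(f.fileno())
        print(allocated_bytes(f.name))
```

Comparing st_blocks rather than st_size matters here because trimming
a raw image leaves its apparent size unchanged and only reduces the
number of blocks allocated to it.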

> Do you have a golden image you start with so that your test case is
> repeatable?

We create images on the fly, but yes I'm confident that the test is
repeatable (although that doesn't mean it is failing on every run --
it's a race condition of some sort).  The test code is here:

https://github.com/libguestfs/libguestfs/blob/master/tests/discard/test-fstrim.pl

> > It appears that when the host is slow / under load, the problem
> > happens more frequently.  Also it may happen more frequently on i686
> > than on x86-64 (possibly also due to speed of host).
> 
> I don't know of any reason that any of the variables you listed would
> affect the reliability at all.  As far as I can tell, fstrim is a
> synchronous ioctl.  I believe the only reason space wouldn't be freed is
> if the fs is fragmented in such a way as to not meet the minimum trim
> granularity of the underlying device.

It's a freshly created filesystem so I guess it's not likely to be
fragmented.
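
To make the granularity point concrete, here is a deliberately
simplified model: a free extent is only worth discarding if it is at
least as long as the minimum trim length.  The extent list and the
function name are made up for illustration, and the model ignores
alignment, which also matters on real devices:

```python
def trimmable_bytes(free_extents, min_len):
    """Sum the lengths of free extents that are at least min_len bytes.

    free_extents is a list of (start, length) pairs in bytes.
    """
    return sum(length for _start, length in free_extents if length >= min_len)


# Three free extents of 4 KiB, 64 KiB and 512 KiB.
free = [(0, 4096), (1 << 20, 64 << 10), (8 << 20, 512 << 10)]

small_min = trimmable_bytes(free, 64 << 10)   # the 4 KiB extent is skipped
large_min = trimmable_bytes(free, 1 << 20)    # no extent is long enough
```

In this model a heavily fragmented filesystem with a large minimum
length frees nothing, whereas a freshly created filesystem has a few
large free extents and nearly all of its free space qualifies.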

I suspect it's something to do with how we kill qemu: requests are
probably still in flight somewhere.  I'm just not sure how to sync
"enough" to make sure everything has reached the host.  FWIW here is
the elaborate sync dance we currently do to work around bugs present
and past:

https://github.com/libguestfs/libguestfs/blob/master/daemon/sync.c#L54
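
This is not a copy of sync.c, but the general shape of such a dance (a
global sync followed by an fsync on each block device) can be sketched
as follows.  The device-name prefixes and both function names are
assumptions made for the example:

```python
import os


def candidate_devices(dev_dir="/dev", prefixes=("sd", "vd", "hd")):
    """List entries in dev_dir whose names start with one of prefixes."""
    return sorted(
        os.path.join(dev_dir, name)
        for name in os.listdir(dev_dir)
        if name.startswith(prefixes)
    )


def flush_everything(paths):
    """sync(), then fsync() each path; return the paths that were flushed."""
    os.sync()  # ask the kernel to write out all dirty data
    flushed = []
    for path in paths:
        try:
            fd = os.open(path, os.O_RDONLY)
        except OSError:
            continue  # node vanished or is not accessible; skip it
        try:
            os.fsync(fd)
            flushed.append(path)
        except OSError:
            pass  # some nodes do not support fsync; ignore them
        finally:
            os.close(fd)
    return flushed
```

Inside the appliance this would be called as
flush_everything(candidate_devices()) just before shutdown.  It only
covers the guest side, so it cannot by itself guarantee that qemu has
finished passing every request down to the host file.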

Thanks,

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming blog: http://rwmj.wordpress.com
Fedora now supports 80 OCaml packages (the OPEN alternative to F#)