G'day,

A number of people have already seen this; I'm posting for wider
comment and to move some interesting discussion to a public list.
I'll apologise in advance for the talk about SGI technologies
(including proprietary ones), but all the problems mentioned apply to
in-tree technologies too.

This proposal seeks to solve three problems in our NAS server product
due to the interaction of the filesystem (XFS) and the block-based
snapshot feature (XVM snapshot).  It's based on discussions held with
various people over the last few weeks, including Roger Strassburg,
Christoph Hellwig, David Chinner, and Donald Douwsma.

a) The first problem is the server's behaviour when a filesystem
which is subject to snapshot is written to and the snapshot
repository runs out of room.  The failure mode can be quite severe.
XFS issues a metadata write to the block device, triggering a
Copy-On-Write operation in the XVM snapshot element, which fails
with EIO because the repository is full.  When XFS sees the failure
it shuts down the filesystem.  All subsequent attempts to perform IO
to the filesystem block indefinitely.  In particular, any NFS server
thread will block and never reply to the NFS client.  The NFS client
will retry, causing another NFS server thread to block, and so on
until every NFS server thread is blocked.  At this point all NFS
service for all filesystems ceases.

See PV 958220 and PV 958140 for a description of this problem and
some of the approaches which have been discussed for resolving it.

b) The second problem is that certain common combinations of
filesystem operations can waste large amounts of space in the XVM
snapshot repository.  Examples include writing the same file twice
with dd, or writing a new file and then deleting it.  The cause is
the inability of the XVM snapshot code to free regions of the
snapshot repository that are no longer in use by the filesystem;
this information is simply not available within the block layer.

Note that problem b) also contributes to problem a) by increasing
repository usage and thus making it easier to encounter an
out-of-space condition on the repository.

c) The third problem is an unfortunate interaction between an XFS
internal log and block snapshots.  The log is a fixed region of the
block device which is written as a side effect of a great many
different filesystem operations.  The information written there has
no value and is not even read until and unless log recovery needs to
be performed after the server has crashed.  This means the log does
not need to be preserved by the block snapshot feature (because at
the point in time when the snapshot is taken, log recovery must
already have happened).  In fact the correct procedure when mounting
a read-only snapshot is to use the "norecovery" option to prevent
any attempt to read the log (although the NAS server software
doesn't actually do this).

However, because the block device layer doesn't have enough
information to know any better, the first pass of writes to the log
after a snapshot is taken is subjected to Copy-On-Write.  This has
two undesirable effects.  Firstly, it increases the amount of
snapshot repository space used by each snapshot, thus contributing
to problem a).  Secondly, it puts a significant performance penalty
on filesystem metadata operations for some time after each snapshot
is taken; given that the NAS server can be configured to take
regular frequent snapshots, this may mean all of the time.
An obvious solution is to use an external XFS log, but this is quite
inconvenient for the NAS server software to arrange.  For one thing,
we would need to construct a separate external log device for the
main filesystem and one for each mounted snapshot.

Note that these problems are not specific to XVM but will be
encountered by any Linux block-COWing snapshot implementation.  For
example, the DM snapshot implementation is documented to suffer from
problem a).  From linux/Documentation/device-mapper/snapshot.txt:

> <COW device> will often be smaller than the origin and if it
> fills up the snapshot will become useless and be disabled,
> returning errors.  So it is important to monitor the amount of
> free space and expand the <COW device> before it fills up.

During discussions, it became clear that we could solve all three of
these problems by improving the block device interface to allow a
filesystem to provide the block device with dynamic block usage
hints.  For example, when unlinking a file the filesystem could give
the block device a hint of the form "I'm about to stop using these
blocks".  Most block devices would silently ignore such hints, but a
snapshot COW implementation (the "copy-on-write" XVM element or the
"snapshot-origin" dm target) could use them to help avoid these
problems.  For example, its response to the "I'm about to stop using
these blocks" hint could be to free the space used in the snapshot
repository for now-unnecessary copies of those blocks.

Of course, snapshot COW elements may be part of more generic element
trees.  In general there may be more than one consumer of block
usage hints in a given filesystem's element tree, and their
locations in that tree are not predictable.  This means the block
extents mentioned in the usage hints need to be subject to the block
mapping algorithms provided by the element tree.  As those
algorithms are currently implemented using bio mapping and
splitting, the easiest and simplest way to reuse them is to add new
bio flags.

First we need a mechanism to indicate that a bio is a hint rather
than a real IO.  Perhaps the easiest way is to add a new flag to the
bi_rw field:

#define BIO_RW_HINT	5	/* bio is a hint not a real io; no pages */

We'll also need a field to tell us which kind of hint the bio
represents.  Perhaps a new field could be added, or perhaps the top
16 bits of bi_rw (currently used to encode the bio's priority, which
has no meaning for hint bios) could be reused.  The latter approach
may allow hints to be used without modifying the bio structure or
any code that uses it, other than the filesystem and the snapshot
implementation.  Such a property would have obvious advantages for
our NAS server software, where the XFS and XVM modules are provided
by SGI but the other users of struct bio are stock SLES code.

Next we'll need three bio hint types with the following semantics.

BIO_HINT_ALLOCATE
    The bio's block extent will soon be written by the filesystem,
    and any COW that may be necessary to achieve that should begin
    now.  If the COW is going to fail, the bio should fail.  Note
    that this provides a way for the filesystem to manage when and
    how failures to COW are reported.

BIO_HINT_RELEASE
    The bio's block extent is no longer in use by the filesystem and
    will not be read in the future.  Any storage used to back the
    extent may be released without any threat to filesystem or data
    integrity.

BIO_HINT_DONTCOW
    (the Bart Simpson BIO).  The bio's block extent is not needed in
    mounted snapshots and does not need to be subjected to COW.
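To make the bit layout concrete, here's a rough sketch of what the
definitions and accessors might look like, assuming the 2.6-era bio
interface.  None of these names exist in any kernel tree yet: the
BIO_HINT_* values, BIO_HINT_SHIFT, and the two helpers below are all
invented for illustration.

/*
 * Hypothetical sketch only.  The hint type lives in the top 16 bits
 * of bi_rw, which a hint bio would otherwise waste on an io
 * priority it has no use for.
 */
#define BIO_RW_HINT		5	/* bio is a hint not a real io; no pages */
#define BIO_HINT_SHIFT		(BITS_PER_LONG - 16)

#define BIO_HINT_ALLOCATE	1	/* extent will be written soon: COW now */
#define BIO_HINT_RELEASE	2	/* extent no longer used by the fs */
#define BIO_HINT_DONTCOW	3	/* extent never needs to be COWed */

static inline int bio_hint(struct bio *bio)
{
	if (!(bio->bi_rw & (1 << BIO_RW_HINT)))
		return 0;			/* a real io, not a hint */
	return (int)(bio->bi_rw >> BIO_HINT_SHIFT);
}

static inline void bio_set_hint(struct bio *bio, int hint)
{
	bio->bi_rw &= ~(~0UL << BIO_HINT_SHIFT);	/* clear priority bits */
	bio->bi_rw |= (1 << BIO_RW_HINT) |
		      ((unsigned long)hint << BIO_HINT_SHIFT);
}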
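On the filesystem side, issuing a hint could then be a small
variation on the usual pageless-bio pattern.  Again hypothetical:
blkdev_issue_hint() and hint_end_io() are names I've made up, the
synchronous completion convention is just one possible choice, and
generic_make_request would need to learn to pass pageless hint bios
down the stack.

#include <linux/bio.h>
#include <linux/completion.h>

/*
 * Sketch: submit a pageless hint bio for a block extent and wait.
 * For BIO_HINT_ALLOCATE, a failure here is where the filesystem
 * would learn of a full repository (ideally as ENOSPC rather than
 * the bare EIO shown below).
 */
static int hint_end_io(struct bio *bio, unsigned int bytes_done, int error)
{
	if (bio->bi_size)
		return 1;
	complete((struct completion *)bio->bi_private);
	return 0;
}

static int blkdev_issue_hint(struct block_device *bdev, sector_t sector,
			     unsigned int nsect, int hint)
{
	DECLARE_COMPLETION_ONSTACK(wait);
	struct bio *bio = bio_alloc(GFP_NOIO, 0);
	int err;

	bio->bi_bdev = bdev;
	bio->bi_sector = sector;
	bio->bi_size = nsect << 9;	/* extent length; no pages attached */
	bio->bi_end_io = hint_end_io;
	bio->bi_private = &wait;
	bio_set_hint(bio, hint);

	submit_bio(WRITE, bio);		/* hints travel down the write path */
	wait_for_completion(&wait);

	err = test_bit(BIO_UPTODATE, &bio->bi_flags) ? 0 : -EIO;
	bio_put(bio);
	return err;
}

So xfs_bunmapi() could call, say, blkdev_issue_hint(bdev, start,
len, BIO_HINT_RELEASE) as it unmaps extents, and mount could issue
BIO_HINT_DONTCOW for the internal log.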
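At the snapshot end, a COW element's mapping function would dispatch
on the hint type.  Once more a sketch only: struct snapshot and the
snap_* helpers are invented stand-ins for whatever XVM or
dm-snapshot would actually use.

/*
 * Hypothetical sketch of the consuming side.
 */
static int snapshot_map_hint(struct snapshot *s, struct bio *bio)
{
	sector_t sector = bio->bi_sector;
	unsigned int nsect = bio_sectors(bio);

	switch (bio_hint(bio)) {
	case BIO_HINT_ALLOCATE:
		/* Do any needed COW right now, so that a full
		 * repository is reported here, where the filesystem
		 * can fail gracefully, not during a later metadata
		 * write. */
		return snap_cow_now(s, sector, nsect);
	case BIO_HINT_RELEASE:
		/* The fs has stopped using these blocks; any copies
		 * of them held in the repository can be freed. */
		snap_release_copies(s, sector, nsect);
		return 0;
	case BIO_HINT_DONTCOW:
		/* Mark the extent (e.g. the XFS internal log) so
		 * that later writes bypass COW entirely. */
		snap_mark_nocow(s, sector, nsect);
		return 0;
	}
	return 0;	/* unknown hints are silently ignored */
}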
Here's how these proposed hints help solve the above-mentioned
problems.

Problem a): the filesystem gives the BIO_HINT_ALLOCATE hint to the
block device when preparing to write to blocks and when allocating
blocks.  The snapshot implementation checks whether a COW is
necessary, and if so performs it immediately.  If the COW fails due
to a lack of space in the snapshot repository, the bio fails.  This
can be caught in the filesystem and reported to userspace (or the
NFS server) as ENOSPC via the existing mechanisms.  Filesystem
shutdown is no longer necessary.

Problem b) is solved by the filesystem giving the BIO_HINT_RELEASE
hint to the block device every time it unmaps blocks in
xfs_bunmapi.  The snapshot implementation can then free unnecessary
copies of those blocks.

Problem c) is solved by the filesystem giving the block device a
BIO_HINT_DONTCOW hint describing the block extent of the internal
log, at filesystem mount time.  The snapshot implementation marks
that extent, and subsequent writes to those blocks do not cause COWs
but proceed directly to the origin filesystem.

Comments?

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
Apparently, I'm Bedevere.  Which MPHG character are you?
I don't speak for SGI.