Re: Proposal to improve filesystem/block snapshot interaction

Neil Brown <neilb@xxxxxxx> · Tue, 30 Oct 2007 15:16:06 +1100

On Tuesday October 30, gnb@xxxxxxx wrote:
> 
> Of course snapshot cow elements may be part of more generic element
> trees.  In general there may be more than one consumer of block usage
> hints in a given filesystem's element tree, and their locations in that
> tree are not predictable.  This means the block extents mentioned in
> the usage hints need to be subject to the block mapping algorithms
> provided by the element tree.  As those algorithms are currently
> implemented using bio mapping and splitting, the easiest and simplest
> way to reuse those algorithms is to add new bio flags.

So are you imagining that you might have several distinct
snapshottable elements, and that some of these might be combined by
e.g. RAID0 into a larger device, with a filesystem then created on that?
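
To make that scenario concrete, here is a minimal userspace sketch of
what "subject to the block mapping algorithms" would mean for a hint
passing through a RAID0 layer: the extent is split at chunk boundaries
and each piece is redirected to a member device, exactly as a data bio
would be.  The struct, the function name and the simple striping
layout are assumptions made for illustration, not existing md or XVM
code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type: one extent on one member device of a RAID0 set. */
struct member_extent {
    unsigned member;   /* index of the member device */
    uint64_t sector;   /* start sector on that member */
    uint64_t nsectors; /* length in sectors */
};

/*
 * Split the extent [sector, sector + nsectors) of a RAID0 volume into
 * per-member pieces, the way a striping layer splits a real bio.
 * Returns the number of pieces written to out (at most max_out).
 */
size_t raid0_map_hint(uint64_t sector, uint64_t nsectors,
                      unsigned nmembers, uint64_t chunk,
                      struct member_extent *out, size_t max_out)
{
    size_t n = 0;

    assert(nmembers > 0 && chunk > 0);
    while (nsectors > 0 && n < max_out) {
        uint64_t chunk_no = sector / chunk; /* chunk index in the volume */
        uint64_t offset = sector % chunk;   /* offset inside that chunk */
        uint64_t len = chunk - offset;      /* room left in the chunk */

        if (len > nsectors)
            len = nsectors;
        out[n].member = (unsigned)(chunk_no % nmembers);
        out[n].sector = (chunk_no / nmembers) * chunk + offset;
        out[n].nsectors = len;
        n++;
        sector += len;
        nsectors -= len;
    }
    return n;
}
```

With two members and 8-sector chunks, a 10-sector hint starting at
sector 6 becomes two pieces: 2 sectors at sector 6 of member 0, and 8
sectors at sector 0 of member 1.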

I ask because my first thought was that the sort of communication you
want seems like it would be just between a filesystem and the block
device that it talks directly to, and as you are particularly
interested in XFS and XVM, you could come up with whatever protocol
you want for those two to talk to each other, prototype it, iron out
all the issues, then say "We've got this really cool thing to make
snapshots much faster - wanna share?"  and thus be presenting from a
position of more strength (the old 'code talks' mantra).

> 
> First we need a mechanism to indicate that a bio is a hint rather
> than a real IO.  Perhaps the easiest way is to add a new flag to
> the bi_rw field:
> 
> #define BIO_RW_HINT 	5   	/* bio is a hint not a real io; no pages */

Reminds me of the new approach to issue_flush_fn which is just to have
a zero-length barrier bio (is that implemented yet? I lost track).
It is different, though: a zero-length barrier has zero length,
whereas your hints have a very meaningful length.
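
A small userspace sketch of that distinction, assuming a heavily
reduced stand-in for struct bio (the toy_* names are hypothetical;
only the BIO_RW_HINT bit number comes from the proposal): a hint has
no pages attached but still describes a real extent, while an empty
barrier has neither.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BIO_RW_HINT 5 /* bit number proposed in the quoted text */

/* Hypothetical, heavily reduced stand-in for struct bio. */
struct toy_bio {
    unsigned long bi_rw; /* flag bits */
    uint64_t bi_sector;  /* start of the extent */
    uint32_t bi_size;    /* length of the extent in bytes */
    unsigned bi_vcnt;    /* number of attached pages */
};

/* True if the hint bit is set in bi_rw. */
bool toy_bio_is_hint(const struct toy_bio *bio)
{
    return (bio->bi_rw >> BIO_RW_HINT) & 1UL;
}

/* Build a hint: it carries an extent but no data pages. */
struct toy_bio toy_make_hint(uint64_t sector, uint32_t bytes)
{
    struct toy_bio bio = {
        .bi_rw = 1UL << BIO_RW_HINT,
        .bi_sector = sector,
        .bi_size = bytes,
        .bi_vcnt = 0,
    };

    assert(bytes > 0); /* unlike an empty barrier, the length matters */
    return bio;
}

/* A zero-length barrier, by contrast, has neither pages nor an extent. */
bool toy_bio_is_empty(const struct toy_bio *bio)
{
    return bio->bi_size == 0 && bio->bi_vcnt == 0;
}
```

Any code that currently treats "no pages" as "nothing to map or
split" would need to test the hint bit first, since a hint still has
to be remapped by its extent.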

> 
> Next we'll need three bio hint types with the following semantics.
> 
> BIO_HINT_ALLOCATE
>     The bio's block extent will soon be written by the filesystem
>     and any COW that may be necessary to achieve that should begin
>     now.  If the COW is going to fail, the bio should fail.  Note
>     that this provides a way for the filesystem to manage when and
>     how failures to COW are reported.

Would it make sense to allow the bi_sector to be changed by the device
and to have that change honoured?
i.e. "Please allocate 128 blocks, maybe 'here'" 
     "OK, 128 blocks allocated, but they are actually over 'there'".

If the device is tracking what space is and isn't used, it might make
life easier for it to do the allocation.  Maybe even have a variant
"Allocate 128 blocks, I don't care where".

Is this bio supposed to block until the copy has happened?  Or only
until the space for the copy has been allocated and possibly committed?
Or must it return without doing any IO at all?
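
The relocation idea could look roughly like this.  It is a toy
userspace model, assuming a device that tracks block usage itself;
toy_dev, toy_hint_allocate and TOY_ANYWHERE are hypothetical names,
and first fit is just a convenient placeholder policy.  The sketch
says nothing about the blocking questions above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NBLOCKS 16
#define TOY_ANYWHERE ((size_t)-1) /* hypothetical "I don't care where" */

/* Hypothetical device that tracks which blocks are in use. */
struct toy_dev {
    bool used[TOY_NBLOCKS];
};

/* True if [start, start + n) lies on the device and is entirely free. */
static bool range_free(const struct toy_dev *d, size_t start, size_t n)
{
    if (n > TOY_NBLOCKS || start > TOY_NBLOCKS - n)
        return false;
    for (size_t i = start; i < start + n; i++)
        if (d->used[i])
            return false;
    return true;
}

/*
 * Try to allocate n blocks at *where.  If that range is busy (or the
 * caller passed TOY_ANYWHERE), fall back to first fit and report the
 * chosen start through *where.  Returns 0 on success, -1 if no space.
 */
int toy_hint_allocate(struct toy_dev *d, size_t *where, size_t n)
{
    size_t start = *where;

    assert(n > 0);
    if (start == TOY_ANYWHERE || !range_free(d, start, n)) {
        for (start = 0; start < TOY_NBLOCKS; start++)
            if (range_free(d, start, n))
                break;
        if (start == TOY_NBLOCKS)
            return -1;
    }
    for (size_t i = start; i < start + n; i++)
        d->used[i] = true;
    *where = start;
    return 0;
}
```

The caller has to re-read *where after a successful call, which is the
equivalent of the filesystem honouring a changed bi_sector on
completion.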

> 
> BIO_HINT_RELEASE
>     The bio's block extent is no longer in use by the filesystem
>     and will not be read in the future.  Any storage used to back
>     the extent may be released without any threat to filesystem
>     or data integrity.

If the allocation unit of the storage device (e.g. a few MB) does not
match the allocation unit of the filesystem (e.g. a few KB) then for
this to be useful either the storage device must start recording tiny
allocations, or the filesystem should re-release areas as they grow.
i.e. when releasing a range of a device, look in the filesystem's usage
records for the largest surrounding free space, and release all of that.

Would this be a burden on the filesystems?
Is my imagined disparity between block sizes valid?
Would it be just as easy for the storage device to track small
allocation/deallocations?
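
For the "re-release as free areas grow" option, the filesystem side
might be sketched as below.  This is a userspace model with a plain
array standing in for the filesystem's usage records;
toy_widen_release is a hypothetical name, and trimming the widened
range to whole device allocation units is my assumption about what the
device would want to see.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * used[i] is true while filesystem block i is in use.  The range
 * [start, start + len) has just been freed, so its entries are
 * already false.  Grow the range over neighbouring free blocks, then
 * shrink it to whole device allocation units of `unit` blocks.
 * Returns true and fills *rel_start / *rel_len if at least one whole
 * unit can be released.
 */
bool toy_widen_release(const bool *used, size_t nblocks,
                       size_t start, size_t len, size_t unit,
                       size_t *rel_start, size_t *rel_len)
{
    size_t lo = start, hi = start + len; /* half-open [lo, hi) */

    assert(unit > 0 && len > 0 && hi <= nblocks);
    while (lo > 0 && !used[lo - 1])
        lo--;
    while (hi < nblocks && !used[hi])
        hi++;

    lo = (lo + unit - 1) / unit * unit; /* round start up to a unit */
    hi = hi / unit * unit;              /* round end down to a unit */
    if (lo >= hi)
        return false;
    *rel_start = lo;
    *rel_len = hi - lo;
    return true;
}
```

With 4-block device units and blocks 2..9 free, freeing block 5 yields
a release of blocks 4..7; if block 5 is the only free block, nothing
is released at all.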

> 
> BIO_HINT_DONTCOW
>     (the Bart Simpson BIO).  The bio's block extent is not needed
>     in mounted snapshots and does not need to be subjected to COW.

This seems like a much more domain-specific function than the other
two, which themselves could be more generally useful (I'm imagining
using hints from them to e.g. accelerate RAID reconstruction).

Surely the "correct" thing to do with the log is to put it on a separate
device which itself isn't snapshotted.

If you have a storage manager that is smart enough to handle these
sorts of things, maybe the functionality you want is "Give me a
subordinate device which is not snapshotted, size X", then journal to
that virtual device.
I guess that is equally domain specific, but the difference is that if
you try to read from the DONTCOW part of the snapshot, you get bad
old data, whereas if you try to access the subordinate device of a
snapshot, you get an IO error - which is probably safer.
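
A toy model of the two behaviours, assuming one byte per block and a
single snapshot (all toy_* names are hypothetical, and "subordinate
device" is modelled simply as a region the snapshot refuses to serve):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_BLOCKS 4

/* Hypothetical origin volume plus one snapshot of it. */
struct toy_vol {
    char origin[TOY_BLOCKS];  /* live data, one byte per block */
    char saved[TOY_BLOCKS];   /* COW copies taken for the snapshot */
    bool copied[TOY_BLOCKS];  /* true once a block has been COWed */
    bool dontcow[TOY_BLOCKS]; /* blocks exempted from COW */
};

/* Write to the origin, preserving the old data unless exempted. */
void toy_origin_write(struct toy_vol *v, size_t blk, char data)
{
    assert(blk < TOY_BLOCKS);
    if (!v->dontcow[blk] && !v->copied[blk]) {
        v->saved[blk] = v->origin[blk];
        v->copied[blk] = true;
    }
    v->origin[blk] = data;
}

/* DONTCOW: snapshot reads of exempt blocks fall through to the origin. */
int toy_snap_read_dontcow(const struct toy_vol *v, size_t blk, char *out)
{
    assert(blk < TOY_BLOCKS);
    *out = v->copied[blk] ? v->saved[blk] : v->origin[blk];
    return 0;
}

/* Subordinate device: the exempt area is absent from the snapshot. */
int toy_snap_read_subdev(const struct toy_vol *v, size_t blk, char *out)
{
    assert(blk < TOY_BLOCKS);
    if (v->dontcow[blk])
        return -EIO;
    *out = v->copied[blk] ? v->saved[blk] : v->origin[blk];
    return 0;
}
```

After the origin overwrites an exempt block, the first reader silently
returns data that does not match the rest of the snapshot, while the
second fails loudly with -EIO.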

> 
> Comments?

On the whole it seems reasonably sane .... providing you are from the
school which believes that volume managers and filesystems should be
kept separate :-)

NeilBrown
