Re: Proposal to improve filesystem/block snapshot interaction

Greg,

Sorry I didn't respond sooner - other things have gotten in the way of reading this thread.

See comments below.

Roger

Greg Banks wrote:
> On Tue, Oct 30, 2007 at 03:16:06PM +1100, Neil Brown wrote:
>> On Tuesday October 30, gnb@xxxxxxx wrote:
>>> Of course snapshot cow elements may be part of more generic element
>>> trees.  In general there may be more than one consumer of block usage
>>> hints in a given filesystem's element tree, and their locations in that
>>> tree are not predictable.  This means the block extents mentioned in
>>> the usage hints need to be subject to the block mapping algorithms
>>> provided by the element tree.  As those algorithms are currently
>>> implemented using bio mapping and splitting, the easiest and simplest
>>> way to reuse those algorithms is to add new bio flags.
>> So are you imagining that you might have distinct snapshotable
>> elements, and that some of these might be combined by e.g. RAID0 into
>> a larger device, then a filesystem is created on that?
> 
> I was thinking more a concatenation than a stripe, but yes you could
> do such a thing, e.g. to parallelise the COW procedure.  We don't do
> any such thing in our product; the COW element is always inserted at
> the top of the logical element tree.
> 
>> I ask because my first thought was that the sort of communication you
>> want seems like it would be just between a filesystem and the block
>> device that it talks directly to, and as you are particularly
>> interested in XFS and XVM, you could come up with whatever protocol
>> you want for those two to talk to each other, prototype it, iron out
>> all the issues, then say "We've got this really cool thing to make
>> snapshots much faster - wanna share?"  and thus be presenting from a
>> position of more strength (the old 'code talks' mantra).
> 
> Indeed, code talks ;-)  I was hoping someone else would do that
> talking for me, though.
> 
>>> First we need a mechanism to indicate that a bio is a hint rather
>>> than a real IO.  Perhaps the easiest way is to add a new flag to
>>> the bi_rw field:
>>>
>>> #define BIO_RW_HINT	5	/* bio is a hint, not a real IO; no pages */
>> Reminds me of the new approach to issue_flush_fn which is just to have
>> a zero-length barrier bio (is that implemented yet? I lost track).
>> But different as a zero length barrier has zero length, and your hints
>> have a very meaningful length.
> 
> Yes.
> 
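
Just to make the mechanics concrete, submitting one of these from the
filesystem side might look something like the sketch below.  It's only a
sketch: it assumes the hint type is encoded in bi_rw alongside
BIO_RW_HINT (the proposal doesn't pin the encoding down), and
fs_send_hint() itself is a made-up helper:

static void fs_send_hint(struct block_device *bdev, sector_t sector,
			 unsigned int nsectors, unsigned long hint_type,
			 bio_end_io_t *done, void *private)
{
	struct bio *bio = bio_alloc(GFP_NOIO, 0);	/* no pages attached */

	bio->bi_bdev    = bdev;
	bio->bi_sector  = sector;
	bio->bi_size    = nsectors << 9;	/* meaningful length, no data */
	bio->bi_end_io  = done;			/* caller's completion callback */
	bio->bi_private = private;

	submit_bio((1 << BIO_RW_HINT) | hint_type, bio);
}
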
>>> Next we'll need three bio hints types with the following semantics.
>>>
>>> BIO_HINT_ALLOCATE
>>>     The bio's block extent will soon be written by the filesystem
>>>     and any COW that may be necessary to achieve that should begin
>>>     now.  If the COW is going to fail, the bio should fail.  Note
>>>     that this provides a way for the filesystem to manage when and
>>>     how failures to COW are reported.
>> Would it make sense to allow the bi_sector to be changed by the device
>> and to have that change honoured.
>> i.e. "Please allocate 128 blocks, maybe 'here'" 
>>      "OK, 128 blocks allocated, but they are actually over 'there'".
> 
> That wasn't the expectation at all.  Perhaps "allocate" is a poor
> name.  "I have just allocated, deal with it" might be more appropriate.
> Perhaps BIO_HINT_WILLUSE or something.
> 
>> If the device is tracking what space is and isn't used, it might make
>> life easier for it to do the allocation.  Maybe even have a variant
>> "Allocate 128 blocks, I don't care where".
> 
> That kind of thing might perhaps be useful for flash, but I think
> current filesystems would have conniptions.
> 
>> Is this bio supposed to block until the copy has happened?  Or only
>> until the space of the copy has been allocated and possibly committed?
> 
> The latter.  The writes following will block until the COW has
> completed, or might be performed sufficiently later that the COW
> has meanwhile completed (I think this implies an extra state in the
> snapshot metadata to avoid double-COWing).  The point of the hint is
> to allow the snapshot code to test for running out of repo space and
> report that failure at a time when the filesystem is able to handle
> it gracefully.
> 
>> Or must it return without doing any IO at all?
> 
> I would expect it would be a useful optimisation to start the IO but
> not wait for its completion, but that the first implementation would
> just do a space check.
> 
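
A first cut of that space check in the snapshot element could look like
the sketch below.  snap_chunk_lookup() and snap_repo_reserve() are
hypothetical stand-ins for whatever the real COW metadata provides, and
CHUNK_RESERVED is the extra state mentioned above for avoiding a
double COW:

enum cow_chunk_state {
	CHUNK_UNCOPIED,		/* no COW done or reserved yet */
	CHUNK_RESERVED,		/* repo space reserved by an allocate hint */
	CHUNK_COPIED,		/* old data already copied to the repo */
};

static int snap_handle_allocate_hint(struct snap_dev *snap, sector_t sector)
{
	struct cow_chunk *c = snap_chunk_lookup(snap, sector);

	if (c->state != CHUNK_UNCOPIED)
		return 0;		/* already reserved or copied */

	if (!snap_repo_reserve(snap, c))
		return -ENOSPC;		/* out of repo space: fail the hint bio */

	c->state = CHUNK_RESERVED;	/* later writes won't COW this twice */
	return 0;
}
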
>>> BIO_HINT_RELEASE
>>>     The bio's block extent is no longer in use by the filesystem
>>>     and will not be read in the future.  Any storage used to back
>>>     the extent may be released without any threat to filesystem
>>>     or data integrity.
>> If the allocation unit of the storage device (e.g. a few MB) does not
>> match the allocation unit of the filesystem (e.g. a few KB) then for
>> this to be useful either the storage device must start recording tiny
>> allocations, or the filesystem should re-release areas as they grow.
>> i.e. when releasing a range of a device, look in the filesystem's usage
>> records for the largest surrounding free space, and release all of that.
> 
> Good point.  I was planning on ignoring this problem :-/ Given that
> current snapshot implementations waste *all* the blocks in deleted
> files, it would be an improvement to scavenge the blocks in large
> extents.  This is especially true for XFS which goes to some effort
> to achieve large linear extents.
> 
>> Would this be a burden on the filesystems?
> 
> I think so.  I would hope the hints could be done in a way which
> minimises the impact on filesystems, so that it would be easier to roll
> out.  That implies pushing the responsibility for being smart about
> combining partial deallocations down to the block device/snapshot code.
> Any comments, Roger?

I'm not sure how a snapshot can really use a dealloc hint.  Whatever you're deallocating is in the base, but you want it to stay in the snapshot, since the purpose of a snapshot is to keep track of what was there before.

What makes more sense is to somehow pass a hint saying that the data being written is to space that wasn't allocated at the time the snapshot was created, but that would require the filesystem to have knowledge of the snapshot.  This would avoid copying blocks that never held meaningful data in the first place.

>> Is my imagined disparity between block sizes valid?
> 
> Yep, at least for XFS and XVM.  If the space was used in lots of
> little files, this rounding would probably eat a lot of the savings.
> 
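
If the smarts for combining partial deallocations do get pushed down
into the snapshot code, the simplest thing it can do is clip a released
extent to whole repo chunks and ignore the partial chunks at either
end.  A sketch, assuming a power-of-two chunk size and a hypothetical
snap_repo_release_chunk():

static void snap_handle_release_hint(struct snap_dev *snap,
				     sector_t start, sector_t len)
{
	sector_t chunk = snap->chunk_sectors;	/* repo allocation unit */
	sector_t first = (start + chunk - 1) & ~(chunk - 1);	/* round up */
	sector_t end   = (start + len) & ~(chunk - 1);		/* round down */

	for (; first < end; first += chunk)
		snap_repo_release_chunk(snap, first);	/* whole chunks only */
}

Scavenging the partial chunks as well would need the device to track
sub-chunk state, which is exactly the question below.
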
>> Would it be just as easy for the storage device to track small
>> allocation/deallocations?
>>
>>> BIO_HINT_DONTCOW
>>>     (the Bart Simpson BIO).  The bio's block extent is not needed
>>>     in mounted snapshots and does not need to be subjected to COW.
>> This seems like a much more domain-specific function than the other
>> two, which themselves could be more generally useful
> 
> Agreed, I can't offhand think of a use other than internal logs.
> 
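
As I read the proposal, the DONTCOW hint would just mark the extent in
the snapshot metadata, and the write path would consult that mark.
Roughly, with hypothetical range helpers standing in for whatever
extent map the snapshot keeps:

static int snap_handle_dontcow_hint(struct snap_dev *snap,
				    sector_t start, sector_t len)
{
	/* Remember that this extent never needs COW. */
	return snap_range_insert(&snap->nocow_ranges, start, len);
}

static int snap_write(struct snap_dev *snap, struct bio *bio)
{
	/* Writes inside a DONTCOW extent go straight to the base device. */
	if (snap_range_contains(&snap->nocow_ranges,
				bio->bi_sector, bio_sectors(bio)))
		return snap_write_to_base(snap, bio);

	return snap_cow_then_write(snap, bio);	/* normal COW path */
}
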
>> (I'm imagining
>> using hints from them to e.g. accelerate RAID reconstruction).
> 
> Ah, interesting idea: delete a file to speed up RAID recovery ;-)
> 
>> Surely the "correct" thing to do with the log is to put it on a separate
>> device which itself isn't snapshotted.
> 
> Indeed.
> 
>> If you have a storage manager that is smart enough to handle these
>> sorts of things, maybe the functionality you want is "Give me a
>> subordinate device which is not snapshotted, size X", then journal to
>> that virtual device.
> 
> This is usually better, but is not always convenient for a number of
> reasons.  For example, you might not have enough disks to build all
> of a base, a snapshot repo, and a log device.  Also, the log really
> needs to be safe, so you want it mirrored or RAID5, and you want it
> fast, and you want it on separate spindles, so it needs several disks;
> but now you're using terabytes of disk space for 128 MiB of log.

The log doesn't need to be on a separate disk, just a separate logical volume.  Also, you don't have to mirror the whole disk in order to mirror the log volume.  Snapshots are done per logical volume, not per physical disk.

>> I guess that is equally domain specific, but the difference is that if
>> you try to read from the DONTCOW part of the snapshot, you get bad
>> old data, whereas if you try to access the subordinate device of a
>> snapshot, you get an IO error - which is probably safer.
> 
> I believe (Dave or Roger will correct me here) that XFS needs a log
> when you mount, and you get to either provide an external one or use
> the internal one.  So when you mount a snapshot of an XFS filesystem
> which was built with an external log, you need to provide a new
> external log device.  So the storage manager needs to allocate an
> external log device for each snapshot it allows.

That's correct.
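
In practice each snapshot mount then looks something like this (device
names made up; logdev= is the standard XFS external-log mount option),
so the storage manager has to have carved out the log volume
beforehand:

	mount -t xfs -o logdev=/dev/xvm/snap0_log /dev/xvm/snap0 /mnt/snap0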

>>> Comments?
>> On the whole it seems reasonably sane ... providing you are from the
>> school which believes that volume managers and filesystems should be
>> kept separate :-)
> 
> Yeah, I'm so old-school :-)
> 
> Greg.


-- 
Roger Strassburg  SGI Storage Systems Software  +49-89-46108-142
