Re: Proposal to improve filesystem/block snapshot interaction

On Tue, Oct 30, 2007 at 12:51:47AM +0100, Arnd Bergmann wrote:
> On Monday 29 October 2007, Christoph Hellwig wrote:
> > ----- Forwarded message from Greg Banks <gnb@xxxxxxx> -----
> > 
> > Date: Thu, 27 Sep 2007 16:31:13 +1000
> > From: Greg Banks <gnb@xxxxxxx>
> > Subject: Proposal to improve filesystem/block snapshot interaction
> > To: David Chinner <dgc@xxxxxxxxxxxxxxxxx>, Donald Douwsma <donaldd@xxxxxxx>,
> >         Christoph Hellwig <hch@xxxxxxxxxxxxx>, Roger Strassburg <rls@xxxxxxx>
> > Cc: Mark Goodwin <markgw@xxxxxxx>,
> >         Brett Jon Grandbois <brettg@xxxxxxxxxxxxxxxxx>
> > 
> > 
> > 
> > This proposal seeks to solve three problems in our NAS server product
> > due to the interaction of the filesystem (XFS) and the block-based
> > snapshot feature (XVM snapshot).  It's based on discussions held with
> > various people over the last few weeks, including Roger Strassburg,
> > Christoph Hellwig, David Chinner, and Donald Douwsma.
> 
> Hi Greg,
> 
> Christoph forwarded me your mail, because I mentioned to him that
> I'm trying to come up with a similar change, and it might make sense
> to combine our efforts.

Excellent, thanks Christoph ;-)


> 
> > For example, when unlinking a file the filesystem could tell the
> > block device a hint of the form "I'm about to stop using these
> > blocks".  Most block devices would silently ignore these hints, but
> > a snapshot COW implementation (the "copy-on-write" XVM element or
> > the "snapshot-origin" dm target) could use them to help avoid these
> > problems.  For example, the response to the "I'm about to stop using
> > these blocks" hint could be to free the space used in the snapshot
> > repository for unnecessary copies of those blocks.
> 
> The case I'm interested in is the more specific case of 'erase',
> which is more of a performance optimization than a space optimization.
> When you have a flash medium, it's useful to erase a block as soon
> as it's becoming unused, so that a subsequent write will be faster.
> Moreover, on an MTD medium, you may not even be able to write to
> a block unless it has been erased before.

So the device spends its time erasing early, when the CPU isn't waiting
for it, instead of later, when it adds to effective write latency.
Makes sense.

> > Of course snapshot cow elements may be part of more generic element
> > trees.  In general there may be more than one consumer of block usage
> > hints in a given filesystem's element tree, and their locations in that
> > tree are not predictable.  This means the block extents mentioned in
> > the usage hints need to be subject to the block mapping algorithms
> > provided by the element tree.  As those algorithms are currently
> > implemented using bio mapping and splitting, the easiest and simplest
> > way to reuse those algorithms is to add new bio flags.
> > 
> > First we need a mechanism to indicate that a bio is a hint rather
> > than a real IO.  Perhaps the easiest way is to add a new flag to
> > the bi_rw field:
> > 
> > #define BIO_RW_HINT 	5   	/* bio is a hint not a real io; no pages */
> 
> My first thought was to do this on the request layer, not already
> on bio, but they can easily be combined, I guess.

My first thoughts were along similar lines, but I wasn't expecting
these hint bios to survive deep enough in the stack to need queuing
and thus visibility in struct request; I was expecting their lifetime
to be some passage and splitting through a volume manager and then
conversion to synchronous metadata operations.  Plus, hijacking bios
means not having to modify every single DM target to duplicate its
block mapping algorithm.

Basically, I was thinking of loopback-like block mapping and not
considering flash.  I suppose for flash where there's a real erase
operation, you'd want to be queuing and that means a new request type.
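
To illustrate the "passage and splitting through a volume manager"
mentioned above: since a hint bio carries no pages, only its block
extent has to be remapped.  Below is a minimal userspace sketch,
assuming a simple concatenated volume.  All names (sector_t as a
plain typedef, struct segment, struct extent, map_extent) are
hypothetical and exist only for illustration; this is not XVM or DM code.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long long sector_t;	/* stand-in for the kernel type */

/* One physical piece produced by mapping a virtual extent. */
struct extent {
	int dev;
	sector_t start;
	sector_t len;
};

/* One leg of a hypothetical concatenated volume. */
struct segment {
	sector_t vstart;	/* first virtual sector covered by this leg */
	sector_t len;		/* number of sectors in this leg */
	int dev;		/* underlying device index */
	sector_t pstart;	/* first physical sector on that device */
};

/*
 * Map the virtual extent [start, start + len) onto the legs, splitting
 * it at leg boundaries.  Returns the number of pieces written to out,
 * or -1 if part of the extent is unmapped or out is too small.
 */
static int map_extent(const struct segment *segs, size_t nsegs,
		      sector_t start, sector_t len,
		      struct extent *out, size_t max)
{
	sector_t end = start + len;
	size_t n = 0;

	assert(out != NULL);
	while (start < end) {
		const struct segment *s = NULL;

		/* Find the leg that contains the current start sector. */
		for (size_t i = 0; i < nsegs; i++) {
			if (start >= segs[i].vstart &&
			    start < segs[i].vstart + segs[i].len) {
				s = &segs[i];
				break;
			}
		}
		if (s == NULL || n == max)
			return -1;

		/* Clip the piece at the end of this leg. */
		sector_t seg_end = s->vstart + s->len;
		sector_t chunk = (end < seg_end ? end : seg_end) - start;

		out[n].dev = s->dev;
		out[n].start = s->pstart + (start - s->vstart);
		out[n].len = chunk;
		n++;
		start += chunk;
	}
	return (int)n;
}
```

A hint covering virtual sectors 90..109 on a volume whose first leg
ends at sector 100 would come out as two pieces, one per leg, which is
the same splitting a real IO would see.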

> 
> > We'll also need a field to tell us which kind of hint the bio
> > represents.  Perhaps a new field could be added, or perhaps the top
> > 16 bits of bi_rw (currently used to encode the bio's priority, which
> > has no meaning for hint bios) could be reused.  The latter approach
> > may allow hints to be used without modifying the bio structure or
> > any code that uses it other than the filesystem and the snapshot
> > implementation.  Such a property would have obvious advantages for
> > our NAS server software, where XFS and XVM modules are provided but
> > the other users of struct bio are stock SLES code.
> > 
> > 
> > Next we'll need three bio hint types with the following semantics.
> > 
> > BIO_HINT_ALLOCATE
> >     The bio's block extent will soon be written by the filesystem
> >     and any COW that may be necessary to achieve that should begin
> >     now.  If the COW is going to fail, the bio should fail.  Note
> >     that this provides a way for the filesystem to manage when and
> >     how failures to COW are reported.
> > 
> > BIO_HINT_RELEASE
> >     The bio's block extent is no longer in use by the filesystem
> >     and will not be read in the future.  Any storage used to back
> >     the extent may be released without any threat to filesystem
> >     or data integrity.
> > 
> > BIO_HINT_DONTCOW
> >     (the Bart Simpson BIO).  The bio's block extent is not needed
> >     in mounted snapshots and does not need to be subjected to COW.
> >     
> 
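
A minimal userspace sketch of the encoding proposed above, assuming
the hint type is stored in the top 16 bits of bi_rw.  Only BIO_RW_HINT
and the three BIO_HINT_* names come from the proposal; struct mock_bio,
BIO_HINT_SHIFT, BIO_HINT_MASK, the numeric hint values and the helper
functions are hypothetical and exist only for illustration.

```c
#include <assert.h>

/* Userspace stand-in for the single field of struct bio used here. */
struct mock_bio {
	unsigned long bi_rw;
};

#define BIO_RW_HINT	5	/* from the proposal: bio is a hint, no pages */

/* Hypothetical: reuse the top 16 bits of bi_rw for the hint type. */
#define BIO_HINT_SHIFT	(8 * sizeof(unsigned long) - 16)
#define BIO_HINT_MASK	0xffffUL

/* Hypothetical values; the proposal only names the hint types. */
enum bio_hint {
	BIO_HINT_ALLOCATE = 1,
	BIO_HINT_RELEASE  = 2,
	BIO_HINT_DONTCOW  = 3,
};

/* Mark the bio as a hint and record which kind of hint it is. */
static void bio_set_hint(struct mock_bio *bio, enum bio_hint hint)
{
	assert(((unsigned long)hint & ~BIO_HINT_MASK) == 0);
	bio->bi_rw &= ~(BIO_HINT_MASK << BIO_HINT_SHIFT);
	bio->bi_rw |= 1UL << BIO_RW_HINT;
	bio->bi_rw |= (unsigned long)hint << BIO_HINT_SHIFT;
}

/* Non-zero if the bio is a hint rather than a real IO. */
static int bio_is_hint(const struct mock_bio *bio)
{
	return (bio->bi_rw & (1UL << BIO_RW_HINT)) != 0;
}

/* Extract the hint type; only meaningful when bio_is_hint() is true. */
static unsigned int bio_hint_type(const struct mock_bio *bio)
{
	return (unsigned int)((bio->bi_rw >> BIO_HINT_SHIFT) & BIO_HINT_MASK);
}
```

A consumer that does not understand hints would only need to test
bio_is_hint() and complete the bio without doing any IO.
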
> My code currently needs four flags, which don't match yours too much:
> 
> /*
>  * A number of different actions could be triggered by an erase request,
>  * depending on the underlying device. Each device specifies its
>  * capabilities with these flags, while a request specifies the options
>  * that are acceptable. If the logical AND from these two does not
>  * have any bits set, the request will result in
>  * an error.
>  */
> enum {
> 	/*
> 	 * Device may choose to ignore the request, subsequent writes
> 	 * may return the original data. This is meant to work on

Is this supposed to be "reads"?

> 	 * any block device. When combined with other flags, the driver
> 	 * should only perform an actual erase if it makes sense
> 	 * from a performance perspective, e.g. speeding up subsequent
> 	 * writes.
> 	 */
> 	LB_ERASE_IGNORE		= 0x01,
> 	/*
> 	 * A subsequent read may return zero data for the erase,
> 	 * like on some high-level abstractions for flash memory,
> 	 * or a virtual device.
> 	 */
> 	LB_ERASE_ALL_ZERO	= 0x02,
> 	/*
> 	 * A subsequent read may return a block filled with 0xff,
> 	 * which is the typical behavior on raw NAND flash.
> 	 */
> 	LB_ERASE_ALL_ONE	= 0x04,
> 	/*
> 	 * The device may reject a read request for an erased block
> 	 * until the block has been written again. This is typical
> 	 * for NAND flash with builtin ECC checks, or for optical
> 	 * drives.
> 	 */
> 	LB_ERASE_NUKE		= 0x08,
> 	/*
> 	 * Used by file systems that know that data is no longer
> 	 * in use and want to optimize the next write operations.
> 	 */
> 	LB_ERASE_DISCARD	= LB_ERASE_IGNORE | LB_ERASE_ALL_ZERO |
> 					LB_ERASE_ALL_ONE | LB_ERASE_NUKE,
> 	/*
> 	 * Used when we want the data to be invalidated and make sure
> 	 * it is no longer accessible.
> 	 */
> 	LB_ERASE_DESTROY	= LB_ERASE_ALL_ZERO | LB_ERASE_ALL_ONE |
> 					LB_ERASE_NUKE,
> };
> 
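
The capability matching described in the comment at the top of the
enum might look roughly like the following.  This is a sketch only:
lb_erase_negotiate() and the choice of -EOPNOTSUPP are assumptions
(the comment only says that the request results in an error); the
flag values are copied from the enum above.

```c
#include <assert.h>
#include <errno.h>

enum {
	LB_ERASE_IGNORE		= 0x01,
	LB_ERASE_ALL_ZERO	= 0x02,
	LB_ERASE_ALL_ONE	= 0x04,
	LB_ERASE_NUKE		= 0x08,
	LB_ERASE_DISCARD	= LB_ERASE_IGNORE | LB_ERASE_ALL_ZERO |
					LB_ERASE_ALL_ONE | LB_ERASE_NUKE,
	LB_ERASE_DESTROY	= LB_ERASE_ALL_ZERO | LB_ERASE_ALL_ONE |
					LB_ERASE_NUKE,
};

/*
 * Hypothetical helper: intersect what the device can do with what the
 * request finds acceptable.  Returns the set of actions both sides
 * agree on, or -EOPNOTSUPP (an assumed error code) if there is none.
 */
static int lb_erase_negotiate(unsigned int dev_caps, unsigned int req_opts)
{
	unsigned int common;

	/* A request may only use the flag bits defined above. */
	assert((req_opts & ~(unsigned int)LB_ERASE_DISCARD) == 0);

	common = dev_caps & req_opts;
	return common ? (int)common : -EOPNOTSUPP;
}
```

With these definitions, a plain disk advertising only LB_ERASE_IGNORE
accepts a DISCARD request but fails a DESTROY request, since DESTROY
does not include the IGNORE bit.
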
> I guess BIO_HINT_RELEASE would match LB_ERASE_DISCARD best,

Yep.

Actually, I'm curious why you'd want to expose, outside the block
driver, the semantics of reading a block which has been earlier
explicitly discarded.  Surely it's an error for a filesystem to
do that?  How does it help a filesystem to know in advance which
error case that will trigger?

> and perhaps
> there should be some bio flag with LB_ERASE_DESTROY semantics, although
> that doesn't really qualify as a hint any more.

Yes, that's more of a command ;-)

> My release command would be REQ_TYPE_LINUX_BLOCK/REQ_LB_OP_ERASE. Were
> you thinking of adding REQ_LB_* operations as well, or just encoding
> the hint in a REQ_TYPE_FS request?

I wasn't expecting a request to be created for the hint bio at all.

> Shall we move the discussion to a public mailing list? Feel free to
> forward my mail anywhere you like.

Done!

Greg.
-- 
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
Apparently, I'm Bedevere.  Which MPHG character are you?
I don't speak for SGI.
