Re: thin provisioned LUN support

Theodore Tso <tytso@xxxxxxx> · Fri, 7 Nov 2008 15:42:20 -0500

On Fri, Nov 07, 2008 at 01:21:49PM -0700, Matthew Wilcox wrote:
> 
> I think we would have a full-throated discussion about whether the
> right thing to do was to put the tracking in the block layer or in LVM.
> Rather similar to what we're doing now, in fact.

Agreed.  I'm just saying that what the array vendors are pushing for
is not totally unreasonable.  This problem can be separated into two
issues.  One is whether or not trim requests have to be 4 meg (or some
other size substantially bigger than filesystem block size) aligned,
and the other is whether the provisioning chunk size is 4 meg. 
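
To make the first issue concrete, here is a minimal sketch of what
"4 meg aligned" would mean for a freed byte range: only the chunks that
lie entirely inside the range can be trimmed.  The function name and the
4 meg constant are assumptions for illustration, not anything a
particular array requires.

```python
CHUNK_SIZE = 4 * 1024 * 1024   # assumed alignment required by the array


def aligned_trim_range(offset, length, chunk=CHUNK_SIZE):
    """Shrink a freed byte range inward to chunk boundaries.

    Returns (start, length) of the aligned part, or None if the range
    does not fully cover even one chunk.
    """
    start = -(-offset // chunk) * chunk          # round start up
    end = ((offset + length) // chunk) * chunk   # round end down
    if end <= start:
        return None
    return (start, end - start)
```

A freed range that starts 4k into a chunk and is exactly 4 meg long
covers no complete chunk, so nothing could be trimmed for it.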

The latter would still work best with filesystems which are aware of
this fact and try hard to allocate so as to keep as many 4 meg chunks
as possible completely unused, and which try very hard to allocate
from 4 meg chunks that are already partially used.
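
One way such a chunk-aware allocation policy could look, as a sketch
only: given a per-chunk count of used blocks, pick the fullest chunk
that still has room, so that untouched chunks are used last.  The
"fullest first" heuristic, the function name, and the 1024 blocks per
chunk (4 meg chunk, 4k block) are assumptions, not something any
existing filesystem is known to do.

```python
BLOCKS_PER_CHUNK = 1024   # assumed: 4 meg chunk / 4k filesystem block


def pick_chunk(used_per_chunk):
    """Return the index of the chunk to allocate from, or None if full.

    Prefers the chunk with the most used blocks that still has a free
    block, so a completely unused chunk is chosen only as a last resort.
    """
    best = None
    for idx, used in enumerate(used_per_chunk):
        if used >= BLOCKS_PER_CHUNK:
            continue                      # chunk is full, skip it
        if best is None or used > used_per_chunk[best]:
            best = idx
    return best
```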

Where the trim request coalescing happens is a more interesting
question.  You can either do it in the filesystem, in the block device
layer, or in the storage array device itself.  One interesting thought
is that perhaps it may actually make more sense to do it in the
filesystem.  Since the filesystem has block allocation data structures
that already tell it which blocks are in use or not, there's no point
replicating that in the storage array --- and so the filesystem can
detect when the last 4k block in a 4 meg chunk has been freed, and
then issue the 4 meg TRIM/UNMAP request to the storage array.  One
advantage of doing it in the filesystem is that the
block allocation data structures are already journaled, and so by
keying this off the filesystem's block allocation structures, we won't
lose any potential TRIM requests even across a reboot.  (In contrast,
if the block device or the storage array is managing a list of trim
requests in the hope of merging enough pieces to cover a 4 meg
aligned TRIM request, the in-memory rbtree is transient and would be
lost if the machine reboots.)
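
A rough sketch of the filesystem-side approach, assuming a 4k block
size, a 4 meg chunk, and a simple per-block in-use bitmap.  The class
and method names are hypothetical, and a real filesystem would consult
its own journaled allocation bitmaps rather than a Python list.

```python
BLOCK_SIZE = 4 * 1024                         # assumed filesystem block size
CHUNK_SIZE = 4 * 1024 * 1024                  # assumed provisioning chunk size
BLOCKS_PER_CHUNK = CHUNK_SIZE // BLOCK_SIZE   # 1024 blocks per chunk


class ChunkTracker:
    """Hypothetical stand-in for a filesystem's block allocation bitmap."""

    def __init__(self, total_blocks):
        self.in_use = [False] * total_blocks

    def allocate_block(self, block):
        self.in_use[block] = True

    def free_block(self, block):
        """Free one block.

        Returns (byte_offset, byte_length) of a chunk-aligned trim if the
        enclosing chunk is now completely unused, otherwise None.
        """
        self.in_use[block] = False
        first = (block // BLOCKS_PER_CHUNK) * BLOCKS_PER_CHUNK
        chunk = self.in_use[first:first + BLOCKS_PER_CHUNK]
        # A short chunk at the end of the device is never trimmed here.
        if len(chunk) == BLOCKS_PER_CHUNK and not any(chunk):
            return (first * BLOCK_SIZE, CHUNK_SIZE)
        return None
```

Because the decision is derived from the allocation state alone, nothing
extra has to be remembered between frees: the trim for a chunk can be
recomputed from the bitmap at any time, including after a reboot.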

Sure, no filesystems do this now, but it's just a Small Matter of
Programming --- and array vendors like EMC (cough, cough) could
easily pay for some filesystem hackers to implement this for some
popular Linux filesystem.  It could even be a directed funding program
through the Linux Foundation if EMC doesn't feel it has sufficient
people who have expertise in the upstream kernel development process.  :-)

                                                - Ted