Re: Thin provisioning & arrays

On Sun, Nov 09, 2008 at 10:40:24PM -0500, Black_David@xxxxxxx wrote:
> Wow, this discussion can generate a lot of traffic ....  I think
> Dave Chinner's recent message is as good a place as any to start:
>  
> > It also ignores the fact that as the filesystem ages they will have
> > fewer and fewer aligned free chunks as the free space fragments.
> > Over time, arrays using large allocation chunks are simply going to
> > be full of wasted space as filesystem allocation patterns degrade
> > if the array vendors ignore this problem.

[snip a bunch of stuff I can't add anything to ;]

> My take on this is that I agree with Ric's comment:
> 
> > In another email Ted mentions that it makes sense for the FS allocator
> > to notice we've just freed the last block in an aligned region of size
> > X, and I'd agree with that.
> >
> > The trim command we send down when we free the block could just
> contain
> > the entire range that is free (and easy for the FS to determine) every
> > time.
> 
> In other words, the filesystem ought to do a small amount of work to
> send down the largest (reasonable) range that it knows is free - this
> seems likely to be more effective than relying on the elevators to
> make this happen.  
> 
> There will be a chunk size value available in a VPD page that can be
> used to determine minimum size/alignment.  For openers, I see
> essentially no point in a 512-byte UNMAP, even though it's allowed by
> the standard - I suspect most arrays (and many SSDs) will ignore it,
> and ignoring it is definitely within the spirit of the proposed T10
> standard (hint: I'm one of the people directly working on that
> proposal).

I think this is the crux of the issue. IMO, it's not much of a standard
when the spirit of the standard is to allow everyone to implement
different, non-deterministic behaviour....
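
As a purely illustrative sketch of "send down the largest range you
know is free, clipped to what the array cares about": the function
below, the granularity/alignment parameters, and the assumption that
they come from the thin provisioning VPD page are all hypothetical,
not an existing interface.

#include <stdint.h>

/*
 * Hypothetical helper: clip a freed block range so that only whole,
 * correctly aligned array chunks are unmapped.  'granularity' and
 * 'alignment' stand in for values the array would report (e.g. via
 * the VPD page); the names are made up for this sketch.
 *
 * Returns the number of blocks to unmap, or 0 if the freed range
 * does not cover a single whole chunk.
 */
static uint64_t clip_unmap_range(uint64_t start, uint64_t len,
                                 uint64_t granularity, uint64_t alignment,
                                 uint64_t *unmap_start)
{
        /* work in chunk-relative coordinates; assumes start >= alignment */
        uint64_t first = start - alignment;
        uint64_t last = (start + len) - alignment;

        /* round the start up and the end down to chunk boundaries */
        first = ((first + granularity - 1) / granularity) * granularity;
        last = (last / granularity) * granularity;

        if (last <= first)
                return 0;       /* nothing whole to unmap */

        *unmap_start = first + alignment;
        return last - first;
}

With something like this, a 512-byte free in the middle of a chunk
clips down to nothing (matching "ignore tiny UNMAPs"), while a free
spanning several chunks is clipped to exactly the whole chunks it
covers.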

> OTOH, it
> may not be possible to frequently assemble large chunks for arrays
> that use them, and I agree with Dave Chinner's remarks on free space
> fragmentation (quoted above) - there's no "free lunch" there.

> Beyond this, I think there may be an underlying assumption that the
> array and filesystem ought to be in sync (or close to it), based on
> overhead and diminishing marginal returns of trying to get ever-
> closer sync. 

I'm not sure I follow - it's possible to have perfect
synchronisation between the array and the filesystem, including
crash recovery.

> Elsewhere, mention has been made of having the
> filesystem's free list behavior be LIFO-like, rather than constantly
> allocating new blocks from previously-unused space.  IMHO, that's a
> good idea for thin provisioning. 

Which is at odds with preventing fragmentation and minimising the
effect of aging on the filesystem. This, in turn, is bad for thin
provisioning because file fragmentation leads to free space
fragmentation over time.

I think the overall goal for filesystems in thin provisioned
environments should be to minimise free space fragmentation - it's
when you fail to have large contiguous regions of free space in 
the filesystem that thin provisioning becomes difficult. How this
is achieved will be different for every filesystem.

> Now, if the workload running on the filesystem causes the capacity
> used to stay within a range, there will be a set of relatively "hot"
> blocks on the free list that are being frequently freed and reallocated.
> It's a performance win not to UNMAP those blocks (saves work in both
> the kernel and on the array), and hence to have the filesystem and
> array views of what's in use not line up.

In that case, the filesystem tracks what it has not issued unmaps
on, so really there is no discrepancy between the filesystem and the
array in terms of free space. The filesystem simply has a "free but
not quite free" list of blocks that haven't been unmapped.

This is like the typical two-stage inode delete that most
journalling filesystems use - one stage to remove it from the
namespace and move it to a "to be freed" list, and then a second
stage to really free it. Inodes on the "to be freed" list can be
reused without being freed, and if a crash occurs they can be
really freed up during recovery. Issuing unmaps is conceptually
little different to this.....
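
To make the analogy concrete, a rough sketch (hypothetical structures
and function names, not lifted from any existing filesystem): freed
extents go onto a pending-unmap list, the allocator can hand them
straight back out without any unmap ever being issued, and whatever is
still pending at commit time - or at the equivalent point in log
recovery - gets unmapped and only then becomes real free space.

#include <stdint.h>
#include <stdlib.h>

/* Placeholders for the real device and free space interfaces. */
extern void issue_unmap(uint64_t start, uint64_t len);
extern void mark_blocks_free(uint64_t start, uint64_t len);

/* An extent that is free in the filesystem but not yet unmapped. */
struct pending_unmap {
        uint64_t start;                 /* first block of the extent */
        uint64_t len;                   /* length in blocks */
        struct pending_unmap *next;
};

static struct pending_unmap *pending_head;  /* "free but not quite free" */

/* Stage 1: free an extent, but defer the unmap. */
static void free_extent_deferred(uint64_t start, uint64_t len)
{
        struct pending_unmap *p = malloc(sizeof(*p));

        if (!p)
                return;         /* a real implementation would handle this */
        p->start = start;
        p->len = len;
        p->next = pending_head;
        pending_head = p;
        /* a real filesystem would also log this so recovery can replay it */
}

/* Hot path: reuse a pending extent without ever issuing an unmap. */
static int realloc_from_pending(uint64_t *start, uint64_t *len)
{
        struct pending_unmap *p = pending_head;

        if (!p)
                return 0;
        pending_head = p->next;
        *start = p->start;
        *len = p->len;
        free(p);
        return 1;
}

/* Stage 2: at commit, or during crash recovery, really free the rest. */
static void process_pending_unmaps(void)
{
        struct pending_unmap *p;

        while ((p = pending_head) != NULL) {
                pending_head = p->next;
                issue_unmap(p->start, p->len);
                mark_blocks_free(p->start, p->len); /* now real free space */
                free(p);
        }
}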

> Despite the derogatory comment about defrag, it's making a comeback.
> I already know of two thin-provisioning-specific defrag utilities from
> other storage vendors (neither is for Linux, AFAIK).  While defrag is
> not a wonderful solution, it does free up space in large contiguous
> ranges, and most of what it frees will be "cold".

The problem is that it is the wrong model to be using for thin
provisioning. It assumes that unmapping blocks as we free them
is fundamentally broken - if unmapping as we go works and is made
reliable, then there is no need for such a defrag tool. Unmapping
can and should be made reliable so that we don't have to waste
effort trying to fix up mismatches that shouldn't have occurred in
the first place...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
