Wow, this discussion can generate a lot of traffic ... I think Dave Chinner's recent message is as good a place as any to start:

> It also ignores the fact that as the filesystem ages they will have
> fewer and fewer aligned free chunks as the free space fragments.
> Over time, arrays using large allocation chunks are simply going to
> be full of wasted space as filesystem allocation patterns degrade
> if the array vendors ignore this problem.
>
> And no matter what us filesystem developers do, there is always
> going to be degradation in allocation patterns as the filesystems
> fill up and age. While we can try to improve aging behaviour, it
> doesn't solve the problem for array vendors - they need to be
> smarter about their allocation and block mapping....

I can't argue with that - this sort of internal fragmentation is a consequence of using a large thin provisioning chunk size. As for why array vendors did this (and EMC is not the only vendor that uses a large chunk size), the answer comes from the original motivating customer situation, for example:

- The database admin insists that this new application needs 4TB, so 4TB of storage is provisioned.
- 3 months later, the application is using 200GB, and not growing much, if at all.

Even a 1GB chunk size makes a big difference for this example.

As for arrays and block tracking, EMC arrays work with 4kB blocks internally. A write smaller than 4kB will often result in reading the rest of the block from disk into cache. The track size that Ric mentioned (64kB) is used to manage on-disk capacity, but the array knows how to do partial track writes.

As for what to optimize for, the chunk size is going to vary widely across different arrays (even EMC's CLARiiON won't use the same chunk size as Symmetrix). Different array implementers will make different decisions about how much state is reasonable to keep.
My take on this is that I agree with Ric's comment:

> In another email Ted mentions that it makes sense for the FS allocator
> to notice we've just freed the last block in an aligned region of size
> X, and I'd agree with that.
>
> The trim command we send down when we free the block could just contain
> the entire range that is free (and easy for the FS to determine) every
> time.

In other words, the filesystem ought to do a small amount of work to send down the largest (reasonable) range that it knows is free - this seems likely to be more effective than relying on the elevators to make this happen. There will be a chunk size value available in a VPD page that can be used to determine minimum size/alignment.

For openers, I see essentially no point in a 512-byte UNMAP, even though it's allowed by the standard - I suspect most arrays (and many SSDs) will ignore it, and ignoring it is definitely within the spirit of the proposed T10 standard (hint: I'm one of the people directly working on that proposal). OTOH, it may not be possible to frequently assemble large chunks for arrays that use them, and I agree with Dave Chinner's remarks on free space fragmentation (quoted above) - there's no "free lunch" there.

Beyond this, I think there may be an underlying assumption that the array and filesystem ought to be in sync (or close to it) on what's in use; I'd question that assumption, based on the overhead and diminishing marginal returns of trying to get ever-closer sync.

Elsewhere, mention has been made of having the filesystem's free list behavior be LIFO-like, rather than constantly allocating new blocks from previously-unused space. IMHO, that's a good idea for thin provisioning. Now, if the workload running on the filesystem causes the capacity used to stay within a range, there will be a set of relatively "hot" blocks on the free list that are being frequently freed and reallocated.
It's a performance win not to UNMAP those blocks (it saves work both in the kernel and on the array), and hence to have the filesystem and array views of what's in use not line up.

Despite the derogatory comment about defrag, it's making a comeback. I already know of two thin-provisioning-specific defrag utilities from other storage vendors (neither is for Linux, AFAIK). While defrag is not a wonderful solution, it does free up space in large contiguous ranges, and most of what it frees will be "cold".

Thanks,
--David

p.s. Apologies in advance for slow responses - my Monday "crisis" is already scheduled ...
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html