Re: thin provisioned LUN support

Ric Wheeler <rwheeler@xxxxxxxxxx> · Fri, 07 Nov 2008 16:04:52 -0500

Chris Mason wrote:
On Fri, 2008-11-07 at 15:26 -0500, Ric Wheeler wrote:

Matthew Wilcox wrote:

On Fri, Nov 07, 2008 at 03:19:13PM -0500, Theodore Tso wrote:

Let's be just a *little* bit fair here.  Suppose we wanted to
implement thin-provisioned disks using devicemapper and LVM; consider
that LVM uses a default PE size of 4M for some very good reasons.
Asking filesystems to be a little smarter about allocation policies so
that we allocate in existing 4M chunks before going onto the next, and
asking the block layer to pool trim requests to 4M chunks is not
totally unreasonable.
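
To make the "pool trim requests to 4M chunks" idea concrete, here is a minimal sketch. It assumes trims arrive as (start, length) byte ranges and that only whole, aligned chunks are worth passing down. The names `merge_ranges` and `aligned_trims` are hypothetical, not an existing block-layer interface.

```python
CHUNK = 4 * 1024 * 1024  # assumed chunk size, matching the 4M LVM PE default


def merge_ranges(ranges):
    """Merge overlapping or adjacent (start, length) ranges into (start, end) pairs."""
    merged = []
    for start, length in sorted(ranges):
        end = start + length
        if merged and start <= merged[-1][1]:
            # Touches or overlaps the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def aligned_trims(ranges, chunk=CHUNK):
    """Return (start, length) trims that cover only whole, aligned chunks."""
    out = []
    for start, end in merge_ranges(ranges):
        first = -(-start // chunk) * chunk  # round start up to a chunk boundary
        last = (end // chunk) * chunk       # round end down to a chunk boundary
        if last > first:
            out.append((first, last - first))
    return out
```

Partial chunks at either end of a merged range are simply dropped in this sketch; a real implementation would have to decide whether to remember them for later.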

Array vendors use chunk sizes larger than typical filesystem chunk sizes
for the same reason that LVM does.  So to say that this is due to
purely a "broken firmware architecture" is a little unfair.

I think we would have a full-throated discussion about whether the
right thing to do was to put the tracking in the block layer or in LVM.
Rather similar to what we're doing now, in fact.

You definitely could imagine having a device mapper target that could 
track the discard commands and the subsequent writes that invalidate 
the previous discards.
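
A rough sketch of that bookkeeping, assuming the target tracks state per fixed-size chunk and keeps it purely in memory. `DiscardTracker` and its methods are hypothetical names for illustration, not a device-mapper API, and nothing here is persisted or crash safe.

```python
class DiscardTracker:
    """Track which chunks are currently discarded (in memory only)."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.discarded = set()  # indices of chunks known to be discarded

    def discard(self, start, length):
        """Record a discard; only chunks fully covered by it are marked."""
        first = -(-start // self.chunk_size)         # round up
        last = (start + length) // self.chunk_size   # round down, exclusive
        self.discarded.update(range(first, last))

    def write(self, start, length):
        """A write invalidates the discard state of every chunk it touches."""
        if length <= 0:
            return
        first = start // self.chunk_size
        last = (start + length - 1) // self.chunk_size
        for chunk in range(first, last + 1):
            self.discarded.discard(chunk)

    def is_discarded(self, chunk):
        return chunk in self.discarded
```

For example, with a chunk size of 4, discarding offsets 0 to 11 marks chunks 0, 1 and 2, and a later one-unit write at offset 5 clears chunk 1 again.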

Actually, it would be kind of nice to move all of this away from the 
file systems entirely.

* Fast
* Crash safe
* Bounded RAM usage
* Accurately deliver the trims

Pick any three ;)  If we're dealing with large files, I can see it
working well.  For files that are likely to be smaller than the physical
extent size, you end up with either extra state bits on disk (and
keeping them in sync) or a log structured lvm.

I do agree that an offline tool to account for bytes used would be able
to make up for this, and from a thin provisioning point of view, we
might be better off if we don't accurately deliver all the trims all the
time.

Given that best practice more or less says users need to set 
the high water mark sufficiently low to allow storage admins to react, I 
think a tool like this would be very useful.

Think of how nasty it would be to run out of real blocks on a device 
that seems to have plenty of unused capacity :-)

People just use the space again soon anyway; I'd guess the
filesystems end up in a steady state outside of special events.

In another email Ted mentions that it makes sense for the FS allocator
to notice we've just freed the last block in an aligned region of size
X, and I'd agree with that.
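
One way an allocator could notice this, sketched under the assumption that it keeps a used-block count per aligned region of X blocks. `RegionAllocator` is a hypothetical toy, not how any particular filesystem does it.

```python
class RegionAllocator:
    """Toy allocator that reports when an aligned region becomes fully free."""

    def __init__(self, total_blocks, region_blocks):
        self.total_blocks = total_blocks
        self.region_blocks = region_blocks  # the "X" above, in blocks
        self.used = set()
        regions = -(-total_blocks // region_blocks)  # round up
        self.used_per_region = [0] * regions

    def allocate(self, block):
        if block not in self.used:
            self.used.add(block)
            self.used_per_region[block // self.region_blocks] += 1

    def free(self, block):
        """Free a block. Return (first_block, count) for the enclosing
        region if this was its last used block, otherwise None."""
        if block not in self.used:
            return None
        self.used.remove(block)
        region = block // self.region_blocks
        self.used_per_region[region] -= 1
        if self.used_per_region[region] != 0:
            return None
        first = region * self.region_blocks
        # The final region may be shorter than region_blocks.
        count = min(self.region_blocks, self.total_blocks - first)
        return (first, count)
```

The returned (first_block, count) pair is what would be handed to the trim path; every other free stays silent.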

The trim command we send down when we free the block could just contain
the entire range that is free (and easy for the FS to determine) every
time.
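
As a sketch of what "the entire range that is free" could mean, assume the filesystem has a simple in-use bitmap and widens the just-freed block to the surrounding run of free blocks. `free_extent` is a hypothetical helper for illustration only.

```python
def free_extent(in_use, block):
    """Return (start, count) of the contiguous free run containing `block`.

    `in_use` is a list of booleans, one per block; `block` must already
    be marked free.
    """
    if in_use[block]:
        raise ValueError("block is still in use")
    start = block
    while start > 0 and not in_use[start - 1]:
        start -= 1  # extend the run to the left
    end = block + 1
    while end < len(in_use) and not in_use[end]:
        end += 1    # extend the run to the right
    return (start, end - start)
```

With blocks 1, 2 and 4 already free, freeing block 3 yields a single trim covering blocks 1 through 4 rather than just block 3.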

-chris

I think sending down the entire contiguous range of freed sectors would work well with these boxes...

ric
