Re: LVM striping RAID volumes

David Brown <david.brown@xxxxxxxxxxxx> · Wed, 25 Jan 2012 18:56:01 +0100

On 25/01/12 14:55, Peter Grandi wrote:
2) must support passing TRIM commands through the RAID layer
(e.g. ext4->LVM->RAID->SSD) to avoid write amplification that
reduces SSD lifetime and performance

That's not really necessary with modern SSDs - TRIM is
overrated. Garbage collection on current generations is so
much better than on earlier models that you generally don't
have to worry about TRIM.

Unfortunately not necessarily just for write amplification, and
the "cleaner" (aka garbage collector) is really helped by TRIM.

The really big deal is that the FTL in the flash SSD cannot
figure out which flash-pages are unused, and cannot use a simple
heuristic like "it is all zeroes" because filesystem code does
not zero unused logical sectors when they are released, but only
writes them much later when they are allocated again. TRIM is
just a way to ''write'' a logical sector as unused without
zero-filling it (or using other implicit marks).
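
As a rough illustration of that point, here is a toy model in Python.
The ToyFTL class and its methods are made up for this sketch and do not
correspond to any real firmware; it only shows that a deletion which
touches nothing but filesystem metadata leaves the drive's idea of
"live" sectors unchanged until something like TRIM tells it otherwise.

```python
class ToyFTL:
    """Hypothetical model: the set of logical sectors the drive must preserve."""

    def __init__(self):
        self.live = set()

    def write(self, lba):
        # Any sector that has been written has to be kept by the drive.
        self.live.add(lba)

    def trim(self, lba):
        # Explicit hint from the host: this sector's contents are not needed.
        self.live.discard(lba)


ftl = ToyFTL()
for lba in range(100):          # the filesystem writes a 100-sector file
    ftl.write(lba)

# The file is deleted. Only filesystem metadata changes, so the drive
# receives no command at all for these sectors.
fs_free = set(range(100))
live_without_trim = len(ftl.live)

# The same deletion, this time reported to the drive via TRIM.
for lba in fs_free:
    ftl.trim(lba)
live_with_trim = len(ftl.live)
```

In this model the cleaner would still have to copy all 100 sectors
around before the TRIM, and none of them afterwards.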

Dropping TRIM makes your life /much/ easier with SSDs,
especially when you want raid.  According to some benchmarks
I've seen, it also makes the disk measurably faster.

While something like TRIM is really important, TRIM has a bad
reputation, and that is due to SATA TRIM being specified badly:
it is specified to be synchronous (or cache-flushing or
queue-flushing).

I've read about this in a few places - there are several shortcomings 
in SATA TRIM that make it difficult to implement and much less useful 
than it could be.

One problem is that TRIM is synchronous, as you say.  That means if it 
is used during deletes, it makes them much slower - potentially very 
much slower.  Secondly, there is no consistency as to what is read back 
from a trimmed sector.  Had it always been read as zero, it would have 
been much better suited to raid.

Anyhow, apart from write amplification, the really big deal is
maximum write latency (and relatedly read latency!). Consider
this scary comparison:

   http://www.storagereview.com/images/samsung_830_raid_256gb_write_latency.png

as discussed in one of my many recent flash SSD blog entries:

   http://www.sabi.co.uk/blog/12-one.html#120115

Since erasing a flash-block can take a long time, it is very
important for minimizing the highest write latency that the FTL
have available a pool of pre-erased flash-blocks, so they can be
written (OR'ed) to directly ("overprovisioning" in most flash
SSDs is done to allow this too).

Overprovisioning is the key here.  When the SSD has more flash space 
than is visible to the OS, then that space is always guaranteed free - 
though not necessarily in contiguous erase blocks.  The more such free 
space there is, the higher the chances of there being free full blocks 
when they are needed, and the more flexibility the SSD firmware has in 
combining partly-written blocks to free up full erase blocks.

So if you have sufficient free space due to overprovisioning, you quite 
simply do not need TRIM, as TRIM is just an expensive way of increasing 
this free space.

How much overprovisioning you want depends on how much you want to 
reduce the risk of unexpected latencies, and how much extra space you 
are willing to pay for.  More expensive (or rather, higher quality) 
SSDs have more overprovisioning.  You can also make your own 
overprovisioning by simply not allocating all the disk when partitioning 
it (or using a smaller "size" when using the whole disk in an mdadm 
raid).  Since there is an area that is never written to, it is 
effectively extra overprovisioned space.
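
To put rough numbers on that, here is a small Python sketch.  The 
function name and the capacities are invented for illustration and do 
not describe any particular drive; it also assumes the unpartitioned 
area really has never been written.

```python
def spare_fraction(raw_flash, in_use):
    """Fraction of the raw flash that never holds live data.

    raw_flash: total flash on the drive (any unit, hypothetical figure)
    in_use:    capacity that can actually be written by the host,
               i.e. the exposed size or the partitioned part of it
    """
    return (raw_flash - in_use) / raw_flash


RAW = 256          # hypothetical raw flash, GiB
EXPOSED = 240      # hypothetical capacity visible to the OS, GiB
PARTITIONED = 200  # what we actually allocate when partitioning, GiB

factory_spare = spare_fraction(RAW, EXPOSED)        # whole disk in use
homemade_spare = spare_fraction(RAW, PARTITIONED)   # 40 GiB left unallocated
```

With these made-up figures, leaving 40 GiB unpartitioned raises the 
spare area from about 6% to about 22% of the raw flash.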

The problem is that the "cleaner" (aka garbage collector) can
only pack "used" flash-pages together, thus creating empty
flash-blocks, if it knows which logical sectors and thus
flash-pages are "unused".

Since the TRIM command is synchronous it is often a bad idea to
use it on every logical sector deallocation in filesystem code,
but it or FITRIM should be used at least periodically (for
example during 'fsck') to tell the FTL which logical sectors are
unused so it can rebuild the pool of empty flash-blocks, and
doing it periodically would work around the synchronous nature
of SATA TRIM.
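
For what a periodic run might look like from user space, here is a
minimal Python sketch of issuing FITRIM on a mounted filesystem. It
assumes Linux, the generic ioctl number encoding used on x86 and ARM,
a filesystem that implements FITRIM, and sufficient privileges; the
helper names (_iowr, fstrim_range, trim_filesystem) are mine, not part
of any library.

```python
import fcntl
import os
import struct

# struct fstrim_range { __u64 start; __u64 len; __u64 minlen; }
_RANGE_FMT = "=QQQ"


def _iowr(type_char, nr, size):
    # Generic Linux _IOWR() encoding (assumption: not valid on every arch).
    return (3 << 30) | (size << 16) | (ord(type_char) << 8) | nr


# FITRIM is defined in linux/fs.h as _IOWR('X', 121, struct fstrim_range).
FITRIM = _iowr("X", 121, struct.calcsize(_RANGE_FMT))


def fstrim_range(start=0, length=2**64 - 1, minlen=0):
    """Pack the ioctl argument; the defaults cover the whole filesystem."""
    return struct.pack(_RANGE_FMT, start, length, minlen)


def trim_filesystem(mountpoint):
    """Ask the filesystem to discard its free space; returns bytes trimmed."""
    fd = os.open(mountpoint, os.O_RDONLY)
    try:
        buf = bytearray(fstrim_range())
        fcntl.ioctl(fd, FITRIM, buf)   # the kernel updates 'len' in place
        return struct.unpack(_RANGE_FMT, bytes(buf))[1]
    finally:
        os.close(fd)
```

Something like trim_filesystem("/mnt/data") (path hypothetical) run
from cron at a quiet time is the kind of periodic use meant above.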

Also TRIM and FITRIM are useful for any case of virtualization,
not just for flash SSD layers, for example for "sparse" (aka
thin provisioning) VM disk images.

It would be nice if MD passed on TRIM or at least FITRIM, and I
have just done a search and there is a discussion of some issues
with that here:

http://lkml.indiana.edu/hypermail/linux/kernel/1011.2/02184.html

  «the only really complex part is sending something like that
   into MDraid, because that one set of ranges might explode into
   thousands of ranges and then have to be coalesced back down to
   a more manageable number of ranges.

   ie. with a simple raid 0, each range will need to be broken
   into a bunch of stride sized ranges, then the contiguous
   strides on each spindle coalesced back into larger ranges.

   But if MDraid can handle discards now with one range, it
   should not be that hard to teach it handle a group of ranges.»

This perplexes me because the logic should be identical to that
of writing: TRIM is in effect a variant of WRITE.
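
To make the comparison with writing concrete, here is a sketch in
Python of splitting one logical discard range across a simple RAID 0
and coalescing the pieces per member. It assumes equal-sized members
and a plain round-robin chunk layout; the function name and parameters
are hypothetical and this is not MD's actual code.

```python
from collections import defaultdict


def split_discard_raid0(start, length, chunk, ndisks):
    """Map a logical range (in sectors) to coalesced per-disk ranges.

    Returns {disk_index: [(disk_start, disk_length), ...]}.
    """
    per_disk = defaultdict(list)
    pos = start
    end = start + length
    while pos < end:
        chunk_no, offset = divmod(pos, chunk)
        stripe_no, disk = divmod(chunk_no, ndisks)
        n = min(chunk - offset, end - pos)      # stay inside this chunk
        dstart = stripe_no * chunk + offset     # sector on the member disk
        ranges = per_disk[disk]
        if ranges and ranges[-1][0] + ranges[-1][1] == dstart:
            # Contiguous with the previous piece on this disk: merge.
            ranges[-1] = (ranges[-1][0], ranges[-1][1] + n)
        else:
            ranges.append((dstart, n))
        pos += n
    return dict(per_disk)


# 28 sectors starting at logical sector 4, chunk of 8 sectors, 2 disks.
example = split_discard_raid0(start=4, length=28, chunk=8, ndisks=2)
```

In this toy layout the chunk-sized pieces of one contiguous logical
range always merge back into a single range per member, which is the
same mapping a large WRITE would go through.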
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
