Re: Software RAID and TRIM

David Brown <david@xxxxxxxxxxxxxxx> · Thu, 30 Jun 2011 09:50:28 +0200

On 30/06/2011 02:28, NeilBrown wrote:
On Wed, 29 Jun 2011 14:46:08 +0200 David Brown <david@xxxxxxxxxxxxxxx> wrote:

On 29/06/2011 12:45, NeilBrown wrote:
On Wed, 29 Jun 2011 11:32:55 +0100 (BST) Tom De Mulder <tdm27@xxxxxxxxx> wrote:

On Tue, 28 Jun 2011, Mathias Burén wrote:

IIRC md can already pass TRIM down, but I think the filesystem needs
to know about the underlying architecture, or something, for TRIM to
work in RAID.

Yes, it's (usually/ideally) the filesystem's job to invoke the TRIM
command, and that's what ext4 can do. I have it working just fine on
single drives, but for reasons of service reliability would need to get
RAID to work.

I tried (on an admittedly vanilla Ubuntu 2.6.38 kernel) the same on a two
drive RAID1 md and it definitely didn't work (the blocks didn't get marked
as unused and zeroed).

There are numerous discussions on this in the archives of
this mailing list.

Given how fast things move in the world of SSDs at the moment, I wanted to
check if any progress was made since. :-) I don't seem to be able to find
any reference to this in recent kernel source commits (but I'm a complete
amateur when it comes to git).

Trim support for md is a long way down my list of interesting projects (and
no-one else has volunteered).

It is not at all straightforward to implement.

For stripe/parity RAID (RAID4/5/6), it is only safe to discard full stripes at
a time, and the md layer would need to keep a record of which stripes had been
discarded so that it didn't risk trusting data (and parity) read from those
stripes.  So you would need some sort of bitmap of invalid stripes, and you
would need the fs to discard in very large chunks for it to be useful at all.
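
To illustrate the full-stripe constraint above, here is a rough sketch of
how a discard request could be trimmed down to the whole stripes it covers.
It assumes a stripe spans chunk_sectors * data_disks sectors of array
address space; the function name and parameters are made up for
illustration and are not md code.

```python
def full_stripe_range(start, length, chunk_sectors, data_disks):
    """Return (first_stripe, stripe_count) for the whole stripes that lie
    entirely inside the discard request [start, start + length).

    All values are in sectors of array address space. The layout
    (stripe = chunk_sectors * data_disks) is an assumption of this sketch.
    """
    stripe_sectors = chunk_sectors * data_disks
    end = start + length
    first = -(-start // stripe_sectors)  # round the start up to a stripe boundary
    last = end // stripe_sectors         # round the end down (exclusive)
    return first, max(0, last - first)


# 64 KiB chunks (128 sectors) on a 4-disk RAID5, i.e. 3 data disks:
# one stripe is 384 sectors, so a 1000-sector discard starting at
# sector 100 covers only one complete stripe.
print(full_stripe_range(100, 1000, chunk_sectors=128, data_disks=3))
```

A request smaller than one stripe yields a count of 0, which is why the
filesystem would have to discard in very large chunks for this to help.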

For copying RAID (RAID1, RAID10) you really need the same bitmap.  There
isn't the same risk of reading and trusting discarded parity, but a resync
which didn't know about discarded ranges would undo the discard for you.

So it basically requires another bitmap to be stored with the metadata, and a
fairly fine-grained bitmap it would need to be.  Then every read and resync
checks the bitmap and ignores or returns 0 for discarded ranges, and every
write needs to check and, if the range was discarded, clear the bit and write
to the whole range.
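
The read/write/resync rules just described could be modelled roughly as
follows. This is a toy in-memory sketch, not md code: the class name, the
one-flag-per-region layout and the region size are all assumptions made for
illustration.

```python
class DiscardMap:
    """Toy model of an array with one 'discarded' flag per fixed-size region."""

    def __init__(self, num_regions, region_size):
        self.region_size = region_size
        self.discarded = [False] * num_regions
        self.data = [bytes(region_size) for _ in range(num_regions)]

    def discard(self, region):
        # After a discard the device contents can no longer be trusted;
        # model that by leaving arbitrary bytes behind.
        self.data[region] = b"\xff" * self.region_size
        self.discarded[region] = True

    def read(self, region):
        # Reads of a discarded region return zeros, never the stale contents.
        if self.discarded[region]:
            return bytes(self.region_size)
        return self.data[region]

    def write(self, region, offset, payload):
        if offset < 0 or offset + len(payload) > self.region_size:
            raise ValueError("write does not fit inside the region")
        if self.discarded[region]:
            # Clear the bit and write the whole region: zeros plus the payload.
            buf = bytearray(self.region_size)
            self.discarded[region] = False
        else:
            buf = bytearray(self.data[region])
        buf[offset:offset + len(payload)] = payload
        self.data[region] = bytes(buf)

    def regions_to_resync(self):
        # A resync skips discarded regions so that it does not undo the discard.
        return [i for i, flag in enumerate(self.discarded) if not flag]
```

The finer the region size, the more small discards can be honoured, but
the larger the bitmap that has to be stored with the metadata.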

So: do-able, but definitely non-trivial.

Wouldn't the sync/no-sync tracking you already have planned be usable
for tracking discarded areas?  Or will that not be fine-grained enough
for the purpose?

That would be a necessary precursor to DISCARD support: yes.
DISCARD would probably require a much finer grain than I would otherwise
suggest but I would design the feature to allow a range of granularities.

I suppose the big win for the sync/no-sync tracking is when initialising 
an array - arrays that haven't been written don't need to be in sync. 
But you will probably be best with a list of sync (or no-sync) areas for 
that job, rather than a bitmap, as there won't be very many such blocks 
(a few dozen, perhaps, for multiple partitions and filesystems like XFS 
that write in different areas) and as the disk gets used, the "no-sync" 
areas will decrease in size and number.  For DISCARD, however, you'd get 
no-sync areas scattered around the disk.
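
As a rough illustration of the list-based alternative, the no-sync areas
could be kept as a sorted list of half-open (start, end) extents. The
class and method names below are hypothetical, chosen only for this sketch.

```python
class NoSyncExtents:
    """Sorted list of half-open (start, end) ranges that are not in sync."""

    def __init__(self, size):
        # A freshly created array starts as one big no-sync extent.
        self.extents = [(0, size)]

    def mark_synced(self, start, end):
        """A write to [start, end) removes that range from the no-sync list."""
        out = []
        for s, e in self.extents:
            if e <= start or s >= end:
                out.append((s, e))      # no overlap, keep as-is
                continue
            if s < start:
                out.append((s, start))  # keep the part before the write
            if e > end:
                out.append((end, e))    # keep the part after the write
        self.extents = out

    def mark_unsynced(self, start, end):
        """A discard of [start, end) adds a no-sync range, merging neighbours."""
        merged = []
        for s, e in sorted(self.extents + [(start, end)]):
            if merged and s <= merged[-1][1]:
                merged[-1] = (merged[-1][0], max(merged[-1][1], e))
            else:
                merged.append((s, e))
        self.extents = merged
```

With only initial writes, the list stays at a handful of entries.
Scattered discards add one entry each unless they touch an existing
extent, which is where a bitmap starts to look more attractive.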
