On Wed, 2011-02-16 at 21:27 +1100, NeilBrown wrote:
> Bitmap of non-sync regions.
> ---------------------------
> The granularity of the bit is probably quite hard to get right.
> Having it match the block size would mean that no resync would be
> needed and that every discard request could be handled exactly.
> However it could result in a very large bitmap - 30 Megabytes for a 1
> terabyte device with a 4K block size. This would need to be kept in
> memory and looked up for every access, which could be problematic.

Why not store the map as a list of regions, each defined by
<start address><finish address>?  This may give a better performance vs.
(storage + memory) trade-off than a bitmap, which always has a
granularity vs. storage problem.  A range list is likely more efficient
to store than a bitmap, and it makes granularity a non-issue: the
granularity is simply the block size.

The limitation of this scheme is in choosing the size of the map.  The
larger the map, the more regions can be stored before we can no longer
add new discards or splits (a split happens when a write lands somewhere
in the middle of a non-sync region).  However this could be handled, and
the best performance retained, by ensuring that the largest non-sync
regions are always kept in the list.

If we used full LBA48 addressing, each entry in the map would take 12
bytes (2 x 48 bits).  (Perhaps this could be reduced for smaller devices
that need fewer address bits.)  That gives 85.3 entries per KB, or
87381.3 per MB of map size on disk (excluding possible headers).  For a
1TB raid volume, a 1MB map provides roughly one entry for every 12MB of
disk space.  This sounds coarse, but when you consider that regions are
set in units of the media's block size, it isn't.  Furthermore, once the
filesystem is so fragmented that you've exhausted the map space, the
unhandled non-sync/discarded regions would be so small that you'd gain
little benefit from tracking them.
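To make the idea concrete, here is a minimal sketch of such a range list
in C.  All names (`region_map`, `mark_nonsync`, `mark_write`) and the
tiny capacity are my own inventions for illustration, not anything that
exists in md: a discard merges into the sorted list, a write splits or
shrinks covering regions, and when the list is full the smallest region
is evicted so the largest regions take precedence.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define MAX_REGIONS 4   /* tiny for illustration; a 1MB map holds ~87k */

/* A non-sync region: [start, end) in sectors (48-bit LBAs fit here). */
struct region { uint64_t start, end; };

struct region_map {
    struct region r[MAX_REGIONS];   /* sorted, non-overlapping */
    size_t n;
};

/* Remove entry i, shifting the tail down. */
static void map_del(struct region_map *m, size_t i)
{
    memmove(&m->r[i], &m->r[i + 1], (m->n - i - 1) * sizeof(m->r[0]));
    m->n--;
}

/* Insert an entry at i.  Caller guarantees m->n < MAX_REGIONS. */
static void map_ins(struct region_map *m, size_t i, uint64_t s, uint64_t e)
{
    memmove(&m->r[i + 1], &m->r[i], (m->n - i) * sizeof(m->r[0]));
    m->r[i].start = s;
    m->r[i].end = e;
    m->n++;
}

/* Mark [s, e) non-sync: merge overlapping/adjacent entries; when full,
 * evict the smallest region so large regions take precedence. */
static void mark_nonsync(struct region_map *m, uint64_t s, uint64_t e)
{
    size_t i = 0;
    while (i < m->n && m->r[i].end < s)        /* skip regions before us */
        i++;
    while (i < m->n && m->r[i].start <= e) {   /* absorb overlaps */
        if (m->r[i].start < s) s = m->r[i].start;
        if (m->r[i].end > e)   e = m->r[i].end;
        map_del(m, i);
    }
    if (m->n == MAX_REGIONS) {                 /* full: find the smallest */
        size_t small = 0;
        for (size_t j = 1; j < m->n; j++)
            if (m->r[j].end - m->r[j].start <
                m->r[small].end - m->r[small].start)
                small = j;
        if (m->r[small].end - m->r[small].start >= e - s)
            return;                            /* new region is smallest */
        map_del(m, small);
        if (small < i)
            i--;
    }
    map_ins(m, i, s, e);
}

/* A write to [s, e) makes that range in-sync again: shrink or split any
 * covering regions. */
static void mark_write(struct region_map *m, uint64_t s, uint64_t e)
{
    for (size_t i = 0; i < m->n; ) {
        struct region *r = &m->r[i];
        if (r->end <= s || r->start >= e) { i++; continue; }
        if (r->start < s && r->end > e) {      /* write in the middle */
            uint64_t tail_end = r->end;
            r->end = s;
            if (m->n < MAX_REGIONS)            /* split into two regions */
                map_ins(m, i + 1, e, tail_end);
            /* if full, the tail half is lost: map stays conservative */
            return;
        }
        if (r->start < s)    { r->end = s;   i++; }  /* trim the tail  */
        else if (r->end > e) { r->start = e; i++; }  /* trim the head  */
        else                 map_del(m, i);          /* fully covered  */
    }
}
```

Note that losing an entry on overflow is always safe: a sector wrongly
considered in-sync only costs an unnecessary resync, never data.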
A bit of logic could ensure that large regions take precedence over
smaller ones, as this gives the best performance for resync/check
passes.

Another benefit is that it makes it easy for md to pass TRIM
instructions down to media that support the feature whenever a
region/stripe is marked as non-sync.  For RAID levels 0 and linear there
would be no need for a map, and TRIM could be passed straight through to
the media.  For RAID1/10, a TRIM would be issued to the media whenever a
chunk is contained entirely within a non-sync region.  With RAID4/5/6, a
TRIM would only be issued when a whole stripe is contained within a
non-sync region.

The real beauty of this region map is that creating a new raid volume
could (unless --assume-clean is set) mark the entire volume as non-sync
with a single entry in the list.

Of course this suggestion is only theoretical, and I might be way off on
the implementation cost vs. benefits and feasibility.

Regards,
-- 
Daniel Reurich.
Centurion Computer Technology (2005) Ltd
Mobile 021 797 722
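The per-level TRIM decision above reduces to a single containment test
once you pick the right granule.  Here is a self-contained sketch (all
names are hypothetical, not md's API): the granule is a chunk for
RAID1/10 and a whole stripe (chunk x data disks) for RAID4/5/6, since
parity must stay consistent per stripe; RAID0/linear need no map at all.

```c
#include <stdint.h>

/* Non-sync regions as [start, end) sector ranges, sorted and
 * non-overlapping. */
struct nonsync { uint64_t start, end; };

/* 1 if [s, e) lies entirely inside one non-sync region.
 * (Linear scan for clarity; a sorted list admits binary search.) */
static int range_nonsync(const struct nonsync *map, int n,
                         uint64_t s, uint64_t e)
{
    for (int i = 0; i < n; i++)
        if (map[i].start <= s && e <= map[i].end)
            return 1;
    return 0;
}

/* May a TRIM for the granule containing `sector` be passed down?
 *   raid1/10:  unit = chunk size (in sectors)
 *   raid4/5/6: unit = chunk size * number of data disks (whole stripe)
 *   raid0/linear: no map needed, TRIM always passes through */
static int trim_unit_ok(const struct nonsync *map, int n,
                        uint64_t sector, uint64_t unit)
{
    uint64_t start = sector - sector % unit;   /* round down to unit */
    return range_nonsync(map, n, start, start + unit);
}
```

A freshly created volume with one entry covering the whole device makes
every granule pass this test, so the entire array could be trimmed at
creation time.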