Re: block layer API for file system creation - when to use multidisk mode

Dave Chinner <david@xxxxxxxxxxxxx> · Sat, 1 Dec 2018 15:35:09 +1100

On Fri, Nov 30, 2018 at 01:00:52PM -0500, Ric Wheeler wrote:
> On 11/30/18 7:55 AM, Dave Chinner wrote:
> >On Thu, Nov 29, 2018 at 06:53:14PM -0500, Ric Wheeler wrote:
> >>Other file systems also need to
> >>accommodate/probe behind the fictitious visible storage device
> >>layer... Specifically, is there something we can add per block
> >>device to help here? Number of independent devices
> >That's how mkfs.xfs used to do stripe unit/stripe width calculations
> >automatically on MD devices back in the 2000s. We got rid of that
> >for more generally applicable configuration information such as
> >minimum/optimal IO sizes so we could expose equivalent alignment
> >information from lots of different types of storage device....
> >
> >>or a map of
> >>those regions?
> >Not sure what this means or how we'd use it.
> >Dave.
> 
> What I was thinking of was a way of giving us a good outline of how
> many independent regions are behind one "virtual" block device
> like a ceph rbd or device mapper device. My assumption is that we
> are trying to lay down (at least one) allocation group per region.
> 
> What we need to optimize for includes:
> 
>     * how many independent regions are there?
> 
>     * what are the boundaries of those regions?
> 
>     * optimal IO size/alignment/etc
> 
> Some of that we have, but the current assumptions don't work well
> for all device types.

Oh, so essentially "independent regions" of the storage device. I
wrote this in 2008:

http://xfs.org/index.php/Reliable_Detection_and_Repair_of_Metadata_Corruption#Failure_Domains

This was derived from the ideas in prototype code I wrote in ~2007
to try to optimise file layout and load distribution across linear
concats of multi-TB RAID6 luns. Some of that work was published
long after I left SGI:

https://marc.info/?l=linux-xfs&m=123441191222714&w=2

Essentially, the independent regions were called "Logical Extension
Groups", or "legs" of the filesystem, and each leg would be an
aggregation of the AGs in that region. The concept was that we'd
move the geometry information from the superblock into the legs, so
we could have different AG geometry optimised for each independent
leg of the filesystem.

e.g. the SSD region could have numerous small AGs, the large,
contiguous RAID6 part could have maximally sized AGs or even make use
of the RT allocator for free space management instead of the
AG/btree allocator. Basically it was seen as a mechanism for getting
rid of needing to specify block devices as command line or mount
options.

Fundamentally, though, it was based on the concept that Linux would
eventually grow an interface for the block device/volume manager to
tell the filesystem where the independent regions in the device
were(*), but that's not something that has ever appeared. If you can
provide an independent region map in an easy to digest format (e.g. a
set of {offset, len, geometry} tuples), then we can obviously make
use of it in XFS....
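
To make the tuple idea concrete, here is a rough sketch (Python, purely
illustrative) of what a consumer of such a map might do: sort the
regions, check that they don't overlap, and look up which region a
given device offset falls in. No such kernel or XFS interface exists
today; the names Region, validate and region_for are hypothetical, and
using minimum/optimal IO sizes as the "geometry" part of the tuple is
an assumption, not a proposed format.

```python
import bisect
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Region:
    """One hypothetical {offset, len, geometry} tuple, all in bytes."""
    offset: int   # start of the region within the block device
    length: int   # size of the region
    io_min: int   # assumed geometry: minimum IO size (e.g. stripe unit)
    io_opt: int   # assumed geometry: optimal IO size (e.g. stripe width)


def validate(regions: List[Region]) -> List[Region]:
    """Return the regions sorted by offset; reject empty or overlapping ones."""
    ordered = sorted(regions, key=lambda r: r.offset)
    end = 0
    for r in ordered:
        if r.length <= 0:
            raise ValueError(f"empty region at offset {r.offset}")
        if r.offset < end:
            raise ValueError(f"region at offset {r.offset} overlaps previous")
        end = r.offset + r.length
    return ordered


def region_for(ordered: List[Region], pos: int) -> Optional[Region]:
    """Find the region containing byte position pos, or None if in a gap."""
    starts = [r.offset for r in ordered]
    i = bisect.bisect_right(starts, pos) - 1
    if i >= 0 and pos < ordered[i].offset + ordered[i].length:
        return ordered[i]
    return None


if __name__ == "__main__":
    GiB = 1 << 30
    # Made-up example: a small SSD region followed by a large RAID6 lun.
    region_map = validate([
        Region(100 * GiB, 900 * GiB, io_min=64 * 1024, io_opt=512 * 1024),
        Region(0, 100 * GiB, io_min=4096, io_opt=4096),
    ])
    print(len(region_map), "independent regions")
    print(region_for(region_map, 150 * GiB))
```

With something along these lines, mkfs could walk the sorted map and
pick AG geometry per region rather than once for the whole device.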

Cheers,

Dave.

(*) Basically provide a Linux version of the functionality Irix
volume managers had provided filesystems since the late 80s....

-- 
Dave Chinner
david@xxxxxxxxxxxxx