Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl

Chris Mason <chris.mason@xxxxxxxxxx> · Fri, 30 May 2008 09:37:58 -0400

On Thursday 29 May 2008, Andreas Dilger wrote:
> On May 28, 2008  12:33 -0400, Chris Mason wrote:
> > On Wednesday 28 May 2008, Andreas Dilger wrote:
> > > For Lustre, it is completely inefficient to return data in
> > > non-LUN_ORDER, because it is doing RAID-0 striping of the file data
> > > across data servers. A 100MB 2-stripe file with 1MB stripes would have
> > > to return 100 extents, even if the file data is allocated contiguously
> > > on disk in the backing filesystems in two 50MB chunks.  With LUN_ORDER
> > > it will return 2 extents and the user can see much more clearly that
> > > the file is laid out well.
> >
> > Ah, so lustre doesn't have a logical address layer at all?  In my case
> > the files contain pointers to contiguous logical extents, and the lower
> > layers of the FS figure out whether that is raid0/1/10 or whatever future
> > crud I toss in.
> >
> > If the logical extents are contiguous it is safe to assume the lower end
> > is also contiguous.
>
> Well, Lustre has a logical address layer on a per-file basis, but the
> layout maps from the file offsets to multiple object offsets.  There is
> no "flat" logical device in the background which file allocations are
> coming from, because the API provided to the client is based only on
> objects and offsets, and there may be multiple objects that map into a
> single file via some striping.  That is currently RAID-0 across objects,
> but it might be RAID-1/5/6 or something else in the future.  With the
> RAID-0 layout, the logical file offsets round-robin across the multiple
> objects with a certain stripe size (default 1MB).
>
> It sounds like you actually have the same setup with btrfs (if it is at
> all like ZFS) that file blocks map onto multiple disks, and there may
> be multiple copies of the data (RAID-1/10).

In my case, all pointers to extents (both metadata blocks and file data) 
reference a logical address space.  So, even for raid10 or raid5/6 if I ever 
code it, there is a central place that does translation from 
logical->physical block(s).

The disk format supports multiple (2^64) such namespaces but that isn't being 
used yet.
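
To make the round-robin RAID-0 layout from the quoted text concrete, here is a
minimal sketch of how a file offset could map to an (object, object offset)
pair. The function name and parameters are hypothetical and chosen for
illustration only; this is not Lustre or Btrfs code, just the arithmetic the
description implies, assuming a fixed stripe size and stripe count.

```python
STRIPE_SIZE = 1 << 20  # 1MB, the default stripe size mentioned in the thread


def map_offset(file_offset, stripe_count, stripe_size=STRIPE_SIZE):
    """Map a logical file offset to (object_index, object_offset).

    Hypothetical helper: assumes plain RAID-0 striping where stripe-sized
    pieces of the file are handed out round-robin across the objects.
    """
    stripe_number = file_offset // stripe_size      # which stripe of the file
    object_index = stripe_number % stripe_count     # round-robin over objects
    # Each object receives every stripe_count-th stripe, packed back to back.
    object_offset = (stripe_number // stripe_count) * stripe_size
    object_offset += file_offset % stripe_size      # position inside the stripe
    return object_index, object_offset


# The third 1MB stripe of a 2-stripe file lands back on object 0, at 1MB.
print(map_offset(2 << 20, stripe_count=2))
```

With a single flat logical address space, a mapping of this kind lives in one
central translation layer; with per-file striping, it is part of each file's
layout.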

>
>
> What a user/administrator really cares about in the end is whether
> the files are allocated contiguously within the objects on the server
> filesystems.  If we were to run filefrag (with FIEMAP support) on a
> Lustre file without LUN_ORDER, or maybe a RAID-5 btrfs file, it would
> return a list of extents, each broken up at smaller boundaries, and it
> will convey the wrong idea of how the file is laid out physically.
>

For Btrfs, it'll always return the logical extents, and because the storage is
grouped in relatively large chunks (~1GB or more), this is sufficient for
measuring fragmentation.
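
As a rough illustration of why the ordering matters for the striped case, the
sketch below counts extents for the 100MB, 2-stripe, 1MB-stripe example from
the quoted text, once in file-offset order and once grouped by object (the
LUN_ORDER idea). The helper names are hypothetical, and the model assumes each
object is allocated contiguously on its backing filesystem, as in the example.

```python
MB = 1 << 20


def stripe_pieces(file_size, stripe_count, stripe_size):
    """Yield (file_offset, object_index, object_offset, length) per stripe."""
    offset = 0
    while offset < file_size:
        length = min(stripe_size, file_size - offset)
        stripe_number = offset // stripe_size
        obj = stripe_number % stripe_count
        obj_off = (stripe_number // stripe_count) * stripe_size
        yield offset, obj, obj_off, length
        offset += length


def count_extents(pieces):
    """Count extents after merging neighbours that are contiguous in one object."""
    count = 0
    prev_end = None  # (object_index, end offset) of the previous piece
    for _, obj, obj_off, length in pieces:
        if prev_end != (obj, obj_off):
            count += 1  # a new extent starts here
        prev_end = (obj, obj_off + length)
    return count


pieces = list(stripe_pieces(100 * MB, stripe_count=2, stripe_size=MB))
file_order = count_extents(pieces)
lun_order = count_extents(sorted(pieces, key=lambda p: (p[1], p[2])))
print(file_order, lun_order)  # prints "100 2"
```

In file order every piece switches objects, so nothing merges; grouped by
object, each object's pieces collapse into a single 50MB extent.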

But if Lustre doesn't have this kind of logical backing store, I think that is
reason enough to keep the lun interface.  I know lots of people are against
adding interfaces to the kernel for out-of-tree projects, but the per-file
logical mapping you describe is a very reasonable way to design things, and
we might as well leave it in for future use.

> Dropping lun/device support, and removing all of the flexibility of the
> FIEMAP interface design, is IMHO killing the whole reason I proposed
> FIEMAP in the first place.

My goal isn't to remove the flexibility from the interface design; it is just
to ask whether all of this functionality needs to be in one ioctl.  At least
the device number / lun bit makes sense now (Mark, if you keep it, please don't
make this a dev_t).  Thanks for the extra details.

-chris