Re: [RFC][PATCH 0/5] Fiemap, an extent mapping ioctl

Andreas Dilger <adilger@xxxxxxx> · Wed, 28 May 2008 10:09:31 -0600

On May 27, 2008  13:19 -0400, Chris Mason wrote:
> On Tuesday 27 May 2008, jim owens wrote:
> > For what it is worth, a few comments from a newbie who has
> > experience with a non-linux filesystem that has a similar API
> > and supports files spread across multiple devices.
> >
> > Mark Fasheh wrote:
> > > * FIEMAP_FLAG_LUN_ORDER
> > > If the file system stripes file data, this will return contiguous
> > > regions of physical allocation, sorted by LUN. Logical offsets may not
> > > make sense if this flag is passed. If the file system does not support
> > > multiple LUNs, this flag will be ignored.
> >
> > This should return an error (ENOTSUPPORTED ?) if the FS does
> > not support multiple devices OR does not support sort-by-lun-order
> > so the caller does not count on the info being sorted.  Even an FS
> > that supports multiple devices per file may be unable to sort it
> > by on-disk-order without consuming an ugly set of resources.
> 
> That's a good point, I couldn't provide 100% sorted output even if I wanted 
> to.

I'm OK with this also.  The only reason I thought "simple" filesystems
(i.e. single-lun) should ignore FLAG_LUN_ORDER is so that tools like
filefrag can always try with LUN_ORDER and in most cases still get a
mapping returned.  If the filesystem doesn't care about LUN_ORDER, no
harm done, because all of the extents live on a single LUN anyway.  If
a multi-device filesystem doesn't want to implement LUN_ORDER, returning
-EBADR is perfectly acceptable because the application will retry without
the unsupported flags (LUN_ORDER in this case) and get the logical file
offset order data returned.
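
As a rough sketch of that retry (not code from the patch series): the
flag value, all of the ex_* names, and the callback standing in for the
FIEMAP ioctl below are made up for illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#ifndef EBADR
#define EBADR 53                    /* Linux value, for non-Linux builds */
#endif

#define EX_FLAG_LUN_ORDER 0x1u      /* hypothetical flag value */

/* Stand-in for the ioctl call: returns 0 or a negative errno. */
typedef int (*ex_map_fn)(uint32_t flags, void *ctx);

/* Single-LUN filesystem: silently ignores LUN_ORDER. */
int ex_fs_single_lun(uint32_t flags, void *ctx)
{
    (void)flags; (void)ctx;
    return 0;
}

/* Multi-device filesystem that chose not to implement LUN_ORDER. */
int ex_fs_no_lun_order(uint32_t flags, void *ctx)
{
    (void)ctx;
    return (flags & EX_FLAG_LUN_ORDER) ? -EBADR : 0;
}

/*
 * Try with all wanted flags; on -EBADR drop the optional ones and retry
 * once.  *used reports the flags that the successful call was made with.
 */
int ex_map_with_fallback(ex_map_fn map, void *ctx, uint32_t wanted,
                         uint32_t optional, uint32_t *used)
{
    assert(map != NULL);
    int rc = map(wanted, ctx);
    if (rc == -EBADR && (wanted & optional)) {
        wanted &= ~optional;
        rc = map(wanted, ctx);
    }
    if (rc == 0 && used != NULL)
        *used = wanted;
    return rc;
}
```

A tool like filefrag would pass LUN_ORDER as both "wanted" and
"optional", then check "used" to know which ordering it actually got.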

For Lustre, it is completely inefficient to return data in non-LUN_ORDER,
because it is doing RAID-0 striping of the file data across data servers.
A 100MB 2-stripe file with 1MB stripes would have to return 100 extents,
even if the file data is allocated contiguously on disk in the backing
filesystems in two 50MB chunks.  With LUN_ORDER it will return 2 extents
and the user can see much more clearly that the file is laid out well.
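
The arithmetic behind those numbers, as a small sketch.  It assumes the
best case described above (each server's share of the file is one
contiguous allocation); the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of stripe-sized chunks in the file, rounding up. */
static uint64_t ex_chunks(uint64_t file_size, uint64_t stripe_size)
{
    assert(stripe_size > 0);
    return (file_size + stripe_size - 1) / stripe_size;
}

/*
 * Logical-offset order: with RAID-0 striping over more than one server,
 * every chunk boundary switches device, so each chunk is its own extent.
 */
uint64_t ex_extents_logical_order(uint64_t file_size, uint64_t stripe_size,
                                  uint32_t stripe_count)
{
    assert(stripe_count > 0);
    uint64_t chunks = ex_chunks(file_size, stripe_size);
    if (chunks == 0)
        return 0;
    return stripe_count == 1 ? 1 : chunks;
}

/* LUN order: one extent per server that actually holds data. */
uint64_t ex_extents_lun_order(uint64_t file_size, uint64_t stripe_size,
                              uint32_t stripe_count)
{
    assert(stripe_count > 0);
    uint64_t chunks = ex_chunks(file_size, stripe_size);
    return chunks < stripe_count ? chunks : stripe_count;
}
```

For the 100MB file with 1MB stripes over 2 servers this gives 100
extents in logical order and 2 in LUN order.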

> > Christoph Hellwig wrote:
> > >>	__u32	fe_lun;	   /* logical device number for extent (starting at 0)*/
> > >
> > > Again this lun thing is horribly ill-defined.  There is no such thing
> > > as a logic device number in our filesystem terminology.
> >
> > I agree that LUN is confusing.  In my opinion the words "logical"
> > and "number" are overused and meaningless.  As Brad suggested,
> > "device" would be preferable, or "unit", but unfortunately every
> > word I can think of has some other definition too :)

Calling it "device" instead of "LUN" for this is fine...

> > Christoph Hellwig wrote:
> > > Well, we could add a dev field that contains the dev_t for the
> > > underlying block device.  That would work for the current XFS realtime
> > > device as well as for my work to map different XFS AGs to different
> > > devices.  It wouldn't work for btrfs with integrated raid code where
> > > a single extent can span multiple underlying devices, the same probably
> > > applies to pnfs.

... but I don't think it should necessarily be _required_ to return a
real "dev_t" (major, minor) device.  For network filesystems this is
meaningless.  If it is possible for FIEMAP_EXTENT_NET to signal that the
device is not a local/physical device (where a dev_t has no meaning),
and simply allow an enumeration [0, 1, 2, ...] of the logical devices
then I think this is reasonable.  The mapping of logical devices to
servers is available separately with a Lustre-specific ioctl.

This passes more information for filesystems that have local devices
while not breaking the functionality for network filesystems and could
be used as an efficient replacement for lilo's use of FIBMAP.

> > Chris Mason wrote:
> > > For btrfs I would return the logical extents via fiemap (just like the
> > > file were on lvm) and make btrfs specific ioctls for details about where
> > > the file actually lived.
> > >
> > > fiemap alone isn't a great way to describe raid levels or complex storage
> > > topologies.  To include physical information I would also have to encode
> > > the raid level used and information about all the devices the data is
> > > replicated on (raid1/10)
> >
> > fiemap by itself is useful for programs that want to determine
> > how fragmented a file is or where sparse areas are to skip.

For RAID1/10 you can return multiple logical->physical extent mappings
for the same logical range of the file with different "device" IDs.  You
could do the same for RAID5 returning each of the data and parity chunks
with "NO_DIRECT" if desired (maybe only on the parity extent, or don't
return the parity extent at all).  The spec does not require that the
returned extents be non-overlapping.
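
To illustrate what a consumer would see for a 2-way mirror, here is a
sketch with a made-up extent record (struct ex_extent is NOT the
proposed struct fiemap_extent; its field names and the sample values
are assumptions for illustration).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified extent record. */
struct ex_extent {
    uint64_t logical;   /* byte offset in the file */
    uint64_t physical;  /* byte offset on the device */
    uint64_t length;    /* length in bytes */
    uint32_t device;    /* device index */
    uint32_t flags;
};

/* RAID1 sample: the same logical range is returned once per mirror. */
static const struct ex_extent ex_raid1_sample[2] = {
    { 0, 1048576, 1048576, 0, 0 },
    { 0, 8388608, 1048576, 1, 0 },
};

/* Count how many returned extents cover a given logical offset. */
size_t ex_copies_at(const struct ex_extent *ext, size_t n, uint64_t offset)
{
    assert(ext != NULL || n == 0);
    size_t copies = 0;
    for (size_t i = 0; i < n; i++) {
        if (offset >= ext[i].logical &&
            offset - ext[i].logical < ext[i].length)
            copies++;
    }
    return copies;
}
```

A caller that assumes non-overlapping extents would double-count the
file size here, so tools need to be prepared for more than one extent
per logical offset.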

In fact Mark, Eric, and I were discussing the ability to request mappings
for metadata blocks in addition to the data blocks.  The metadata blocks
would also overlap the data blocks (with FLAG_METADATA set in the
metadata extent) so that it is possible to return to the client (if
requested) the inode block with [0-EOF] mapping, indirect blocks with
their corresponding data mappings, and the file data blocks.

This came up in the context of ext4 trying to visualize different
metadata placement algorithms and would be very useful information.
It might also be useful for filesystem defragmentation utilities.

> Yes, and since it has no concurrency semantics, use outside of that quickly 
> gets difficult.  fibmap is used by lilo, and reiserfs needs a special ioctl 
> that said i've-called-fibmap-please-don't-move-these-bytes that prevented 
> tail packing.

Wasn't that turned into an ext3-like SETFLAGS ioctl for "NOTAIL" on
the inode?

My point of view is that FIEMAP is a file layout visualization API that
could also be used in certain cases for direct data access.  Since any
direct access of data returned by FIEMAP is inherently racy (as is
FIBMAP), I'm less concerned with the mappings being fully consistent,
and more concerned with providing the maximum amount of information.

Any application using FIEMAP for direct data access (e.g. dump of
some kind) either has to guard against races itself by verifying the
mapping again afterward, or, for uses like lilo, trust that the admin
is doing the right thing.  That isn't a new issue with FIEMAP vs FIBMAP.
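
A sketch of the "verify the mapping again afterward" check, assuming
the application keeps the extent list it read before the direct access
and fetches a second one after it.  The struct and function names are
hypothetical, and this only detects a change between the two calls; it
does not prevent one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified extent record. */
struct ex_extent {
    uint64_t logical;
    uint64_t physical;
    uint64_t length;
    uint32_t device;
    uint32_t flags;
};

/*
 * Return 1 if the two mappings are identical, 0 otherwise.  On 0 the
 * caller should discard whatever it read directly and start over.
 */
int ex_mapping_unchanged(const struct ex_extent *before, size_t nbefore,
                         const struct ex_extent *after, size_t nafter)
{
    assert(before != NULL || nbefore == 0);
    assert(after != NULL || nafter == 0);
    if (nbefore != nafter)
        return 0;
    for (size_t i = 0; i < nbefore; i++) {
        if (before[i].logical  != after[i].logical  ||
            before[i].physical != after[i].physical ||
            before[i].length   != after[i].length   ||
            before[i].device   != after[i].device   ||
            before[i].flags    != after[i].flags)
            return 0;
    }
    return 1;
}
```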

> > A final thought on this:
> > > 	__u32	fe_lun;	   /* logical device number for extent (starting at 0)*/
> >
> > While the flags field can be used to tell the validity of this
> > number, we found that starting at 0 was not a good practice.
> > We started at 1 so 0 was always a not-valid.  One way this can
> > be useful is if you have delayed allocation, you can indicate
> > "intended device" with a non-0 number.  Of course other values
> > such as max_int could be termed "invalid" instead.
> 
> I use 0 as not-valid as well.  The original intent was 0 meant 
> logical-block-number, signaling additional lookups were needed.  But I 
> haven't found a good use case for that yet.

I would prefer that the fe_lun (or fe_device as is now preferred)
be at least somewhat implementation-specific.  For local filesystems,
returning the device number seems reasonable and would mean that "0"
is not a valid return value, but I'd prefer to allow this to be an index
number for Lustre or other non-local filesystems, and in that case
"0" be a valid device index number.  Since there are already flags for
unallocated and unknown extents, I don't think we should depend on
fe_device == 0 to have a special meaning for network filesystems.

> > Another point to document is whether this number is a contiguous
> > series (1, 2, 3,... N) defining the location based on the current
> > device list or is possibly a sparse (1, 2, 6) series because the
> > FS tracks devices that have been removed.  In our implementation
> > both views were present for different consumers.  The sparse
> > series was native and the contiguous series a translation.
> 
> Interesting, I've been presenting the sparse representation only.

For Lustre, the fe_lun/fe_device returned for a file extent would
indicate the base-0 server index on which each of the file fragments
resides.  A given file will normally be striped over a subset of the
data servers, so it would be normal to get extents returned that
are a sparse subset of all available data servers.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
