On May 28, 2008 12:33 -0400, Chris Mason wrote:
> On Wednesday 28 May 2008, Andreas Dilger wrote:
> > For Lustre, it is completely inefficient to return data in non-LUN_ORDER,
> > because it is doing RAID-0 striping of the file data across data servers.
> > A 100MB 2-stripe file with 1MB stripes would have to return 100 extents,
> > even if the file data is allocated contiguously on disk in the backing
> > filesystems in two 50MB chunks.  With LUN_ORDER it will return 2 extents
> > and the user can see much more clearly that the file is laid out well.
>
> Ah, so lustre doesn't have a logical address layer at all?  In my case the
> files contain pointers to contiguous logical extents and the lower layers
> of the FS figure out whether that is raid0/1/10 or whatever future crud I
> toss in.
>
> If the logical extents are contiguous it is safe to assume the lower end
> is also contiguous.

Well, Lustre has a logical address layer on a per-file basis, but the
layout maps from the file offsets to multiple object offsets.  There is
no "flat" logical device in the background which file allocations come
from, because the API provided to the client is based only on objects
and offsets, and there may be multiple objects that map into a single
file via some striping.  That is currently RAID-0 across objects, but it
might be RAID-1/5/6 or something else in the future.  With the RAID-0
layout, the logical file offsets round-robin across the multiple objects
with a certain stripe size (default 1MB).

It sounds like you actually have the same setup with btrfs (if it is at
all like ZFS): file blocks map onto multiple disks, and there may be
multiple copies of the data (RAID-1/10).

What a user/administrator really cares about in the end is whether the
files are allocated contiguously within the objects on the server
filesystems.
If we were to run filefrag (with FIEMAP support) on a Lustre file
without LUN_ORDER, or maybe a RAID-5 btrfs file, it would return a list
of extents, each broken up at smaller boundaries, and it would convey
the wrong idea of how the file is laid out physically.  If we run FIEMAP
with LUN_ORDER, we instead get the larger (hopefully) extents that are
actually contiguously allocated in the backing filesystems.  Since this
is a network object-based filesystem, we don't really care about the
_actual_ file offset->device block number mapping as much as the overall
picture of file fragmentation and layout.

> > My point of view is that FIEMAP is a file layout visualization API that
> > could also be used in certain cases for direct data access.  Since any
> > direct access of data returned by FIEMAP is inherently racy (as is
> > FIBMAP), I'm less concerned with the mappings being fully consistent,
> > and more concerned with providing the maximum amount of information.
> >
> > Any application using FIEMAP for direct data access (e.g. dump of
> > some kind) either has to guard against races itself by verifying the
> > mapping again afterward, or for uses like lilo trust that the admin
> > is doing the right thing.  That isn't a new issue with FIEMAP vs. FIBMAP.
>
> So, I'm a big fan of better layout visualization and creating APIs to
> improve it.  At some point we need to take a step back and ask if those
> APIs are better left to other tools instead of heaping them all into
> fiemap.
>
> The advantage of dropping the lun support from fiemap and pushing it into
> a new ioctl/syscall is that we can determine the underlying storage
> topology for any logical block on the device, including those underneath
> md/dm, without worrying about a backing file.

Argh, that would make FIEMAP basically unsuitable for a multi-device
filesystem like Lustre and pNFS and the future direction of XFS (I
think), and btrfs IMHO, or even ZFS.
There just isn't a single address space that files can be mapped to.

A major reason I proposed FIEMAP to linux-fsdevel in the first place,
instead of just keeping it internal to Lustre and maybe ext4, is that it
is a generally useful interface for efficiently determining file layout
information, and isn't tied to block devices like FIBMAP is.  It is
useful for many different reasons, like allowing cp/tar to skip holes
(not using the physical offset information, just the logical extents and
flags like UNWRITTEN) to avoid reading empty parts of the file, layout
visualization, maybe defrag, etc.  Dropping lun/device support, and
removing all of the flexibility of the FIEMAP interface design, is IMHO
killing the whole reason I proposed FIEMAP in the first place.

> And then we can get interesting information about stripe widths,
> preferred IO sizes etc etc.

I agree that this part is somewhat orthogonal.  In the vast majority of
cases the stripe width, stripe count, IO size, etc. can be encapsulated
into a small number of parameters and do not need to be specified on a
per-block or per-extent basis.  The actual layout of the file (returned
by FIEMAP) is a natural consequence of these parameters, as they (may)
influence the filesystem in making allocation decisions.

Returning the metadata layout as part of FIEMAP makes sense to me,
because it boils down to logical->{physical,device} ranges in the end.
Returning the generic file layout parameters doesn't make sense, on the
other hand.  Lustre (and in fact several of the HPC filesystem vendors)
would like to come up with a common API (virtual xattr?) to be able to
extract and restore the generic layout information, but that is for a
separate email.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.