On May 28, 2008 12:33 -0400, Chris Mason wrote:
> On Wednesday 28 May 2008, Andreas Dilger wrote:
> > For Lustre, it is completely inefficient to return data in non-LUN_ORDER,
> > because it is doing RAID-0 striping of the file data across data servers.
> > A 100MB 2-stripe file with 1MB stripes would have to return 100 extents,
> > even if the file data is allocated contiguously on disk in the backing
> > filesystems in two 50MB chunks.  With LUN_ORDER it will return 2 extents
> > and the user can see much more clearly that the file is laid out well.
>
> Ah, so lustre doesn't have a logical address layer at all?  In my case the
> files contain pointers to contiguous logical extents and the lower layers
> of the FS figure out whether that is raid0/1/10 or whatever future crud I
> toss in.
>
> If the logical extents are contiguous it is safe to assume the lower end
> is also contiguous.

Well, Lustre has a logical address layer on a per-file basis, but the
layout maps from the file offsets to multiple object offsets.  There is
no "flat" logical device in the background which file allocations come
from, because the API provided to the client is based only on objects
and offsets, and there may be multiple objects that map into a single
file via some striping.  That is currently RAID-0 across objects, but it
might be RAID-1/5/6 or something else in the future.  With the RAID-0
layout, the logical file offsets round-robin across the multiple objects
with a certain stripe size (default 1MB).

It sounds like you actually have the same setup with btrfs (if it is at
all like ZFS): file blocks map onto multiple disks, and there may be
multiple copies of the data (RAID-1/10).

What a user/administrator really cares about in the end is whether the
files are allocated contiguously within the objects on the server
filesystems.
If we were to run filefrag (with FIEMAP support) on a Lustre file
without LUN_ORDER, or maybe a RAID-5 btrfs file, it would return a list
of extents, each broken up at smaller boundaries, and it would convey
the wrong idea of how the file is laid out physically.  If we run FIEMAP
with LUN_ORDER, we instead get the larger (hopefully) extents that are
actually contiguously allocated in the backing filesystems.  Since this
is a network object-based filesystem, we don't really care about the
_actual_ file offset->device block number mapping as much as the overall
picture of file fragmentation and layout.

> > My point of view is that FIEMAP is a file layout visualization API that
> > could also be used in certain cases for direct data access.  Since any
> > direct access of data returned by FIEMAP is inherently racy (as is
> > FIBMAP), I'm less concerned with the mappings being fully consistent,
> > and more concerned with providing the maximum amount of information.
> >
> > Any application using FIEMAP for direct data access (e.g. dump of
> > some kind) either has to guard against races itself by verifying the
> > mapping again afterward, or for uses like lilo trust that the admin
> > is doing the right thing.  That isn't a new issue with FIEMAP vs. FIBMAP.
>
> So, I'm a big fan of better layout visualization and creating APIs to
> improve it.  At some point we need to take a step back and ask if those
> APIs are better left to other tools instead of heaping them all into
> fiemap.
>
> The advantage of dropping the lun support from fiemap and pushing it into
> a new ioctl/syscall is that we can determine the underlying storage
> topology for any logical block on the device, including those underneath
> md/dm, without worrying about a backing file.

Argh, that would make FIEMAP basically unsuitable for a multi-device
filesystem like Lustre and pNFS and the future direction of XFS (I
think), and btrfs IMHO, or even ZFS.
There just isn't a single address space that files can be mapped to.

A major reason I proposed FIEMAP to linux-fsdevel in the first place,
instead of just keeping it internal to Lustre and maybe ext4, is that it
is a generally useful interface for efficiently determining file layout
information, and isn't tied to block devices like FIBMAP is.  It is
useful for many different reasons, like allowing cp/tar to skip holes
(not using the physical offset information, just the logical extents and
flags like UNWRITTEN) to avoid reading empty parts of the file, layout
visualization, maybe defrag, etc.  Dropping lun/device support, and
removing all of the flexibility of the FIEMAP interface design, is IMHO
killing the whole reason I proposed FIEMAP in the first place.

> And then we can get interesting information about stripe widths,
> preferred IO sizes etc etc.

I agree that this part is somewhat orthogonal.  In the vast majority of
cases the stripe width, stripe count, IO size, etc. can be encapsulated
into a small number of parameters and do not need to be specified on a
per-block or per-extent basis.  The actual layout of the file (returned
by FIEMAP) is a natural consequence of these parameters, as they (may)
influence the filesystem in making allocation decisions.

Returning the metadata layout as part of FIEMAP makes sense to me,
because it boils down to logical->{physical,device} ranges in the end.
Returning the generic file layout parameters doesn't make sense, on the
other hand.  Lustre (and in fact several of the HPC filesystem vendors)
would like to come up with a common API (virtual xattr?) to be able to
extract and restore the generic layout information, but that is for a
separate email.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.