Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

Hi Andreas,

On 13 Apr 2007, at 05:01, Andreas Dilger wrote:
On Apr 12, 2007  12:22 +0100, Anton Altaparmakov wrote:
On 12 Apr 2007, at 12:05, Andreas Dilger wrote:
I'm interested in getting input for implementing an ioctl to
efficiently map file extents & holes (FIEMAP) instead of looping
over FIBMAP a billion times.  We already have customers with single
files in the 10TB range and we additionally need to get the mapping
over the network so it needs to be efficient in terms of how data
is passed, and how easily it can be extracted from the filesystem.

struct fibmap_extent {
	__u64 fe_start;			/* starting offset in bytes */
	__u64 fe_len;			/* length in bytes */
};

struct fibmap {
	struct fibmap_extent fm_start;	/* offset, length of desired mapping */
	__u32 fm_extent_count;		/* number of extents in array */
	__u32 fm_flags;			/* flags for input request */
					/* [...] XFS_IOC_GETBMAP) */
	__u64 unused;
	struct fibmap_extent fm_extents[0];
};

#define FIEMAP_LEN_MASK		0xff000000000000
#define FIEMAP_LEN_HOLE     	0x01000000000000
#define FIEMAP_LEN_UNWRITTEN	0x02000000000000
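
As a rough illustration of how a caller might size a request for the proposed structure, here is a minimal user-space sketch. It mirrors the field names from the proposal but uses <stdint.h> types and a C99 flexible array member in place of __u64 and fm_extents[0]; the helper name fibmap_alloc() is hypothetical and no ioctl is issued, since the interface is still under discussion.

```c
#include <stdint.h>
#include <stdlib.h>

/* User-space mirror of the proposed structures (layout is still an RFC). */
struct fibmap_extent {
	uint64_t fe_start;	/* starting offset in bytes */
	uint64_t fe_len;	/* length in bytes (plus flag bits) */
};

struct fibmap {
	struct fibmap_extent fm_start;	/* offset, length of desired mapping */
	uint32_t fm_extent_count;	/* number of extents in array */
	uint32_t fm_flags;		/* flags for input request */
	uint64_t unused;
	struct fibmap_extent fm_extents[];	/* flexible array member */
};

/*
 * Hypothetical helper: allocate a zeroed request with room for `count`
 * returned extents, covering [start, start + len). Returns NULL on
 * allocation failure or if the size computation would overflow.
 */
static struct fibmap *fibmap_alloc(uint64_t start, uint64_t len, uint32_t count)
{
	struct fibmap *fm;

	if (count > (SIZE_MAX - sizeof(*fm)) / sizeof(struct fibmap_extent))
		return NULL;

	fm = calloc(1, sizeof(*fm) + (size_t)count * sizeof(struct fibmap_extent));
	if (!fm)
		return NULL;

	fm->fm_start.fe_start = start;
	fm->fm_start.fe_len = len;
	fm->fm_extent_count = count;
	return fm;
}
```

A caller would free() the buffer once the returned extents have been consumed.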

Sound good but I would add:

#define FIEMAP_LEN_NO_DIRECT_ACCESS

This would say that the offset on disk can move at any time or that
the data is compressed or encrypted on disk thus the data is not
useful for direct disk access.

This makes sense. Even for Reiserfs the same is true with packed tails,
and I believe if FIBMAP is called on a tail it will migrate the tail into
a block because this might be a sign that the file is a kernel that
LILO wants to boot.

I'd rather not have any such feature in FIEMAP, and just return the
on-disk allocation for the file, so NO_DIRECT_ACCESS is fine with me.
My main reason for FIEMAP is being able to investigate allocation patterns
of files.

By no means is my flag list exhaustive, just the ones that I thought would
be needed to implement this for ext4 and Lustre.

Sure, hence why I made my comment for NTFS. (-: And yes, ReiserFS and even ext* could use such a flag. I believe there is a compression patch for ext somewhere, isn't there? (Or at least there was one at some point I think...)

Also why are you not using 0xff00000000000000, i.e. two more zeroes
at the end?  Seems unnecessary to drop an extra 8 bits of
significance from the byte size...

It was actually just a typo (this was the first time I'd written the
structs and flags down, it is just at the discussion stage). I'd meant for it to be 2^56 bytes for the file size as I wrote later in the email.

Ok.  (-:

That said, I think that 2^48 bytes is probably sufficient for most uses, so that we get 16 bits for flags. As it is this email already discusses
5 flags, and that would give little room for expansion in the future.
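
To make the 48/16 split concrete, here is a small sketch of packing and unpacking a length and flags in one 64-bit fe_len value. The macro and function names and the two flag values are assumptions made up for this illustration; the thread has not fixed any of them.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: low 48 bits hold the length in bytes, high 16 bits hold flags. */
#define FE_LEN_BITS	48
#define FE_LEN_MASK	((UINT64_C(1) << FE_LEN_BITS) - 1)

/* Hypothetical flag values, chosen only for this sketch. */
#define FE_FLAG_HOLE		0x0001u
#define FE_FLAG_UNWRITTEN	0x0002u

/* Combine a byte length (must fit in 48 bits) with 16 bits of flags. */
static inline uint64_t fe_len_pack(uint64_t len_bytes, uint16_t flags)
{
	assert(len_bytes <= FE_LEN_MASK);
	return ((uint64_t)flags << FE_LEN_BITS) | len_bytes;
}

/* Extract the byte length, discarding the flag bits. */
static inline uint64_t fe_len_bytes(uint64_t fe_len)
{
	return fe_len & FE_LEN_MASK;
}

/* Extract the 16 flag bits. */
static inline uint16_t fe_len_flags(uint64_t fe_len)
{
	return (uint16_t)(fe_len >> FE_LEN_BITS);
}
```

With this layout a single extent record can describe at most 2^48 - 1 bytes, which is why longer contiguous runs would have to be reported as several records.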

Remember, this is the mapping for a single file (which can't practically
be beyond 2^64 bytes as yet) so it wouldn't be hard for the filesystem to
return a few separate extents which are actually contiguous (assuming that
there will actually be files in filesystems with > 2^48 bytes of contiguous
space). Since the API is that it will return the extent that contains the
requested "start" byte, the kernel will be able to detect this case also,
since it won't be able to specify a length for the extent that contains the
start byte.

Valid point. As long as the "on-disk location" is maintained as full 64 bits then you are right we could just return multiple extents if the space does not fit. A bit of a kludge but it would certainly work. An alternative would be to have the flags in a separate field but that would add 8 bytes to the structure size if you want to maintain 8-byte alignment so that would not be great...

At most we'd have to call the ioctl() 65536 times for a completely
contiguous 2^64 byte file if the buffer was only large enough for a
single extent. In reality, I expect any file to have some discontinuities and the buffer to be large enough for a thousand or more entries so the
corner case is not very bad.
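
The worst case above can be checked with a little arithmetic. The sketch below assumes a 48-bit length field, so one record covers at most 2^48 - 1 bytes; the names FE_MAX_LEN, fe_records_needed() and fe_next_piece() are hypothetical. Because the cap is 2^48 - 1 rather than 2^48, a maximal file needs one record more than 2^64 / 2^48 = 65536, which is still in line with the estimate.

```c
#include <stdint.h>

/* Largest length a 48-bit field can hold (assumed layout, not final). */
#define FE_MAX_LEN	((UINT64_C(1) << 48) - 1)

/* Number of extent records needed to describe `len` contiguous bytes. */
static uint64_t fe_records_needed(uint64_t len)
{
	return len / FE_MAX_LEN + (len % FE_MAX_LEN != 0);
}

/* Length of the next record when `remaining` contiguous bytes are left. */
static uint64_t fe_next_piece(uint64_t remaining)
{
	return remaining < FE_MAX_LEN ? remaining : FE_MAX_LEN;
}
```

A filesystem splitting a long run would emit fe_next_piece() bytes per record, advancing the on-disk start by the same amount each time, until nothing remains.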

Finally please make sure that the file system can return in one way
or another errors for example when it fails to determine the extents
because the system ran out of memory, there was an i/o error,
whatever...  It may even be useful to be able to say "here is an
extent of size X bytes but we do not know where it is on disk because
there was an error determining this particular extent's on-disk
location for some reason or other"...

Yes, that makes sense also, something like FIEMAP_LEN_UNKNOWN, and
FIEMAP_LEN_ERROR.  Consider FIEMAP on a file that was migrated
to tape and currently has no blocks allocated in the filesystem.  We
want to return some indication that there is actual file data and not
just a hole, but at the same time we don't want this to actually return
the file from tape just to generate block mappings for it.

Yes, NTFS also has offline storage (DFS - the Distributed File System I think it is called) but we don't support any of that. Perhaps one day...

This concept is also present in XFS_IOC_GETBMAPX - BMV_IF_NO_DMAPI_READ,
but this needs to be specified on input to prevent the file being mapped
and I'd rather the opposite (not getting the file from tape) be the default,
by principle of least surprise.

block-aligned/sized allocations (e.g. tail packing).  The fm_extents array
returned contains the packed list of allocation extents for the file,
including entries for holes (which have fe_start == 0, and a flag).

Why the fe_start == 0?  Surely just the flag is sufficient...  On
NTFS it is perfectly valid to have fe_start == 0 and to have that not
be sparse (normally the $Boot system file is stored in the first 8
sectors of the volume)...

I thought fe_start = 0 was pretty standard for a hole.  It should be
something and I'd rather 0 than anything else. The _HOLE flag is enough
as you say though.

It is standard on Unix. I am trying to fight this standard because of NTFS... On NTFS a hole is -1 not 0 and zero is a valid block. But on NTFS device locations are "s64" not "u64" so the -1 is logical to use...

As long as it is made clear that people MUST check the flag when fe_start == 0 rather than assume that fe_start == 0 means a hole I am happy with that. Hopefully not too many programmers will be lazy gits who will ignore this and just check fe_start == 0 or they will fail on NTFS and assume $Boot is sparse when it is not...
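
To spell out the difference, here is a minimal sketch of a hole test that looks only at the flag and never at fe_start. It uses the provisional FIEMAP_LEN_HOLE value as posted earlier in this thread (which may well change), <stdint.h> types in place of __u64, and a hypothetical helper name fe_is_hole().

```c
#include <stdbool.h>
#include <stdint.h>

/* User-space mirror of the proposed extent record. */
struct fibmap_extent {
	uint64_t fe_start;	/* starting offset in bytes */
	uint64_t fe_len;	/* length in bytes, with flag bits on top */
};

/* Provisional value from the proposal; treat it as an assumption. */
#define FIEMAP_LEN_HOLE	0x01000000000000ULL

/*
 * An extent is a hole only if the flag says so. fe_start == 0 is a valid
 * on-disk location (e.g. $Boot on NTFS), so it must not be used as a test.
 */
static bool fe_is_hole(const struct fibmap_extent *fe)
{
	return (fe->fe_len & FIEMAP_LEN_HOLE) != 0;
}
```

An extent with fe_start == 0 and no flag set is therefore reported as allocated data at the start of the device, not as sparse.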

PS - I'd thought about adding you to the CC list for this, because I know
     you've had opinions on FIBMAP in the past, but I didn't have
     your email handy and it was late, and I know you saw the NTFS kmap
     patch on fsdevel so I figured you would see this too...

Thanks. Yes, I try to follow fsdevel closely and LKML not so closely (I often read it with "select all new, delete")...

     Thanks for your input.

You are welcome.

Best regards,

	Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/

