Re: [PATCH 1/5] vfs: vfs-level fiemap interface

Mark Fasheh <mfasheh@xxxxxxxx> · Tue, 27 May 2008 11:31:17 -0700

On Sun, May 25, 2008 at 01:28:16AM -0600, Andreas Dilger wrote:
> On May 24, 2008  17:01 -0700, Mark Fasheh wrote:
> > Basic vfs-level fiemap infrastructure, which sets up a new ->fiemap
> > inode operation.
> 
> Mark, this appears to be missing most of my previous comments.  Are
> you getting my emails, or is there still a problem?

No, I got your mail.. "missing most of" is an exaggeration though, isn't it?
I just missed one or two things  ;)

> > +++ b/Documentation/filesystems/fiemap.txt
> > +A fiemap request is encoded within struct fiemap:
> > +
> > +struct fiemap {
> > +	__u64	fm_start;	 /* logical offset (inclusive) at
> > +				  * which to start mapping (in) */
> > +	__u64	fm_length;	 /* logical length of mapping which
> > +				  * userspace cares about (in) */
> > +	__u32	fm_flags;	 /* FIEMAP_FLAG_* flags for request (in) */
> 
> This should be marked (in/out) because of error return, documented below.

Ahh, yeah I updated it in the header, but forgot about the document.

> > +* FIEMAP_EXTENT_UNKNOWN
> > +The location of this extent is currently unknown. This may indicate
> > +the data is stored on an inaccessible volume or that no storage has
> > +been allocated for the file yet.
> > +
> > +* FIEMAP_EXTENT_SECONDARY
> > +  - This will also set FIEMAP_EXTENT_UNKNOWN.
> > +The data for this extent is in secondary storage.
> 
> This is not correct that it will always set FIEMAP_EXTENT_UNKNOWN here.
> It is possible (common) for data to be stored on both primary AND
> secondary storage at the same time.  It is still desirable to mark the
> data as FIEMAP_EXTENT_SECONDARY even if it is still resident in the
> filesystem, because this informs the user that the HSM/backup system
> has made a copy of this file already.

Ok. I think I need to modify fiemap_fill_next_extent() slightly for that
too.

> > +For each extent in the request range, the file system should call
> > +the helper function, fiemap_fill_next_extent():
> > +
> > +int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
> > +			    u64 phys, u64 len, u32 flags, u32 lun);
> > +
> > +fiemap_fill_next_extent() will use the passed values to populate the
> > +next free extent in the fm_extents array. 'General' extent flags will
> > +automatically be set from specific flags on behalf of the calling file
> > +system so that the userspace API is not broken.
> > +
> > +fiemap_fill_next_extent() returns 0 on success, and 1 when the
> > +user-supplied fm_extents array is full. If an error is encountered
> > +while copying the extent to user memory, -EFAULT will be returned.
> > +
> > +If the request has the FIEMAP_FLAG_NUM_EXTENTS flag set, then calling
> > +this helper is not necessary and fi_extents_mapped can be set
> > +directly.
> 
> It is worthwhile to mention here that if FIEMAP_FLAG_NUM_EXTENTS IS set,
> it is also OK for the filesystem to call fiemap_fill_next_extent(), and
> only the number of extents will be counted, avoiding the need to special-
> case this flag in the filesystem implementation.

Right - good idea.

> What about the idea to have fiemap_fill_next_extent() do "extent" merging
> for filesystems that use the generic helper but do not return multiple
> blocks via get_blocks()?  I don't think that is too hard to implement,
> and makes the output much more useful, otherwise we get an extent per block.
> The above is what I _think_ will work, haven't actually tried it out.

I don't think we want to automatically merge extents within this helper
function. Otherwise we would diverge from the actual disk layout for extent
based file systems where an extent might be broken up between two records
for some other reason, such as maximum extent length being exceeded.

It probably makes more sense to encode this merging in the block-based
function instead since it's pretty specific to those.

Btw, how many block-based file systems that don't return multiple blocks via
get_blocks() are there that we actually care about enough to write this
code?

> > +static int ioctl_fiemap(struct file *filp, unsigned long arg)
> > +{
> > +	struct fiemap fiemap;
> > +	u64 len;
> > +	u32 incompat_flags;
> > +	struct fiemap_extent_info fieinfo = {0, };
> > +	struct inode *inode = filp->f_path.dentry->d_inode;
> > +	struct super_block *sb = inode->i_sb;
> > +	int error = 0;
> > +
> > +	if (!inode->i_op->fiemap)
> > +		return -EOPNOTSUPP;
> > +
> > +	if (copy_from_user(&fiemap, (struct fiemap __user *) arg,
> > +			   sizeof(struct fiemap)))
> > +		return -EFAULT;
> > +
> > +	/*
> > +	 * The VFS does basic sanity checks on the flag
> > +	 * value. Individual file systems can also reject otherwise
> > +	 * 'valid' flags by returning -EBADR from ->fiemap.
> > +	 */
> > +	incompat_flags = fiemap.fm_flags & ~FIEMAP_FLAGS_COMPAT;
> > +	if (incompat_flags)
> > +		goto out_bad_flags;
> 
> Please remove this, and leave the checking for the individual filesystems.
> They all need to do it anyways, because they can't depend on the check here
> never changing or always being updated when this check does.

I hope that when we add new feature flags to fiemap, we take the time to
do it right and add it to FIEMAP_FLAGS_COMPAT, which is defined in the same
header for that reason. Likewise, we can't just add flags in any case
without considering how they affect existing users of ->fiemap. If we add a
flag that everyone can implement for example, we'll have to touch all file
systems anyway.

> At best it is redundant with the check in the filesystem, at worst it
> will lead to bugs in the filesystem-specific code due to inconsistent
> checks.

I makes sense to me that the VFS should verify that flag values passed
are valid in that they've been defined by the kernel <-> user interface. The
file systems can and should still reject flags which they can not support.
So the checks are *designed* to not be consistent, but rather allow the
client file system to whittle down the supported flags further.

Otherwise, it seems to me that client file systems could start to define
their own flags which might eventually stomp on those we'd like to add
later.
	--Mark

--
Mark Fasheh
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html