Re: [PATCH 1/2] [RFC] vfs: 'stat light' fstatat flags

Mark Fasheh <mfasheh@xxxxxxxx> · Tue, 7 Apr 2009 08:29:59 -0700

On Tue, Apr 07, 2009 at 03:23:32AM -0700, Andreas Dilger wrote:
> On Apr 07, 2009  01:00 -0700, Mark Fasheh wrote:
> > This very, very rough patch set adds three flags to fstatat - AT_NO_SIZE,
> > AT_NO_TIMES, and AT_STRICT.
> 
> It seems you and Oleg have two patches crossing in the night...

Heh, it seems so. At least we went from having zero interest in implementing
this to two patches! At any rate, thanks for taking the time to review them.

> > The first two flags (AT_NO_SIZE, AT_NO_TIMES) allow userspace to notify the
> > file system layer that certain stat fields are not required to be accurate.
> 
> The problem with "AT_NO_*" is that old applications which couldn't possibly
> know about or use a new stat field couldn't possibly know not to ask for
> it. 

Well, if we add stat fields, we need new system calls _anyway_ because
struct stat has to undergo a change. At that point, adding a new flag to the
modified fstatat, or just changing the behavior of an existing flag in a
backwards compatible manner should be quite easy in comparison.

> Instead, as was proposed in the last generation of this thread, there
> should be AT_GET_{ATIME,MTIME,CTIME,BLOCKS,SIZE,NLINKS,...}, to return the
> flags that the application actually wants.  

Well, I respectfully disagree. In my opinion, this adds unncessary
complexity to an already targetted case, where only a handful of file
systems will even care to optimize away a couple of fields. In that light, I
really think we want simplicity here. The good news is that we still have a
very good amount of space in the flags field. We could always go back if we
_really_ find the need to micro-optimize to the point of per stat field
flags. Again though, at this point I don't really think starting out with a
slew of flags is a good idea.

> If none of them are specified, then the current behaviour of "get all
> attributes" is kept.

This of course, is how it works today - if none of the new flags are
specified you get the exact old behavior. In fact, most file systems will
wind up just happily ignoring the new flags.

> 
> > Some file systems want this information in order to optimize away expensive
> > operations associated with stat. In particular, NFS can avoid some syncing
> > to the server (if userspace doesn't want atime, ctime or mtime) and Lustre
> > can avoid some expensive locking by avoiding an update of various size
> > fields (st_size, st_blocks).
> 
> Actually, despite what was said today, Lustre doesn't revoke the writing
> client's locks when getting the file size, unlike block-based filesystems.
> That said, it is still relatively a lot of work to query the size for
> a widely-striped file since there may be a hundred servers holding the
> file data and any one of them might have the end-of-file.

Ahh ok, but the point still stands, right? It costs a significant amount of
performance to query each file stripe for the purposes of a size calculation
during a simple 'ls'. Is the behavior as implemented by AT_SIZE sufficient
for Lustre to avoid such a performance loss?

> > AT_STRICT allows userspace to indicate that it wants the most up to date
> > version of a files status, regardless of performance impact. A distributed
> > file system which has a non-coherent inode cache would know then to send a
> > direct query to it's server.
> 
> The issue with AT_STRICT is that applications TODAY are expecting the
> proper file status.  I'd be more inclined to have an AT_LAZY flag for
> applications that do NOT require the most up-to-date information.

The reason I put that in there is because I believe there may be some
file systems which implement the 'lazy' behavior for regular stat so this
seemed a good way to short circuit that caching for those apps that really
care.

That said, I'm not 100% sure either AT_STRICT or AT_LAZY are really required
for a first pass - it just seemed convenient to throw AT_STRICT in there.

> > As noted previously, these patches are really rough. Mostly I'd like to get
> > some feedback on the interface and general direction of implementation. Some
> > glaring issues which we want to resolve:
> > 
> > - This patch set doesn't actually wire up any client file systems yet :)
> > 
> > - There's the question of whether we wire up [fl]stat(2) variants instead of
> >   using fstatat. I went with the former as the implementation was more
> >   straight forward.
> 
> It appears that fstatat(2) fits the need for "statlite" cleanly.  I'd
> thought today that this would require opening each file, but it is
> only necessary to open the parent directory and call fstatat() for
> each filename in the directory.
> 
> > - I'm not sure whether we want to force zeroing of the optional fields, or
> >   return whatever's in the inode (which may be stale, or just junk).
> 
> I think yes, since there have been security bugs filed in rare cases when
> other bits of kernel data are exposed to user space, even if just a byte
> or two.

Yes, good idea. I'll implement that zeroing via the client fs ->setattr
methods in my next pass. By the way, I think that adds to my argument that
we don't really want to micro-optimize the flags yet - now all file systems
will have to switch on a ton of attr flags during their zeroing... Granted
that can be turned into an exported function but hopefully you see my
point.

> > -int vfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
> > +int vfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat,
> > +		int flags)
> >  {
> >  	if (inode->i_op->getattr)
> > -		return inode->i_op->getattr(mnt, dentry, stat);
> > +		return inode->i_op->getattr(mnt, dentry, stat, attr_flags);
> 
> Oleg's version added a second ->getattr_lite() call that passed the flags,
> which avoids changing all of the filesystems, though I guess it isn't a
> huge deal either way.

Yeah, I'll look at how he did it. I just didn't want this to return ENOSYS
or whatnot if a file system doesn't have ->getattr_lite() but there's other
ways around that.
	--Mark

--
Mark Fasheh
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html