Re: readdirplus() as possible POSIX I/O API

On Wed, 2006-12-06 at 03:28 -0700, Andreas Dilger wrote:
> On Dec 05, 2006  10:23 -0500, Trond Myklebust wrote:
> > On Tue, 2006-12-05 at 03:26 -0700, Andreas Dilger wrote:
> > > I think the "barrier semantics" are something that has just crept
> > > into this discussion and is confusing the issue.
> > 
> > It is the _only_ concept that is of interest for something like NFS or
> > CIFS. We already have the ability to cache the information.
> 
> Actually, wouldn't readdirplus() (with a validity flag) be useful for
> NFS, if only to indicate that it does not need to flush the cache
> because the caller doesn't need the ctime/mtime?

That is why statlite() might be useful. I'd prefer something more
generic, though.
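
For concreteness, here is a rough sketch of the statlite() idea as it
appears in the HEC POSIX I/O extension drafts; the field names and mask
bits below are illustrative assumptions, not a settled interface:

#include <sys/types.h>
#include <time.h>

/* HYPOTHETICAL sketch of statlite(); names and layout are assumptions. */

#define SL_SIZE   0x0001   /* st_size requested/valid   (assumed name) */
#define SL_BLOCKS 0x0002   /* st_blocks requested/valid (assumed name) */
#define SL_ATIME  0x0004   /* st_atime requested/valid  (assumed name) */
#define SL_MTIME  0x0008   /* st_mtime requested/valid  (assumed name) */
#define SL_CTIME  0x0010   /* st_ctime requested/valid  (assumed name) */

struct statlite {
	dev_t		st_dev;		/* always valid */
	ino_t		st_ino;		/* always valid */
	mode_t		st_mode;	/* always valid */
	nlink_t		st_nlink;	/* always valid */
	uid_t		st_uid;		/* always valid */
	gid_t		st_gid;		/* always valid */
	unsigned long	st_litemask;	/* in: fields wanted; out: fields valid */
	off_t		st_size;	/* valid only if SL_SIZE is set */
	blkcnt_t	st_blocks;	/* valid only if SL_BLOCKS is set */
	time_t		st_atime;	/* valid only if SL_ATIME is set */
	time_t		st_mtime;	/* valid only if SL_MTIME is set */
	time_t		st_ctime;	/* valid only if SL_CTIME is set */
};

/* The caller sets st_litemask to the "expensive" fields it actually
 * needs; the filesystem may serve everything else from cache without
 * flushing dirty data or revalidating. */
int statlite(const char *path, struct statlite *buf);

The point for something like NFS is that a caller which doesn't ask for
ctime/mtime is implicitly saying the client need not flush dirty data
first.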

> > 'find' should be quite happy with the existing readdir(). It does not
> > need to use stat() or readdirplus() in order to recurse because
> > readdir() provides d_type.
> 
> It does in any but the most simplistic invocations, like "find -mtime"
> or "find -mode" or "find -uid", etc.

The only 'win' a readdirplus() may give you there, as far as NFS is
concerned, is avoiding the sysenter overhead of calling stat() (or
statlite()) many times.
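
For the simple recursion case the point is easy to illustrate; here is a
minimal sketch of a d_type-driven walk, assuming a filesystem that fills
in d_type and falling back to a single lstat() when it doesn't:

#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Walk a tree using only readdir(); lstat() is needed only when the
 * filesystem reports DT_UNKNOWN for an entry. */
static void walk(const char *path)
{
	DIR *dir = opendir(path);
	struct dirent *de;
	char child[4096];

	if (!dir)
		return;
	while ((de = readdir(dir)) != NULL) {
		if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
			continue;
		snprintf(child, sizeof(child), "%s/%s", path, de->d_name);

		int is_dir = (de->d_type == DT_DIR);
		if (de->d_type == DT_UNKNOWN) {	/* fall back to one lstat() */
			struct stat st;
			if (lstat(child, &st) == 0)
				is_dir = S_ISDIR(st.st_mode);
		}
		puts(child);
		if (is_dir)
			walk(child);
	}
	closedir(dir);
}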

> > If the application is able to select a statlite()-type of behaviour with
> > the fadvise() hints, your filesystem could be told to serve up cached
> > information instead of regrabbing locks. In fact that is a much more
> > flexible scheme, since it also allows the filesystem to background the
> > actual inode lookups, or to defer them altogether if that is more
> > efficient.
> 
> I guess I just don't understand how fadvise() on a directory file handle
> (used for readdir()) can be used to affect later stat operations (which
> definitely will NOT be using that file handle)?  If you mean that the
> application should actually open() each file, fadvise(), fstat(), close(),
> instead of just a stat() call then we are WAY into negative improvements
> here due to overhead of doing open+close.

On the contrary, the readdir descriptor is used in all those funky new
fstatat() calls. Ditto for readlinkat() and faccessat().

You could even have openat() turn off the close-to-open GETATTR if the
readdir descriptor contained a hint that told it that was unnecessary.

Furthermore, since the fadvise-like caching operation works on file
descriptors, you could have it apply both to the readdir() descriptor,
for the benefit of the above *at() calls, and to a regular file
descriptor, for the benefit of fstat() and fgetxattr().
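
For illustration, a minimal sketch (using only calls that already exist)
of how the readdir descriptor feeds the *at() family; a per-descriptor
caching hint of the kind described above would attach to 'dfd' and hence
to every one of these lookups:

#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

/* Stat every entry relative to the readdir descriptor itself. */
static int list_with_fstatat(const char *path)
{
	DIR *dir = opendir(path);
	struct dirent *de;
	struct stat st;
	int dfd;

	if (!dir)
		return -1;
	dfd = dirfd(dir);	/* the descriptor the *at() calls reuse */

	while ((de = readdir(dir)) != NULL) {
		if (fstatat(dfd, de->d_name, &st, AT_SYMLINK_NOFOLLOW) == 0)
			printf("%-20s %10lld\n", de->d_name,
			       (long long)st.st_size);
	}
	closedir(dir);
	return 0;
}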

> > > The filesystem can't always do "stat-ahead" on the files because that
> > > requires instantiating an inode on the client which may be stale (lock
> > > revoked) by the time the app gets to it, and the app (and the VFS)  have
> > > no idea just how stale it is, and whether the stat is a "real" stat or
> > > "only" the readdir stat (because the fadvise would only be useful on
> > > the directory, and not all of the child entries), so it would need to
> > > re-stat the file.
> > 
> > Then provide hints that allow the app to select which behaviour it
> > prefers. Most (all?) apps don't _care_, and so would be quite happy with
> > cached information. That is why the current NFS caching model exists in
> > the first place.
> 
> Most clustered filesystems have strong cache semantics, so that isn't
> a problem.  IMHO, the mechanism to pass the hint to the filesystem IS
> the readdirplus_lite() that tells the filesystem exactly which data is
> needed on each directory entry.
> 
> > > Also, this would potentially blow the client's real
> > > working set of inodes out of cache.
> > 
> > Why?
> 
> Because in many cases it is desirable to limit the number of DLM locks
> on a given client (e.g. GFS2 thread with AKPM about clients with
> millions of DLM locks due to lack of memory pressure on large mem systems).
> That means a finite-size lock LRU on the client that risks being wiped
> out by a few thousand files in a directory doing "readdir() + 5000*stat()".
> 
> 
> Consider a system like BlueGene/L with 128k compute cores.  Jobs that
> run on that system will periodically (e.g. every hour) create up to 128K
> checkpoint+restart files to avoid losing a lot of computation if a node
> crashes.  Even if each one of the checkpoints is in a separate directory
> (I wish all users were so nice :-) it means 128K inodes+DLM locks for doing
> an "ls" in the directory.

That is precisely the sort of situation where knowing when you can
cache, and when you cannot, would be a plus. An 'ls' may not need 128K
DLM locks, because it only cares about the state of the inodes as they
were at the start of the opendir() call.
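
Purely as a sketch of how an application might express that: a
posix_fadvise()-style hint on the directory descriptor before the stat
pass. The POSIX_FADV_SNAPSHOT_ATTRS flag below is invented for
illustration; no such flag exists today:

#include <dirent.h>
#include <fcntl.h>
#include <sys/stat.h>

/* HYPOTHETICAL advice value, standing in for the per-descriptor hint
 * discussed above; it does not exist in any current kernel. */
#define POSIX_FADV_SNAPSHOT_ATTRS	100

static void ls_with_snapshot_hint(const char *path)
{
	DIR *dir = opendir(path);
	struct dirent *de;
	struct stat st;
	int dfd;

	if (!dir)
		return;
	dfd = dirfd(dir);

	/* "Attributes as of opendir() time are good enough": the
	 * filesystem need not take per-inode locks or re-issue GETATTRs
	 * for the fstatat() calls that follow. */
	posix_fadvise(dfd, 0, 0, POSIX_FADV_SNAPSHOT_ATTRS);

	while ((de = readdir(dir)) != NULL)
		fstatat(dfd, de->d_name, &st, AT_SYMLINK_NOFOLLOW);

	closedir(dir);
}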

> > > Doing things en-masse with readdirplus() also allows the filesystem to
> > > do the stat() operations in parallel internally (which is a net win if
> > > there are many servers involved) instead of serially as the application
> > > would do.
> > 
> > If your application really cared, it could add threading to 'ls' to
> > achieve the same result. You can also have the filesystem preload that
> > information based on fadvise hints.
> 
> But it would still need 128K RPCs to get that information, and 128K new
> inodes on that client.  And what is the chance that I can get a
> multi-threading "ls" into the upstream GNU ls code?  In the case of local
> filesystems multi-threading ls would be a net loss due to seeking.

NFS doesn't need the 128K RPCs, because it already implements
READDIRPLUS under the covers as far as userland is concerned.
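
(For what it's worth, the threaded stat pass is easy enough to sketch in
userland; the helpers below are hypothetical and not from GNU ls, and
whether the parallelism wins depends entirely on the filesystem:)

#include <fcntl.h>
#include <pthread.h>
#include <sys/stat.h>

#define NTHREADS 8

struct work {
	char		**names;	/* entry names collected via readdir() */
	int		count;
	int		next;		/* next index to stat, protected by lock */
	pthread_mutex_t	lock;
	int		dfd;		/* directory descriptor for fstatat() */
};

/* Each worker pulls names off the shared list and stats them, so the
 * per-entry attribute fetches can be in flight concurrently. */
static void *stat_worker(void *arg)
{
	struct work *w = arg;
	struct stat st;
	int i;

	for (;;) {
		pthread_mutex_lock(&w->lock);
		i = (w->next < w->count) ? w->next++ : -1;
		pthread_mutex_unlock(&w->lock);
		if (i < 0)
			break;
		fstatat(w->dfd, w->names[i], &st, AT_SYMLINK_NOFOLLOW);
	}
	return NULL;
}

static void parallel_stat(int dfd, char **names, int count)
{
	struct work w = {
		.names = names, .count = count, .next = 0,
		.lock = PTHREAD_MUTEX_INITIALIZER, .dfd = dfd,
	};
	pthread_t tid[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, stat_worker, &w);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
}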

> But even for local filesystems readdirplus_lite() would allow them to
> fill in stat information they already have (either in cache or on disk),
> and may avoid doing extra work that isn't needed.  For filesystems that
> don't care, readdirplus_lite() can just be readdir()+stat() internally.

The thing to note, though, is that in the NFS implementation we are
_very_ careful about using the GETATTR information READDIRPLUS returns
when there is already an inode instantiated for that dentry. This is
precisely because we don't want to deal with the issue of
synchronisation w.r.t. an inode that may be under writeout, that may be
the subject of setattr() calls, etc. As far as we're concerned,
READDIRPLUS is a form of mass LOOKUP, not a mass inode revalidation.
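
To make the interface being discussed concrete, here is one possible
shape for readdirplus_lite(); the names, mask bits, and layout below are
assumptions drawn from this thread and the HEC POSIX I/O extension
drafts, not an agreed API:

#include <dirent.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Which per-entry attributes the caller actually needs. */
#define RDP_WANT_MODE	0x01	/* (assumed name) */
#define RDP_WANT_OWNER	0x02	/* (assumed name) */
#define RDP_WANT_SIZE	0x04	/* (assumed name) */
#define RDP_WANT_MTIME	0x08	/* (assumed name) */
#define RDP_WANT_CTIME	0x10	/* (assumed name) */

struct dirent_plus {
	struct dirent	d_dirent;	/* the usual name/ino/d_type */
	struct stat	d_stat;		/* only the requested fields are filled */
	unsigned int	d_stat_valid;	/* RDP_WANT_* bits actually returned */
};

/*
 * Fill 'buf' with up to 'bufsize' bytes of entries, returning the
 * number of entries in *nread.  A filesystem that doesn't care can
 * implement this as readdir() + stat() internally; a clustered
 * filesystem can batch or parallelise the attribute fetches and skip
 * the fields (e.g. ctime/mtime) it was not asked for.
 */
int readdirplus_lite(DIR *dir, unsigned int mask,
		     struct dirent_plus *buf, size_t bufsize, int *nread);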

Trond

