Re: readdirplus() as possible POSIX I/O API

Trond Myklebust <trond.myklebust@xxxxxxxxxx> · Tue, 05 Dec 2006 10:23:44 -0500

On Tue, 2006-12-05 at 03:26 -0700, Andreas Dilger wrote:
> On Dec 04, 2006  10:15 -0500, Trond Myklebust wrote:
> > I propose that we implement this sort of thing in the kernel via a readdir
> > equivalent to posix_fadvise(). That can give exactly the barrier
> > semantics that they are asking for, and only costs 1 extra syscall as
> > opposed to 2 (opendirplus() and readdirplus()).
> 
> I think the "barrier semantics" are something that has just crept
> into this discussion and is confusing the issue.

It is the _only_ concept that is of interest for something like NFS or
CIFS. We already have the ability to cache the information.

> The primary goal (IMHO) of this syscall is to allow the filesystem
> (primarily distributed cluster filesystems, but HFS and NTFS developers
> seem on board with this too) to avoid tens to thousands of stat RPCs in
> very common ls -R, find, etc. kind of operations.
> 
> I can't see how fadvise() could help this case?  Yes, it would tell the
> filesystem that it could do readahead of the readdir() data, but the
> app will still be doing stat() on each of the thousands of files in the
> directory, instantiating inodes and dentries on that node (which need
> locking, and potentially immediate lock revocation if the files are
> being written to by other nodes).  In some cases (e.g. rm -r, grep -r)
> that might even be a win, because the client will soon be touching all
> of those files, but not necessarily in the ls -lR, find cases.

'find' should be quite happy with the existing readdir(). It does not
need to use stat() or readdirplus() in order to recurse because
readdir() provides d_type.
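
To illustrate the d_type point, here is a minimal sketch of a recursive
walk that only falls back to lstat() when the filesystem reports
DT_UNKNOWN. The name count_subdirs() is made up for this example, and this
is not a claim about how find(1) is actually implemented. Note that d_type
is a BSD/Linux extension and not every filesystem fills it in.

```c
#define _DEFAULT_SOURCE   /* expose d_type and the DT_* constants on glibc */
#include <assert.h>
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Count the directories below 'path' (not counting 'path' itself),
 * recursing on d_type alone wherever the filesystem supplies it.
 * Returns -1 if 'path' cannot be opened. Hypothetical helper. */
long count_subdirs(const char *path)
{
    DIR *dir;
    struct dirent *de;
    long total = 0;

    assert(path != NULL);
    dir = opendir(path);
    if (dir == NULL)
        return -1;

    while ((de = readdir(dir)) != NULL) {
        char child[PATH_MAX];
        int is_dir;

        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;
        if (snprintf(child, sizeof(child), "%s/%s", path, de->d_name)
                >= (int)sizeof(child))
            continue;               /* path too long, skip this entry */

        if (de->d_type == DT_UNKNOWN) {
            /* No type from the filesystem: this is the only stat call. */
            struct stat st;
            is_dir = (lstat(child, &st) == 0 && S_ISDIR(st.st_mode));
        } else {
            is_dir = (de->d_type == DT_DIR);
        }

        if (is_dir) {
            long below = count_subdirs(child);
            total += 1 + (below > 0 ? below : 0);
        }
    }
    closedir(dir);
    return total;
}
```

On a filesystem that fills in d_type, the walk above issues no stat calls
at all, which is why the stat-RPC cost does not apply to a plain recursive
traversal.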

The locking problem is only of interest to clustered filesystems. On
local filesystems such as HFS or NTFS, and on networked filesystems like
NFS or CIFS, the only lock that matters is the parent directory's
inode->i_mutex (formerly i_sem), which is held by readdir() anyway.

If the application is able to select a statlite()-type of behaviour with
the fadvise() hints, your filesystem could be told to serve up cached
information instead of regrabbing locks. In fact that is a much more
flexible scheme, since it also allows the filesystem to background the
actual inode lookups, or to defer them altogether if that is more
efficient.
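
As a rough sketch of what the hint-based approach might look like from
user space: the application advises the directory fd once, then uses the
ordinary readdir() and stat calls. No directory-attribute advice value
exists today, so the sketch passes POSIX_FADV_SEQUENTIAL purely as a
stand-in for the proposed hint and ignores the result.
stat_dir_with_hint() is a hypothetical name, and what a filesystem would
actually do with such a hint is an assumption.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Stat every entry of 'path' after hinting the directory fd.
 * Returns the number of entries stat'ed, or -1 if the directory
 * cannot be opened. Hypothetical helper for illustration only. */
long stat_dir_with_hint(const char *path)
{
    int fd;
    DIR *dir;
    struct dirent *de;
    long n = 0;

    assert(path != NULL);
    fd = open(path, O_RDONLY | O_DIRECTORY);
    if (fd < 0)
        return -1;

    /* STAND-IN for the proposed hint: one extra syscall up front.
     * A real "cached attributes are acceptable" advice value does not
     * exist; the return value is ignored because a filesystem may
     * reject advice on a directory. */
    (void)posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    dir = fdopendir(fd);            /* takes ownership of fd on success */
    if (dir == NULL) {
        close(fd);
        return -1;
    }

    while ((de = readdir(dir)) != NULL) {
        struct stat st;

        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;
        /* With a statlite()-style hint in effect, the filesystem would
         * be free to answer this from cache (assumption). */
        if (fstatat(dirfd(dir), de->d_name, &st, AT_SYMLINK_NOFOLLOW) == 0)
            n++;
    }
    closedir(dir);
    return n;
}
```

The application code is otherwise unchanged from a plain readdir()+stat()
loop, which is the attraction: the hint costs one call and the filesystem
decides how to honour it.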

> The filesystem can't always do "stat-ahead" on the files because that
> requires instantiating an inode on the client which may be stale (lock
> revoked) by the time the app gets to it, and the app (and the VFS)  have
> no idea just how stale it is, and whether the stat is a "real" stat or
> "only" the readdir stat (because the fadvise would only be useful on
> the directory, and not all of the child entries), so it would need to
> re-stat the file.

Then provide hints that allow the app to select which behaviour it
prefers. Most (all?) apps don't _care_, and so would be quite happy with
cached information. That is why the current NFS caching model exists in
the first place.

> Also, this would potentially blow the client's real
> working set of inodes out of cache.

Why?

> Doing things en-masse with readdirplus() also allows the filesystem to
> do the stat() operations in parallel internally (which is a net win if
> there are many servers involved) instead of serially as the application
> would do.

If your application really cared, it could add threading to 'ls' to
achieve the same result. You can also have the filesystem preload that
information based on fadvise hints.
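
For what it is worth, a threaded scan might look roughly like the sketch
below: one readdir() pass collects the names, then a few workers stat
interleaved slices of the list. parallel_stat_dir(), struct job and the
worker count of 4 are all invented for this example; it is not taken from
any existing 'ls'.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

#define NTHREADS 4          /* arbitrary worker count for the sketch */

struct job {
    int dfd;                /* directory fd shared by all workers */
    char **names;           /* entry names collected by readdir() */
    size_t count;           /* number of names */
    size_t first;           /* worker handles first, first+NTHREADS, ... */
    size_t ok;              /* successful stats by this worker */
};

static void *stat_worker(void *arg)
{
    struct job *j = arg;
    struct stat st;
    size_t i;

    for (i = j->first; i < j->count; i += NTHREADS)
        if (fstatat(j->dfd, j->names[i], &st, AT_SYMLINK_NOFOLLOW) == 0)
            j->ok++;
    return NULL;
}

/* Returns the number of entries stat'ed, or -1 if 'path' cannot be
 * opened. On allocation failure the list is simply cut short. */
long parallel_stat_dir(const char *path)
{
    DIR *dir;
    struct dirent *de;
    char **names = NULL;
    size_t count = 0, cap = 0, i;
    pthread_t tid[NTHREADS];
    struct job jobs[NTHREADS];
    int started[NTHREADS] = {0};
    long total = 0;

    assert(path != NULL);
    dir = opendir(path);
    if (dir == NULL)
        return -1;

    /* Pass 1: a plain readdir() scan collects the names. */
    while ((de = readdir(dir)) != NULL) {
        if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
            continue;
        if (count == cap) {
            size_t ncap = cap ? cap * 2 : 64;
            char **tmp = realloc(names, ncap * sizeof(*names));
            if (tmp == NULL)
                break;
            names = tmp;
            cap = ncap;
        }
        names[count] = strdup(de->d_name);
        if (names[count] == NULL)
            break;
        count++;
    }

    /* Pass 2: workers stat their slices concurrently. */
    for (i = 0; i < NTHREADS; i++) {
        jobs[i] = (struct job){ dirfd(dir), names, count, i, 0 };
        started[i] = (pthread_create(&tid[i], NULL,
                                     stat_worker, &jobs[i]) == 0);
        if (!started[i])
            stat_worker(&jobs[i]);  /* no thread: do the slice inline */
    }
    for (i = 0; i < NTHREADS; i++) {
        if (started[i])
            pthread_join(tid[i], NULL);
        total += (long)jobs[i].ok;
    }

    for (i = 0; i < count; i++)
        free(names[i]);
    free(names);
    closedir(dir);
    return total;
}
```

Each worker writes only to its own struct job, so no locking is needed in
the application; older toolchains may need -pthread at link time.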

Trond
