On Dec 04, 2006 10:15 -0500, Trond Myklebust wrote:
> I propose that we implement this sort of thing in the kernel via a readdir
> equivalent to posix_fadvise(). That can give exactly the barrier
> semantics that they are asking for, and only costs 1 extra syscall as
> opposed to 2 (opendirplus() and readdirplus()).

I think the "barrier semantics" are something that has just crept into
this discussion and is confusing the issue.

The primary goal (IMHO) of this syscall is to allow the filesystem
(primarily distributed cluster filesystems, but HFS and NTFS developers
seem on board with this too) to avoid tens to thousands of stat RPCs in
very common operations such as ls -R and find.

I can't see how fadvise() could help in this case. Yes, it would tell the
filesystem that it could do readahead of the readdir() data, but the app
will still be doing stat() on each of the thousands of files in the
directory, instantiating inodes and dentries on that node (which need
locking, and potentially immediate lock revocation if the files are being
written to by other nodes). In some cases (e.g. rm -r, grep -r) that
might even be a win, because the client will soon be touching all of
those files, but not necessarily in the ls -lR and find cases.

The filesystem can't always do "stat-ahead" on the files, because that
requires instantiating an inode on the client which may be stale (lock
revoked) by the time the app gets to it. Neither the app nor the VFS has
any idea how stale it is, or whether the stat is a "real" stat or "only"
the readdir stat (the fadvise would only be useful on the directory, not
on all of the child entries), so the file would have to be stat'ed again
anyway. Stat-ahead would also risk pushing the client's real working set
of inodes out of cache.
Doing things en masse with readdirplus() also allows the filesystem to do
the stat() operations in parallel internally (which is a net win if there
are many servers involved) instead of serially, as the application would.

Cheers, Andreas

PS - I changed the topic to separate this from the openfh() thread.

-- 
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.