Re: readdirplus() as possible POSIX I/O API

On Dec 04, 2006  10:15 -0500, Trond Myklebust wrote:
> I propose that we implement this sort of thing in the kernel via a readdir
> equivalent to posix_fadvise(). That can give exactly the barrier
> semantics that they are asking for, and only costs 1 extra syscall as
> opposed to 2 (opendirplus() and readdirplus()).

I think the "barrier semantics" have just crept into this discussion
and are confusing the issue.

The primary goal (IMHO) of this syscall is to allow the filesystem
(primarily distributed cluster filesystems, but HFS and NTFS developers
seem on board with this too) to avoid tens to thousands of stat RPCs in
very common operations such as ls -R and find.

I can't see how fadvise() could help in this case.  Yes, it would tell the
filesystem that it could do readahead of the readdir() data, but the
app will still be doing stat() on each of the thousands of files in the
directory, instantiating inodes and dentries on that node (which need
locking, and potentially immediate lock revocation if the files are
being written to by other nodes).  In some cases (e.g. rm -r, grep -r)
that might even be a win, because the client will soon be touching all
of those files, but not necessarily in the ls -lR, find cases.

The filesystem can't always do "stat-ahead" on the files, because that
requires instantiating an inode on the client which may be stale (lock
revoked) by the time the app gets to it.  The app (and the VFS) have no
idea just how stale it is, or whether the stat is a "real" stat or
"only" the readdir stat (because the fadvise would only be useful on
the directory, and not on all of the child entries), so the file would
need to be stat'ed again.  Also, this would potentially blow the
client's real working set of inodes out of cache.

Doing things en masse with readdirplus() also allows the filesystem to
do the stat() operations in parallel internally (which is a net win if
there are many servers involved) instead of serially as the application
would do.
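To show the shape of call being argued for, here is a userspace stand-in.
Everything in it is an assumption made for illustration: struct entry_plus
and emu_readdirplus() are hypothetical names, not the interface proposed in
this thread, and this emulation still does one stat per entry.  The point is
only that a single call hands back the name together with its attributes, so
a real filesystem would be free to fetch them in bulk or in parallel behind
that one entry point.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical: a directory entry bundled with its attributes. */
struct entry_plus {
    char        name[NAME_MAX + 1]; /* entry name, NUL-terminated */
    struct stat st;                 /* valid only if stat_err == 0 */
    int         stat_err;           /* errno from the lookup, or 0 */
};

/* Userspace stand-in for a readdirplus()-style call (hypothetical name).
 * Returns 1 and fills *out, 0 at end of directory, -1 on readdir() error.
 * A failed attribute lookup is reported per entry in out->stat_err
 * rather than failing the whole call. */
int emu_readdirplus(DIR *dir, struct entry_plus *out)
{
    assert(dir != NULL && out != NULL);

    errno = 0;
    struct dirent *de = readdir(dir);
    if (de == NULL)
        return errno != 0 ? -1 : 0;

    snprintf(out->name, sizeof out->name, "%s", de->d_name);

    if (fstatat(dirfd(dir), de->d_name, &out->st, AT_SYMLINK_NOFOLLOW) == 0)
        out->stat_err = 0;
    else
        out->stat_err = errno;
    return 1;
}
```

A caller loops on emu_readdirplus() until it returns 0 and checks stat_err
before using st; it never issues stat() itself, which is what would let the
filesystem decide how and when to gather the attributes.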

Cheers, Andreas

PS - I changed the topic to separate this from the openfh() thread.
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
