At 05:59 PM 12/4/2006, Rob Ross wrote:
Hi all,
I don't think that the group intended that there be an
opendirplus(); rather readdirplus() would simply be called instead
of the usual readdir(). We should clarify that.
Regarding Peter Staubach's comments about no one ever using the
readdirplus() call: if people weren't performing this workload
in the first place, we wouldn't *need* this sort of call! This call
is specifically targeted at improving "ls -l" performance on large
directories, and Sage has pointed out quite nicely how that might work.
In our case (PVFS), we would essentially perform three phases of
communication with the file system for a readdirplus that was
obtaining full statistics: first grabbing the directory entries,
then obtaining metadata from servers on all objects in bulk, then
gathering file sizes in bulk. The reduction in control message
traffic is enormous, and the concurrency is much greater than in a
readdir()+stat()s workload. We'd never perform this sort of
optimization optimistically, as the cost of guessing wrong is just
too high. We would want to see the call as a proper VFS operation
that we could act upon.
The entire readdirplus() operation wasn't intended to be atomic, and
in fact the returned structure has space for an error associated
with the stat() on a particular entry, to allow for implementations
that stat() subsequently and get an error because the object was
removed between when the entry was read out of the directory and
when the stat was performed. I think this fits well with what
Andreas and others are thinking. We should clarify the description
appropriately.
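
The per-entry error slot described above can be illustrated with a small user-space emulation. This is only a sketch in Python, not the proposed interface: readdirplus_emulated, DirEntryPlus and the after_listing hook are hypothetical names made up for illustration, and a real implementation would be a VFS operation rather than a listdir-then-lstat loop.

```python
import errno
import os
import tempfile
from typing import NamedTuple, Optional


class DirEntryPlus(NamedTuple):
    """Hypothetical result record: a name plus either stat data or an errno."""
    name: str
    stat: Optional[os.stat_result]
    error: int  # 0 on success, otherwise the errno from the failed stat


def readdirplus_emulated(path, after_listing=None):
    """Non-atomic emulation: read the entries first, then stat each one.

    `after_listing` is a hook invented for this sketch; it runs between the
    two steps so that a concurrent removal can be simulated.
    """
    names = os.listdir(path)
    if after_listing is not None:
        after_listing()
    results = []
    for name in sorted(names):
        try:
            st = os.lstat(os.path.join(path, name))
            results.append(DirEntryPlus(name, st, 0))
        except OSError as exc:
            # The entry vanished (or became unreadable) after it was listed:
            # report that per entry instead of failing the whole call.
            results.append(DirEntryPlus(name, None, exc.errno))
    return results


def demo():
    with tempfile.TemporaryDirectory() as d:
        for name in ("a", "b"):
            with open(os.path.join(d, name), "w") as f:
                f.write("x")
        # Remove "b" after the directory has been read but before the stats.
        return readdirplus_emulated(d, lambda: os.remove(os.path.join(d, "b")))
```

Calling demo() yields one entry with stat data and one entry whose error field carries the errno of the failed stat, which is the non-atomic behaviour the description is meant to allow.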
I don't think that we have a readdirpluslite() variation documented
yet? Gary? It would make a lot of sense. Except that it should
probably have a better name...
Correct, we do not have that documented. I suppose we could just
have a mask like statlite and keep it to one call, perhaps.
Regarding Andreas's note that he would prefer the statlite() flags
to mean "valid", that makes good sense to me (and would obviously
apply to the so-far even more hypothetical readdirpluslite()). I
don't think there's a lot of value in returning possibly-inaccurate values?
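
To make the "valid" reading of the flags concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: StatField, statlite_emulated and the `cheap` parameter are hypothetical names, and the choice of which fields a file system can supply cheaply is invented. The point is only that the caller passes a mask of fields it needs, and gets back a mask of fields that are actually valid, with no values at all for the rest.

```python
import enum
import os
import tempfile


class StatField(enum.IntFlag):
    """Hypothetical field mask; the names are illustrative only."""
    SIZE = 1
    MTIME = 2
    MODE = 4


def statlite_emulated(path, want, cheap=StatField.MODE):
    """Return (values, valid) for `path`.

    `want` is the mask of fields the caller needs to be accurate.
    `cheap` models fields the file system has at hand anyway (an assumption
    of this sketch). Fields outside `want | cheap` are omitted rather than
    returned as possibly-inaccurate values.
    """
    st = os.stat(path)
    valid = want | cheap
    fields = {
        StatField.SIZE: ("size", st.st_size),
        StatField.MTIME: ("mtime", st.st_mtime),
        StatField.MODE: ("mode", st.st_mode),
    }
    values = {name: value
              for flag, (name, value) in fields.items()
              if flag & valid}
    return values, valid
```

A caller that asks only for StatField.SIZE would then check the returned mask before touching any other field, for example `if StatField.MTIME in valid: ...`.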
The one use that some users talk about is simply knowing that a file
is growing, which is important and useful to them; knowing exactly,
to the byte, how much it has grown seems less important to them until
they close it. On these big parallel apps, so many things can happen
that can just hang. They often use the presence of checkpoint files
and how big they are to gauge the progress of the application. Of
course there are other ways this can be accomplished, but they do
this sort of thing a lot. That is the main case I have heard of that
might benefit from "possibly-inaccurate" values. Of course, it
assumes that the inaccuracy is just old information and not bogus
information.
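
The checkpoint-watching use case can be sketched in a few lines of Python. The helper names (checkpoint_size, is_progressing) and the file name "ckpt.0" are hypothetical; the sketch only shows why a size that is somewhat old is still useful here, since only the direction of change is examined.

```python
import os
import tempfile


def checkpoint_size(path):
    """Size of the checkpoint file, or 0 if it does not exist yet."""
    try:
        return os.stat(path).st_size
    except FileNotFoundError:
        return 0


def is_progressing(samples):
    """True if the size grew between the first and the last sample.

    Only growth matters, not the exact byte count, so a stale (but not
    bogus) size still gives the right answer once it catches up.
    """
    return len(samples) >= 2 and samples[-1] > samples[0]


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "ckpt.0")  # hypothetical checkpoint file name
        samples = [checkpoint_size(path)]
        with open(path, "ab") as f:
            f.write(b"\0" * 4096)  # the application appends a checkpoint
        samples.append(checkpoint_size(path))
        return samples
```

A monitoring script would call checkpoint_size() periodically and treat a run of samples for which is_progressing() is false as a hint that the application may be hung.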
Thanks, we will put out a complete version of what we have in a
document to the Open Group site in a week or two so all the pages in
their current state are available. We could then begin some
iteration on all these comments we have gotten from the various
communities.
Thanks
Gary
Thanks everyone,
Rob
Trond Myklebust wrote:
On Mon, 2006-12-04 at 00:32 -0700, Andreas Dilger wrote:
I'm wondering if a corresponding opendirplus() (or similar) would
also be appropriate to inform the kernel/filesystem that
readdirplus() will follow, and stat information should be
gathered/buffered. Or do most implementations wait for the first
readdir() before doing any actual work anyway?
I'm not sure what some filesystems might do here. I suppose NFS has weak
enough cache semantics that it _might_ return stale cached data from the
client in order to fill the readdirplus() data, but it is just as likely
that it ships the whole thing to the server and returns everything in
one shot. That would imply everything would be at least as up-to-date
as the opendir().
Whether or not the posix committee decides on readdirplus, I propose
that we implement this sort of thing in the kernel via a readdir
equivalent to posix_fadvise(). That can give exactly the barrier
semantics that they are asking for, and only costs 1 extra syscall as
opposed to 2 (opendirplus() and readdirplus()).
Cheers
Trond