Re: NFSv4/pNFS possible POSIX I/O API standards

Sage Weil <sage@xxxxxxxxxxxx> · Thu, 30 Nov 2006 09:49:08 -0800 (PST)

On Thu, 30 Nov 2006, Christoph Hellwig wrote:
> On Wed, Nov 29, 2006 at 12:26:22AM -0800, Brad Boyer wrote:
> > For a more extreme case, hfs and hfsplus don't even have a separation
> > between directory entries and inode information. The code creates this
> > separation synthetically to match the expectations of the kernel. During
> > a readdir(), the full catalog record is loaded from disk, but all that
> > is used is the information passed back to the filldir callback. The only
> > thing that would be needed to return extra information would be code to
> > copy information from the internal structure to whatever the system call
> > used to return data to the program.
>
> In this case you can in fact already instantiate inodes from readdir.
> Take a look at the NFS code.

Sure.  And having readdirplus over the wire is a great performance win for 
NFS, but it works only because NFS metadata consistency is already weak.

Giving applications an atomic readdirplus makes things considerably 
simpler for distributed filesystems that want to provide strong 
consistency (and a reasonable interpretation of what POSIX semantics mean 
for a distributed filesystem).  In particular, it allows the application 
(e.g. ls --color or -al) to communicate to the kernel and filesystem that 
it doesn't care about the relative ordering of each subsequent stat() with 
respect to other writers (possibly on different hosts, with whom 
synchronization can incur a heavy performance penalty), but rather only 
wants a snapshot of dentry+inode state.
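
To make the pattern concrete, here is a rough userspace sketch (Python for
brevity) of what an ls -al style tool effectively does today: one readdir
pass, then one lstat() per entry. The helper name list_with_stat is made up
for illustration. Each lstat() is an independent call, so a strongly
consistent filesystem has to order every one of them against concurrent
writers, even though the caller only wanted a single snapshot.

```python
import os
import stat


def list_with_stat(path):
    """Emulate `ls -al`: one readdir pass, then one lstat() per entry.

    Hypothetical helper, for illustration only. The results are NOT a
    single snapshot: another writer may change an inode between two
    lstat() calls, or remove an entry after readdir returned it.
    """
    results = []
    for name in os.listdir(path):            # readdir(): names only
        full = os.path.join(path, name)
        try:
            st = os.lstat(full)              # separate stat() per dentry
        except FileNotFoundError:
            continue                         # entry vanished after readdir
        results.append((name, stat.filemode(st.st_mode), st.st_size))
    return results
```

An atomic readdirplus would, in effect, let the filesystem answer this whole
loop from one consistent view instead of N separately ordered stat() calls.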

As Andreas already mentioned, detecting this (exceedingly common) case may 
be possible with heuristics (e.g. watching the ordering of stat() calls vs 
the filldir results), but that's hardly ideal when a cleaner interface can 
explicitly capture the application's requirements.
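
For comparison, such a heuristic might look roughly like the toy model below.
This is only a sketch: the class name ReaddirStatDetector and the threshold
parameter are invented here, and it is not taken from any existing
filesystem. It records the names handed to filldir and checks whether
subsequent stat() calls arrive in the same order.

```python
class ReaddirStatDetector:
    """Toy model: guess that a client is doing readdir + stat-per-entry."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # in-order stats needed before we guess
        self.names = []             # names in the order filldir returned them
        self.next_index = 0         # position of the next expected stat()
        self.streak = 0             # consecutive stats that matched the order

    def on_filldir(self, name):
        """Record a name returned to the client by readdir."""
        self.names.append(name)

    def on_stat(self, name):
        """Record a stat(); return True once the pattern looks like ls -al."""
        expected = None
        if self.next_index < len(self.names):
            expected = self.names[self.next_index]
        if name == expected:
            self.next_index += 1
            self.streak += 1
        else:
            # Out-of-order or unrelated stat: start matching from the top.
            self.next_index = 0
            self.streak = 0
        return self.streak >= self.threshold
```

Any threshold here is a guess. Too low and ordinary stat() traffic gets
relaxed consistency it did not ask for; too high and the first few stats in
every listing still pay the full synchronization cost.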

sage