Re: NFSv4/pNFS possible POSIX I/O API standards

Matthew Wilcox <matthew@xxxxxx> · Wed, 6 Dec 2006 08:44:26 -0700

On Wed, Dec 06, 2006 at 09:04:00AM -0600, Rob Ross wrote:
> The openg() solution has the following advantages to what you propose. 
> First, it places the burden of the communication of the file handle on 
> the application process, not the file system. That means less work for 
> the file system. Second, it does not require that clients respond to 
> unexpected network traffic. Third, the network traffic is deterministic 
> -- one client interacts with the file system and then explicitly 
> performs the broadcast. Fourth, it does not require that the file system 
> store additional state on clients.

You didn't address the disadvantages I pointed out on December 1st in a
mail to the posix mailing list:

: I now understand this not so much as a replacement for dup() but in
: terms of being able to open by NFS filehandle, or inode number.  The
: fh_t is presumably generated by the underlying cluster filesystem, and
: is a handle that has meaning on all nodes that are members of the
: cluster.
:
: I think we need to consider security issues (that have also come up
: when open-by-inode-number was proposed).  For example, how long is the
: fh_t intended to be valid for?  Forever?  Until the cluster is rebooted?
: Could the fh_t be used by any user, or only those with credentials to
: access the file?  What happens if we revoke() the original fd?
:
: I'm a little concerned about the generation of a suitable fh_t.
: In the implementation of sutoc(), how does the kernel know which
: filesystem to ask to translate it?  It's not impossible (though it is
: implausible) that an fh_t could be meaningful to more than one
: filesystem.
:
: One possibility of fixing this could be to use a magic number at the
: beginning of the fh_t to distinguish which filesystem this belongs
: to (a list of currently-used magic numbers in Linux can be found at
: http://git.parisc-linux.org/?p=linux-2.6.git;a=blob;f=include/linux/magic.h)

Christoph has also touched on some of these points, and added some I
missed.

> In the O_CLUSTER_WIDE approach, a naive implementation (everyone passing 
> the flag) would likely cause a storm of network traffic if clients were 
> closely synchronized (which they are likely to be).

I think you're referring to a naive application, rather than a naive
cluster filesystem, right?  There's several ways to fix that problem,
including throttling broadcasts of information, having nodes ask their
immediate neighbours if they have a cache of the information, and having
the server not respond (wait for a retransmit) if it's recently sent out
a broadcast.

> However, the application change issue is actually moot; we will make 
> whatever changes inside our MPI-IO implementation, and many users will 
> get the benefits for free.

That's good.

> The readdirplus(), readx()/writex(), and openg()/openfh() were all 
> designed to allow our applications to explain exactly what they wanted 
> and to allow for explicit communication. I understand that there is a 
> tendency toward solutions where the FS guesses what the app is going to 
> do or is passed a hint (e.g. fadvise) about what is going to happen, 
> because these things don't require interface changes. But these 
> solutions just aren't as effective as actually spelling out what the 
> application wants.

Sure, but I think you're emphasising "these interfaces let us get our
job done" over the legitimate concerns that we have.  I haven't really
looked at the readdirplus() or readx()/writex() interfaces, but the
security problems with openg() makes me think you haven't really looked
at it from the "what could go wrong" perspective.  I'd be interested in
reviewing the readx()/writex() interfaces, but still don't see a document
for them anywhere.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html