Re: NFSv4/pNFS possible POSIX I/O API standards

Rob Ross <rross@xxxxxxxxxxx> · Wed, 06 Dec 2006 10:15:13 -0600

Matthew Wilcox wrote:
On Wed, Dec 06, 2006 at 09:04:00AM -0600, Rob Ross wrote:
The openg() solution has the following advantages over what you propose. 
First, it places the burden of the communication of the file handle on 
the application process, not the file system. That means less work for 
the file system. Second, it does not require that clients respond to 
unexpected network traffic. Third, the network traffic is deterministic 
-- one client interacts with the file system and then explicitly 
performs the broadcast. Fourth, it does not require that the file system 
store additional state on clients.

You didn't address the disadvantages I pointed out on December 1st in a
mail to the posix mailing list:

I coincidentally just wrote about some of this in another email. Wasn't 
trying to avoid you...

: I now understand this not so much as a replacement for dup() but in
: terms of being able to open by NFS filehandle, or inode number.  The
: fh_t is presumably generated by the underlying cluster filesystem, and
: is a handle that has meaning on all nodes that are members of the
: cluster.

Exactly.

: I think we need to consider security issues (that have also come up
: when open-by-inode-number was proposed).  For example, how long is the
: fh_t intended to be valid for?  Forever?  Until the cluster is rebooted?
: Could the fh_t be used by any user, or only those with credentials to
: access the file?  What happens if we revoke() the original fd?

The fh_t would be validated either (a) when the openfh() is called, or 
(b) on accesses using the associated capability. As Christoph pointed out, 
this really is a capability and encapsulates everything necessary for a 
particular user to access a particular file. It can be handed to others, 
and in fact that is a critical feature for our use case.

After the openfh(), the access model is identical to a previously 
open()ed file. So the question is what happens between the openg() and 
the openfh().

Our intention was to allow servers to "forget" these fh_ts at will. So a 
revoke between openg() and openfh() would kill the fh_t, and the 
subsequent openfh() would fail, or subsequent accesses would fail 
(depending on when the FS chose to validate).
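
To make the two validation points concrete, here is a minimal sketch of that lifecycle as a toy in-memory model. Everything in it (ToyServer, ToyFile, StaleHandle, the forget() method, the 16-byte handle) is hypothetical and invented for illustration; it is not the proposed interface, only one way the "server may forget" rule could behave.

```python
import os


class StaleHandle(OSError):
    """Raised when the server no longer recognizes a handle (hypothetical)."""


class ToyServer:
    """Toy model of a file server that issues handles it may later forget."""

    def __init__(self, validate_at_open=True):
        self.validate_at_open = validate_at_open
        self.live = {}  # handle bytes -> path; entries may vanish at any time

    def openg(self, path):
        # Issue an opaque handle that any process holding it may redeem.
        handle = os.urandom(16)
        self.live[handle] = path
        return handle

    def forget(self, handle):
        # Models a revoke, a server restart, or plain eviction.
        self.live.pop(handle, None)

    def resolve(self, handle):
        try:
            return self.live[handle]
        except KeyError:
            raise StaleHandle("handle no longer known to server") from None

    def openfh(self, handle):
        if self.validate_at_open:
            # Validate now: a forgotten handle fails in openfh() itself.
            return ToyFile(self, handle, self.resolve(handle))
        # Validate lazily: the failure surfaces on the first access instead.
        return ToyFile(self, handle, None)


class ToyFile:
    def __init__(self, server, handle, path):
        self.server, self.handle, self.path = server, handle, path

    def read(self):
        if self.path is None:
            self.path = self.server.resolve(self.handle)
        # Once resolved, the file behaves like any other open file.
        return "contents of " + self.path
```

In this model a forget() that lands after a successful, validated openfh() has no effect on the open file, which matches the statement that the access model after openfh() is the same as for an open()ed file. Either way the application has to be ready to fall back to a fresh openg().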

Does this help?

: I'm a little concerned about the generation of a suitable fh_t.
: In the implementation of sutoc(), how does the kernel know which
: filesystem to ask to translate it?  It's not impossible (though it is
: implausible) that an fh_t could be meaningful to more than one
: filesystem.
:
: One possibility of fixing this could be to use a magic number at the
: beginning of the fh_t to distinguish which filesystem this belongs
: to (a list of currently-used magic numbers in Linux can be found at
: http://git.parisc-linux.org/?p=linux-2.6.git;a=blob;f=include/linux/magic.h)

Christoph has also touched on some of these points, and added some I
missed.

We could use advice on this point. Certainly it's possible to encode 
information about the FS from which the fh_t originated, but we haven't 
tried to spell out exactly how that would happen. Your approach 
described here sounds good to me.
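
As a rough sketch of the magic-number idea, assuming a layout that nobody has actually specified: a 32-bit little-endian magic at the front of the fh_t, followed by an opaque filesystem-private payload. The helper names make_fh() and route_fh() are made up for this example; the two constants are values from linux/magic.h, used only as sample tags.

```python
import struct

# Two values from linux/magic.h, used here only as example tags.
EXT2_SUPER_MAGIC = 0xEF53
NFS_SUPER_MAGIC = 0x6969

# Assumed layout: 32-bit little-endian magic, then the opaque payload.
HEADER = struct.Struct("<I")


def make_fh(magic, payload):
    """Prefix a filesystem-private payload with that filesystem's magic."""
    return HEADER.pack(magic) + payload


def route_fh(fh, filesystems):
    """Return (filesystem, payload) for the filesystem whose magic tags fh."""
    if len(fh) < HEADER.size:
        raise ValueError("handle too short to carry a magic number")
    (magic,) = HEADER.unpack_from(fh)
    if magic not in filesystems:
        raise LookupError("no registered filesystem for magic 0x%X" % magic)
    return filesystems[magic], fh[HEADER.size:]
```

A magic number only identifies the filesystem type. Telling apart two mounted instances of the same type would need something more in the header, which this sketch does not attempt.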

In the O_CLUSTER_WIDE approach, a naive implementation (everyone passing 
the flag) would likely cause a storm of network traffic if clients were 
closely synchronized (which they are likely to be).

I think you're referring to a naive application, rather than a naive
cluster filesystem, right?  There's several ways to fix that problem,
including throttling broadcasts of information, having nodes ask their
immediate neighbours if they have a cache of the information, and having
the server not respond (wait for a retransmit) if it's recently sent out
a broadcast.

Yes, naive application. You're right that the file system could adapt to 
this, but on the other hand if we were explicitly passing the fh_t in 
user space, we could just use MPI_Bcast and be done with it, with an 
algorithm that is well-matched to the system, etc.
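
For a feel of the message counts, here is a small simulation of one common broadcast shape, a binomial tree. This is an assumption for illustration only: the algorithm behind MPI_Bcast is the MPI implementation's choice, and the function name below is invented. The point is that the file system sees a single openg() request, and the handle then reaches N processes in about log2(N) rounds of client-to-client messages.

```python
def binomial_broadcast(nprocs, root_value):
    """Simulate a binomial-tree broadcast from rank 0.

    Returns (values, rounds, messages), where values[i] is what rank i
    holds at the end.
    """
    values = [None] * nprocs
    values[0] = root_value
    rounds = messages = 0
    have = 1  # ranks 0..have-1 currently hold the value
    while have < nprocs:
        # Every rank that holds the value sends to one rank that does not.
        senders = min(have, nprocs - have)
        for src in range(senders):
            values[have + src] = values[src]
            messages += 1
        have += senders
        rounds += 1
    return values, rounds, messages
```

With 8 processes this takes 3 rounds and 7 messages, none of which touch the file server, compared with 8 separate opens arriving at the server at nearly the same time.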

However, the application change issue is actually moot; we will make 
whatever changes are needed inside our MPI-IO implementation, and many 
users will get the benefits for free.

That's good.

Absolutely. Same goes for readx()/writex() also, BTW, at least for 
MPI-IO users. We will build the input parameters inside MPI-IO using 
existing information from users, rather than applying data sieving or 
using multiple POSIX calls.
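
A sketch of what "building the input parameters" might look like for a simple strided pattern. The readx()/writex() signatures are not spelled out in this message, so the code only models the list of (file offset, length) pieces such a call could be given, and compares the bytes actually wanted with what data sieving would move. All function names here are hypothetical.

```python
def strided_extents(start, block_len, stride, count):
    """List the (file_offset, length) pieces of a strided access pattern."""
    if block_len > stride:
        raise ValueError("blocks would overlap: block_len exceeds stride")
    return [(start + i * stride, block_len) for i in range(count)]


def coalesce(extents):
    """Merge pieces that touch end to start, so they become one entry."""
    merged = []
    for off, length in sorted(extents):
        if merged and merged[-1][0] + merged[-1][1] == off:
            merged[-1] = (merged[-1][0], merged[-1][1] + length)
        else:
            merged.append((off, length))
    return merged


def bytes_wanted(extents):
    """Bytes the application actually asked for."""
    return sum(length for _, length in extents)


def bytes_sieved(extents):
    """Bytes moved by data sieving: one contiguous read covering all pieces."""
    first = min(off for off, _ in extents)
    last = max(off + length for off, length in extents)
    return last - first
```

For three 4-byte blocks at a 16-byte stride, the application wants 12 bytes, data sieving reads the whole 36-byte span, and the multiple-POSIX-call route issues three separate reads. A single call carrying the extent list describes the same request without either cost.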

The readdirplus(), readx()/writex(), and openg()/openfh() were all 
designed to allow our applications to explain exactly what they wanted 
and to allow for explicit communication. I understand that there is a 
tendency toward solutions where the FS guesses what the app is going to 
do or is passed a hint (e.g. fadvise) about what is going to happen, 
because these things don't require interface changes. But these 
solutions just aren't as effective as actually spelling out what the 
application wants.

Sure, but I think you're emphasising "these interfaces let us get our
job done" over the legitimate concerns that we have.  I haven't really
looked at the readdirplus() or readx()/writex() interfaces, but the
security problems with openg() make me think you haven't really looked
at it from the "what could go wrong" perspective.

I'm sorry if it seems like I'm ignoring your concerns; that isn't my 
intention. I am advocating the calls though, because the whole point in 
getting into these discussions is to improve the state of things for 
these access patterns.

Part of the problem is that the descriptions of these calls were written 
for inclusion in a POSIX document and not for discussion on this list. 
Those descriptions don't usually include detailed descriptions of 
implementation options or use cases. We should have created some 
additional documentation before coming to this list, but what is done is 
done.

In the case of openg(), the major approach to things "going wrong" is 
for the server to just forget it ever handed out the fh_t and make the 
application figure it out. We think that makes implementations 
relatively simple, because we don't require so much. It makes using this 
capability a little more difficult outside the kernel, but we're 
prepared for that.

I'd be interested in reviewing the readx()/writex() interfaces, but still
don't see a document for them anywhere.

Really? Ack! Ok. I'll talk with the others and get a readx()/writex() 
page up soon, although it would be nice to let the discussion of these 
few calm down a bit before we start with those...I'm not getting much 
done at work right now :).

Thanks for the discussion,

Rob
