Re: NFSv4/pNFS possible POSIX I/O API standards

Matthew Wilcox wrote:
On Wed, Dec 06, 2006 at 09:04:00AM -0600, Rob Ross wrote:
The openg() solution has the following advantages over what you propose. First, it places the burden of the communication of the file handle on the application process, not the file system. That means less work for the file system. Second, it does not require that clients respond to unexpected network traffic. Third, the network traffic is deterministic -- one client interacts with the file system and then explicitly performs the broadcast. Fourth, it does not require that the file system store additional state on clients.

You didn't address the disadvantages I pointed out on December 1st in a
mail to the posix mailing list:

I coincidentally just wrote about some of this in another email. Wasn't trying to avoid you...

: I now understand this not so much as a replacement for dup() but in
: terms of being able to open by NFS filehandle, or inode number.  The
: fh_t is presumably generated by the underlying cluster filesystem, and
: is a handle that has meaning on all nodes that are members of the
: cluster.

Exactly.

: I think we need to consider security issues (that have also come up
: when open-by-inode-number was proposed).  For example, how long is the
: fh_t intended to be valid for?  Forever?  Until the cluster is rebooted?
: Could the fh_t be used by any user, or only those with credentials to
: access the file?  What happens if we revoke() the original fd?

The fh_t would be validated either (a) when openfh() is called, or (b) on accesses using the associated capability. As Christoph pointed out, this really is a capability and encapsulates everything necessary for a particular user to access a particular file. It can be handed to others, and in fact that is a critical feature for our use case.

After the openfh(), the access model is identical to a previously open()ed file. So the question is what happens between the openg() and the openfh().

Our intention was to allow servers to "forget" these fh_ts at will. So a revoke between openg() and openfh() would kill the fh_t, and the subsequent openfh() would fail, or subsequent accesses would fail (depending on when the FS chose to validate).
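The handle lifecycle described above can be sketched with a toy in-memory model. This is only an illustration under stated assumptions: ToyServer, ToyFile, forget() and open_with_fallback() are hypothetical names, the handle is a random token rather than anything a real cluster file system would generate, and validation is done at openfh() time (option (a)); a real file system might instead defer it to the first access.

```python
import errno
import secrets


class ToyFile:
    """Minimal stand-in for an open file in the toy model (hypothetical)."""

    def __init__(self, server, path):
        self._server = server
        self._path = path

    def read(self):
        return self._server.files[self._path]


class ToyServer:
    """Toy in-memory stand-in for a file server that issues handles."""

    def __init__(self, files):
        self.files = dict(files)  # path -> contents
        self.handles = {}         # opaque handle -> path

    def openg(self, path):
        # Resolve the name once and return an opaque, shareable handle.
        if path not in self.files:
            raise FileNotFoundError(path)
        fh = secrets.token_bytes(16)
        self.handles[fh] = path
        return fh

    def forget(self, fh):
        # The server is allowed to drop a handle at will (e.g. on revoke).
        self.handles.pop(fh, None)

    def openfh(self, fh):
        # In this sketch the handle is validated here, at openfh() time.
        path = self.handles.get(fh)
        if path is None:
            raise OSError(errno.ESTALE, "handle no longer known to server")
        return ToyFile(self, path)


def open_with_fallback(server, fh, path):
    # Application-side recovery: if the handle was forgotten, look the
    # file up by name again and retry with a fresh handle.
    try:
        return server.openfh(fh)
    except OSError as exc:
        if exc.errno != errno.ESTALE:
            raise
        return server.openfh(server.openg(path))
```

In this model a forgotten handle simply makes openfh() fail, and the caller recovers by going back to the name, which is the "make the application figure it out" behaviour.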

Does this help?

: I'm a little concerned about the generation of a suitable fh_t.
: In the implementation of sutoc(), how does the kernel know which
: filesystem to ask to translate it?  It's not impossible (though it is
: implausible) that an fh_t could be meaningful to more than one
: filesystem.
:
: One possibility of fixing this could be to use a magic number at the
: beginning of the fh_t to distinguish which filesystem this belongs
: to (a list of currently-used magic numbers in Linux can be found at
: http://git.parisc-linux.org/?p=linux-2.6.git;a=blob;f=include/linux/magic.h)

Christoph has also touched on some of these points, and added some I
missed.

We could use advice on this point. Certainly it's possible to encode information about the FS from which the fh_t originated, but we haven't tried to spell out exactly how that would happen. Your approach described here sounds good to me.
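A minimal sketch of the magic-number idea, assuming a 4-byte big-endian prefix on the handle. The layout, the registry, the magic values and the names make_fh() and route_fh() are all made up for illustration; none of them come from linux/magic.h or from the proposal itself.

```python
import struct

# Hypothetical registry: magic number -> filesystem name.
# These values are invented for the example.
FS_BY_MAGIC = {
    0x11111111: "clusterfs_a",
    0x22222222: "clusterfs_b",
}

_HEADER = struct.Struct("!I")  # 4-byte big-endian magic prefix (assumed layout)


def make_fh(magic, payload):
    """Build a handle: magic prefix followed by filesystem-private bytes."""
    return _HEADER.pack(magic) + payload


def route_fh(fh):
    """Return (filesystem name, private payload) for a tagged handle."""
    if len(fh) < _HEADER.size:
        raise ValueError("handle too short to carry a magic number")
    (magic,) = _HEADER.unpack_from(fh)
    try:
        fs_name = FS_BY_MAGIC[magic]
    except KeyError:
        raise ValueError("no filesystem registered for magic 0x%08x" % magic)
    return fs_name, fh[_HEADER.size:]
```

With a prefix like this, the generic layer only needs the first few bytes to decide which filesystem should interpret the rest of the handle.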

In the O_CLUSTER_WIDE approach, a naive implementation (everyone passing the flag) would likely cause a storm of network traffic if clients were closely synchronized (which they are likely to be).

I think you're referring to a naive application, rather than a naive
cluster filesystem, right?  There's several ways to fix that problem,
including throttling broadcasts of information, having nodes ask their
immediate neighbours if they have a cache of the information, and having
the server not respond (wait for a retransmit) if it's recently sent out
a broadcast.

Yes, naive application. You're right that the file system could adapt to this, but on the other hand if we were explicitly passing the fh_t in user space, we could just use MPI_Bcast and be done with it, with an algorithm that is well-matched to the system, etc.
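To illustrate why an explicit user-space broadcast is cheap, here is a sketch of a binomial-tree schedule: one rank obtains the handle, and the number of ranks holding it doubles each round. A binomial tree is one common broadcast pattern; whether a particular MPI_Bcast implementation uses it is an assumption here, and binomial_bcast_schedule() is a hypothetical helper, not an MPI call.

```python
def binomial_bcast_schedule(nprocs, root=0):
    """Return a list of rounds; each round is a list of (sender, receiver).

    Ranks are numbered 0..nprocs-1. Only `root` holds the data initially.
    """
    rounds = []
    have = 1  # number of ranks (counted relative to root) holding the data
    while have < nprocs:
        sends = []
        for rel in range(have):
            dst = rel + have
            if dst < nprocs:
                # Translate root-relative ranks back to real rank numbers.
                sends.append(((rel + root) % nprocs, (dst + root) % nprocs))
        rounds.append(sends)
        have *= 2
    return rounds
```

For 8 ranks this gives 3 rounds and 7 messages in total, none of which involve the file server; only the root's single openg() does.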

However, the application change issue is actually moot; we will make whatever changes are needed inside our MPI-IO implementation, and many users will get the benefits for free.

That's good.

Absolutely. Same goes for readx()/writex() also, BTW, at least for MPI-IO users. We will build the input parameters inside MPI-IO using existing information from users, rather than applying data sieving or using multiple POSIX calls.
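The two fallbacks mentioned here, multiple POSIX calls and data sieving, can be sketched as follows for a list of (offset, length) extents. The function names are hypothetical, and this does not show readx() itself, whose signature is not given in this message; the point is only what a single call describing all extents would replace. The sketch uses os.pread, so it assumes a Unix-like system.

```python
import os
import tempfile


def read_extents_multicall(fd, extents):
    """One pread() per (offset, length) extent: many small requests."""
    return [os.pread(fd, length, offset) for offset, length in extents]


def read_extents_sieving(fd, extents):
    """Data sieving: one large read spanning all extents, then slice.

    Transfers every byte between the first and last extent, including
    the holes the caller never asked for.
    """
    if not extents:
        return []
    start = min(offset for offset, _ in extents)
    end = max(offset + length for offset, length in extents)
    buf = os.pread(fd, end - start, start)
    return [buf[offset - start:offset - start + length]
            for offset, length in extents]


def demo():
    # Write 256 known bytes to a temporary file and read three extents.
    with tempfile.TemporaryFile() as f:
        f.write(bytes(range(256)))
        f.flush()
        extents = [(0, 4), (100, 3), (250, 6)]
        a = read_extents_multicall(f.fileno(), extents)
        b = read_extents_sieving(f.fileno(), extents)
        return a, b
```

Both functions return the same pieces; they differ in how many requests are issued and how much unwanted data is moved, which is the trade-off an interface that passes the full extent list in one call is meant to avoid.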

The readdirplus(), readx()/writex(), and openg()/openfh() were all designed to allow our applications to explain exactly what they wanted and to allow for explicit communication. I understand that there is a tendency toward solutions where the FS guesses what the app is going to do or is passed a hint (e.g. fadvise) about what is going to happen, because these things don't require interface changes. But these solutions just aren't as effective as actually spelling out what the application wants.

Sure, but I think you're emphasising "these interfaces let us get our
job done" over the legitimate concerns that we have.  I haven't really
looked at the readdirplus() or readx()/writex() interfaces, but the
security problems with openg() make me think you haven't really looked
at it from the "what could go wrong" perspective.

I'm sorry if it seems like I'm ignoring your concerns; that isn't my intention. I am advocating the calls though, because the whole point in getting into these discussions is to improve the state of things for these access patterns.

Part of the problem is that the descriptions of these calls were written for inclusion in a POSIX document and not for discussion on this list. Those descriptions don't usually include detailed descriptions of implementation options or use cases. We should have created some additional documentation before coming to this list, but what is done is done.

In the case of openg(), the major approach to things "going wrong" is for the server to just forget it ever handed out the fh_t and make the application figure it out. We think that makes implementations relatively simple, because we don't require so much. It makes using this capability a little more difficult outside the kernel, but we're prepared for that.

I'd be interested in
reviewing the readx()/writex() interfaces, but still don't see a document
for them anywhere.

Really? Ack! Ok. I'll talk with the others and get a readx()/writex() page up soon, although it would be nice to let the discussion of these few calm down a bit before we start with those...I'm not getting much done at work right now :).

Thanks for the discussion,

Rob

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
