On Wed, Dec 06, 2006 at 09:04:00AM -0600, Rob Ross wrote:
> The openg() solution has the following advantages to what you propose.
> First, it places the burden of the communication of the file handle on
> the application process, not the file system. That means less work for
> the file system. Second, it does not require that clients respond to
> unexpected network traffic. Third, the network traffic is deterministic
> -- one client interacts with the file system and then explicitly
> performs the broadcast. Fourth, it does not require that the file system
> store additional state on clients.

You didn't address the disadvantages I pointed out on December 1st in a
mail to the posix mailing list:

: I now understand this not so much as a replacement for dup() but in
: terms of being able to open by NFS filehandle, or inode number. The
: fh_t is presumably generated by the underlying cluster filesystem, and
: is a handle that has meaning on all nodes that are members of the
: cluster.
:
: I think we need to consider security issues (that have also come up
: when open-by-inode-number was proposed). For example, how long is the
: fh_t intended to be valid for? Forever? Until the cluster is rebooted?
: Could the fh_t be used by any user, or only those with credentials to
: access the file? What happens if we revoke() the original fd?
:
: I'm a little concerned about the generation of a suitable fh_t.
: In the implementation of sutoc(), how does the kernel know which
: filesystem to ask to translate it? It's not impossible (though it is
: implausible) that an fh_t could be meaningful to more than one
: filesystem.
:
: One possibility of fixing this could be to use a magic number at the
: beginning of the fh_t to distinguish which filesystem this belongs
: to (a list of currently-used magic numbers in Linux can be found at
: http://git.parisc-linux.org/?p=linux-2.6.git;a=blob;f=include/linux/magic.h)

Christoph has also touched on some of these points, and added some I
missed.

> In the O_CLUSTER_WIDE approach, a naive implementation (everyone passing
> the flag) would likely cause a storm of network traffic if clients were
> closely synchronized (which they are likely to be).

I think you're referring to a naive application, rather than a naive
cluster filesystem, right? There are several ways to fix that problem,
including throttling broadcasts of information, having nodes ask their
immediate neighbours if they have a cache of the information, and having
the server not respond (wait for a retransmit) if it's recently sent out
a broadcast.

> However, the application change issue is actually moot; we will make
> whatever changes inside our MPI-IO implementation, and many users will
> get the benefits for free.

That's good.

> The readdirplus(), readx()/writex(), and openg()/openfh() were all
> designed to allow our applications to explain exactly what they wanted
> and to allow for explicit communication. I understand that there is a
> tendency toward solutions where the FS guesses what the app is going to
> do or is passed a hint (e.g. fadvise) about what is going to happen,
> because these things don't require interface changes. But these
> solutions just aren't as effective as actually spelling out what the
> application wants.

Sure, but I think you're emphasising "these interfaces let us get our
job done" over the legitimate concerns that we have. I haven't really
looked at the readdirplus() or readx()/writex() interfaces, but the
security problems with openg() make me think you haven't really looked
at it from the "what could go wrong" perspective.
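
The magic-number idea quoted above could be sketched roughly as follows.
This is only an illustration, assuming a hypothetical handle layout
(struct fh), a hypothetical registration table (fh_register(), fh_open())
and a made-up magic value for a toy filesystem; none of these names come
from the openg() proposal or from the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FH_PAYLOAD_MAX	60
#define MAX_HANDLERS	8
#define TOYFS_MAGIC	0x746f7966u	/* made up for this sketch */

/* Hypothetical handle: a filesystem magic number, then opaque bytes. */
struct fh {
	uint32_t magic;				/* which filesystem owns this */
	uint8_t  payload[FH_PAYLOAD_MAX];	/* meaningful only to that fs */
};

/* Hypothetical per-filesystem hook that resolves a handle. */
typedef int (*fh_open_fn)(const struct fh *fh);

struct fh_handler {
	uint32_t   magic;
	fh_open_fn open;
};

static struct fh_handler handlers[MAX_HANDLERS];
static size_t n_handlers;

/* Register a filesystem. Returns 0, or -1 if the magic is taken or
 * the table is full. */
static int fh_register(uint32_t magic, fh_open_fn open)
{
	size_t i;

	assert(open != NULL);
	for (i = 0; i < n_handlers; i++)
		if (handlers[i].magic == magic)
			return -1;
	if (n_handlers == MAX_HANDLERS)
		return -1;
	handlers[n_handlers].magic = magic;
	handlers[n_handlers].open = open;
	n_handlers++;
	return 0;
}

/* Dispatch on the leading magic. Returns the owning filesystem's
 * result, or -1 if no registered filesystem claims the handle. */
static int fh_open(const struct fh *fh)
{
	size_t i;

	for (i = 0; i < n_handlers; i++)
		if (handlers[i].magic == fh->magic)
			return handlers[i].open(fh);
	return -1;
}

/* Toy filesystem: treats payload[0] as a fake object number. */
static int toyfs_open(const struct fh *fh)
{
	return fh->payload[0];
}
```

A caller would register each filesystem once, e.g.
fh_register(TOYFS_MAGIC, toyfs_open), and then pass incoming handles to
fh_open(), which returns -1 for a handle that no filesystem claims. This
only answers the "which filesystem" question; it says nothing about how
long a handle stays valid or who may use it.
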
I'd be interested in reviewing the readx()/writex() interfaces, but
still don't see a document for them anywhere.