Re: [PATCH][RFC] nfsd/lockd: have locks_in_grace take a sb arg

Stanislav Kinsbursky <skinsbursky@xxxxxxxxxxxxx> · Wed, 11 Apr 2012 14:09:40 +0400

On 10.04.2012 22:45, Jeff Layton wrote:
This check is expensive (as you mentioned), but it has to be done only once, on NFS
server start.

Well, no. The subtree check happens every time nfsd processes a
filehandle -- see nfsd_acceptable().

Basically we have to turn the filehandle into a dentry and then walk
back up to the directory that's exported to verify that it is within
the correct subtree. If that fails, then we might have to do it more
than once if it's a hardlinked file.
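
For illustration, the walk described above can be modeled in user space roughly as follows. This is only a sketch under stated assumptions: `struct node` and `within_export()` are hypothetical names standing in for a dentry and for the check itself, and none of this is actual nfsd code. Each node has a single parent here; for a hardlinked file the same walk would have to be repeated for each of its parents.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a dentry: only the parent link matters here. */
struct node {
    const char *name;
    struct node *parent;    /* NULL at the filesystem root */
};

/*
 * Return true if 'target' is 'export_root' itself or lies somewhere
 * beneath it. The cost is one step per directory level, and it is paid
 * on every call.
 */
static bool within_export(const struct node *target,
                          const struct node *export_root)
{
    assert(export_root != NULL);

    for (const struct node *n = target; n != NULL; n = n->parent) {
        if (n == export_root)
            return true;
    }
    return false;
}
```

With a tree such as / -> export -> dir, `within_export(&dir, &export)` holds, while a sibling of the export directory fails the check after walking all the way up to the root.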

Wait. Looks like I'm missing something.
This subtree check has nothing to do with my proposal (if I'm not mistaken).
That option and its logic remain the same.
My proposal was to check the directories to be exported on NFS server
start. If any of the passed exports intersects with an export already
shared by another NFSd, then shut down that NFSd and print an error message.
Am I missing the point here?
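
A minimal sketch of such a start-time intersection check, assuming exports are given as absolute, normalized path strings that all refer to the same root. The helper names `is_same_or_ancestor()` and `exports_intersect()` are hypothetical. Comparing strings only makes sense when both servers see the same file system tree; a real check would have to compare resolved file system objects instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * True if 'prefix' equals 'path' or names an ancestor directory of it.
 * Assumption: both are absolute, normalized paths with no trailing
 * slash (except for the root "/").
 */
static bool is_same_or_ancestor(const char *prefix, const char *path)
{
    size_t n = strlen(prefix);

    assert(prefix[0] == '/' && path[0] == '/');

    if (strncmp(prefix, path, n) != 0)
        return false;
    if (n == 1)             /* prefix is "/", an ancestor of everything */
        return true;
    /* Match only on a component boundary: "/srv/a" must not match "/srv/ab". */
    return path[n] == '\0' || path[n] == '/';
}

/* Two exports intersect if either one contains the other. */
static bool exports_intersect(const char *a, const char *b)
{
    return is_same_or_ancestor(a, b) || is_same_or_ancestor(b, a);
}
```

Under these assumptions "/srv/nfs" and "/srv/nfs/home" intersect, while "/srv/nfs" and "/srv/nfs2" do not. A starting server would run this once against every export already shared by another instance and refuse to start on the first hit.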

Sorry I got confused with the discussion. You will need to do
something similar to what subtree checking does in order to handle
your proposal however.

Agreed. But this check should be performed only once, on NFS server start (not
on every fh lookup).

With this solution, the grace period can be simple, and no support from the
exported file system is required.
But the main problem here is that such intersections can be checked only in
the initial file system environment (containers with their own roots, set up via
chroot, can't handle this situation).
So it means that there has to be some daemon (kernel or user space) which
will handle such requests from different NFS server instances... Which in turn
means that some way of communication between this daemon and the NFS servers is
required. And unix sockets (of any kind) don't suit here, which makes this
problem more difficult.

This is a truly ugly problem, and unfortunately parts of the nfsd
codebase are very old and crusty. We've got a lot of cleanup work ahead
of us no matter what design we settle on.

This is really a lot bigger than the grace period. I think we ought to
step back a bit and consider this more "holistically" first. Do you
have a pointer to an overall design document or something?

What exactly are you asking about? The overall design of containerization?

I meant containerization of nfsd in particular.

If you are asking about some kind of white paper, then I don't have one.
But here are the main visible targets:
1) Move all network-related resources to per-net data (caches, grace period,
lockd calls, transports, your tracking engine).
2) Make the nfsd filesystem superblock per network namespace.
3) The service itself will be controlled the way Lockd is (one pool for all,
per-net resources allocated on service start).

One thing that puzzles me at the moment. We have two namespaces to deal
with -- the network and the mount namespace. With nfs client code,
everything is keyed off of the net namespace. That's not really the
case here since we have to deal with a local fs tree as well.

When an nfsd running in a container receives an RPC, how does it
determine what mount namespace it should do its operations in?

We don't use mount namespaces, so that's why I wasn't thinking about it...
But if we have 2 types of namespaces, then we have to tie the mount namespace to
the network one. I.e. we can get the desired mount namespace from per-net NFSd data.

One thing that Bruce mentioned to me privately is that we could plan to
use whatever mount namespace mountd is using within a particular net
namespace. That makes some sense since mountd is the final arbiter of
who gets access to what.

Could you please give some examples? I don't get the idea.

But please don't ask me what will happen if two or more NFS servers share the
same mount namespace... Looks like this case should be forbidden.

I'm not sure we need to forbid sharing the mount namespace. They might
be exporting completely different filesystems after all, in which case
we'd be forbidding it for no good reason.

Actually, if we make the file system responsible for grace period control, then
yes, there is no reason to forbid a shared mount namespace.

Note that it is quite easy to get lost in the weeds with this. I've been
struggling to get a working design for a clustered nfsv4 server for the
last several months and have had some time to wrestle with these
issues. It's anything but trivial.

What you may need to do in order to make progress is to start with some
valid use-cases for this stuff, and get those working while disallowing
or ignoring other use cases. We'll never get anywhere if we try to solve
all of these problems at once...

Agreed.
So, my current understanding of the situation can be summarized as follows:

1) The idea of making the grace period (and its internals) per network namespace
stays the same. But its implementation affects only the current "generic grace
period" code.

2) Your idea of making the grace period per file system looks reasonable. And maybe
this approach (using the filesystem's export operations if available) should be
used by default.
But I suggest adding a new option to exports (say, "no_fs_grace"), which will
disable this new functionality. With this option the system administrator becomes
responsible for any problems with a shared file system.
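
The decision logic in points 1) and 2) could be sketched like this. Only the option name "no_fs_grace" comes from the proposal; the flag `EXP_NO_FS_GRACE`, the `in_grace` hook, the struct layouts and `export_in_grace()` are hypothetical names chosen for illustration. Falling back to the per-net generic grace period when the file system provides no hook is likewise an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical export flag corresponding to the "no_fs_grace" option. */
#define EXP_NO_FS_GRACE 0x1u

struct fs_ops {
    /* Optional per-filesystem hook; NULL when the filesystem has none. */
    bool (*in_grace)(void);
};

struct export_entry {
    unsigned int flags;
    const struct fs_ops *ops;   /* may be NULL */
};

/* Example hook for a filesystem that reports it is still in grace. */
static bool fs_always_in_grace(void)
{
    return true;
}

/*
 * Ask the filesystem by default; use the generic per-net state when the
 * administrator opted out or the filesystem offers no hook.
 */
static bool export_in_grace(const struct export_entry *exp, bool net_in_grace)
{
    assert(exp != NULL);

    if (!(exp->flags & EXP_NO_FS_GRACE) && exp->ops && exp->ops->in_grace)
        return exp->ops->in_grace();
    return net_in_grace;
}
```

An export that sets the flag gets the generic per-net answer even when its filesystem has a hook, which is where responsibility shifts to the administrator.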

Any objections?

--
Best regards,
Stanislav Kinsbursky