Re: call_usermodehelper in containers

Stanislav Kinsbursky <skinsbursky@xxxxxxxxxxxxx> · Fri, 15 Nov 2013 14:40:10 +0400

12.11.2013 17:30, Jeff Layton wrote:
On Tue, 12 Nov 2013 17:02:36 +0400
Stanislav Kinsbursky <skinsbursky@xxxxxxxxxxxxx> wrote:

12.11.2013 15:12, Jeff Layton wrote:
On Mon, 11 Nov 2013 16:47:03 -0800
Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:

On Mon, Nov 11, 2013 at 07:18:25AM -0500, Jeff Layton wrote:
We have a bit of a problem wrt upcalls that use call_usermodehelper
with containers and I'd like to bring this to some sort of resolution...

A particularly problematic case (though there are others) is the
nfsdcltrack upcall. It basically uses call_usermodehelper to run a
program in userland to track some information on stable storage for
nfsd.

I thought the discussion at the kernel summit about this issue was:
	- don't do this.
	- don't do it.
	- if you really need to do this, fix nfsd

Sorry, I couldn't make the kernel summit so I missed that discussion. I
guess LWN didn't cover it?

In any case, I guess then that we'll either have to come up with some
way to fix nfsd here, or simply ensure that nfsd can never be started
unless root in the container has a full set of capabilities.

One sort of Rube Goldberg possibility to fix nfsd is:

- when we start nfsd in a container, fork off an extra kernel thread
    that just sits idle. That thread would need to be a descendant of the
    userland process that started nfsd, so we'd need to create it with
    kernel_thread().

- Have the kernel just start up the UMH program in the init_ns mount
    namespace as it currently does, but also pass the pid of the idle
    kernel thread to the UMH upcall.

- The program will then use /proc/<pid>/root and /proc/<pid>/ns/* to set
    itself up for doing things properly.
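The last step above could look roughly like the following userland sketch. It is only an illustration under assumptions: the function names (proc_path, join_ns, enter_container) are hypothetical, the set of namespaces joined is a guess, and it assumes the helper runs with enough privilege (CAP_SYS_ADMIN and CAP_SYS_CHROOT) and is single-threaded, as setns() requires for mount namespaces.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for setns() */
#endif
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Build "/proc/<pid>/<suffix>" in buf; 0 on success, -1 if it does not fit. */
int proc_path(char *buf, size_t len, pid_t pid, const char *suffix)
{
    int n = snprintf(buf, len, "/proc/%ld/%s", (long)pid, suffix);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Join one namespace (e.g. "mnt", "net") of the task identified by pid. */
int join_ns(pid_t pid, const char *ns_name)
{
    char path[64];
    int fd, ret, n;

    n = snprintf(path, sizeof(path), "/proc/%ld/ns/%s", (long)pid, ns_name);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;
    fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;
    ret = setns(fd, 0);         /* 0: do not restrict the namespace type */
    close(fd);
    return ret;
}

/*
 * Enter the environment of the idle thread whose pid the kernel passed in.
 * The root is opened first, because /proc may look different once the
 * mount namespace has been switched.
 */
int enter_container(pid_t pid)
{
    /* Assumed list; "mnt" is joined last on purpose. */
    static const char *const ns_names[] = { "ipc", "uts", "net", "mnt" };
    char path[64];
    int rootfd;
    size_t i;

    if (proc_path(path, sizeof(path), pid, "root") < 0)
        return -1;
    rootfd = open(path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (rootfd < 0)
        return -1;

    for (i = 0; i < sizeof(ns_names) / sizeof(ns_names[0]); i++) {
        if (join_ns(pid, ns_names[i]) < 0) {
            close(rootfd);
            return -1;
        }
    }

    /* Switch to the container's root directory. */
    if (fchdir(rootfd) < 0 || chroot(".") < 0) {
        close(rootfd);
        return -1;
    }
    close(rootfd);
    return 0;
}
```

The upcall program would call enter_container() with the pid it received and then carry on with its normal work. How the pid is handed over (argument or environment variable) is left open here.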

Note that with this mechanism we can't actually run a different binary
per container, but that's probably fine for most purposes.

Hmmm... Why can't we? We can go a bit further with the userspace idea.

We use UMH for a very limited number of user programs. Two, actually:
1) /sbin/nfs_cache_getent
2) /sbin/nfsdcltrack

No, the kernel uses them for a lot more than that. Pretty much all of
the keys API upcalls use it. See all of the callers of
call_usermodehelper. All of them are running user binaries out of the
kernel, and almost all of them are certainly broken wrt containers.

If we convert them into proxies which use /proc/<pid>/root and /proc/<pid>/ns/*, this will allow us to look up the right binary.
The only limitation here is that these "proxy" binaries must be present on the "host".
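A minimal sketch of the lookup part of such a proxy, assuming the pid of a task inside the container is passed to it. The function names are hypothetical. Note that this only locates and execs the container's copy of the binary through /proc/<pid>/root; on its own it does not change the namespaces or the root the program then runs in.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Build "/proc/<pid>/root<binary>" in buf. The binary path must be
 * absolute. Returns 0 on success, -1 on bad input or truncation.
 */
int container_binary_path(char *buf, size_t len, pid_t pid,
                          const char *binary)
{
    int n;

    if (binary == NULL || binary[0] != '/')
        return -1;
    n = snprintf(buf, len, "/proc/%ld/root%s", (long)pid, binary);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Exec the container's copy of the binary as seen through
 * /proc/<pid>/root. Returns only on failure.
 */
int exec_from_container_root(pid_t pid, const char *binary,
                             char *const argv[])
{
    char path[4096];

    if (container_binary_path(path, sizeof(path), pid, binary) < 0)
        return -1;
    if (access(path, X_OK) != 0)    /* not there or not executable */
        return -1;
    execv(path, argv);
    return -1;                      /* execv() failed */
}
```

For example, a proxy installed as /sbin/nfsdcltrack on the host could call exec_from_container_root(pid, "/sbin/nfsdcltrack", argv) and fall back to its own behaviour if that fails.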

Suppose I spawn my own container as a user, using all of this spiffy
new user namespace stuff. Then I make the kernel use
call_usermodehelper to call the upcall in the init_ns, and then trick
it into running my new "escape_from_namespace" program with "real" root
privileges.

I don't think we can reasonably assume that having the kernel exec an
arbitrary binary inside of a container is safe. Doing so inside of the
init_ns is marginally more safe, but only marginally so...

And we don't need any significant changes in kernel.

BTW, Jeff, could you remind me, please, why exactly we need to use UMH to run the binary?
What are these capabilities that force us to do so?

Nothing _forces_ us to do so, but upcalls are very difficult to handle,
and UMH has a lot of advantages over a long-running daemon launched by
userland.

Originally, I created the nfsdcltrack upcall as a running daemon called
nfsdcld, and the kernel used rpc_pipefs to communicate with it.

Everyone hated it because no one likes to have to run daemons for
infrequently used upcalls. It's a pain for users to ensure that it's
running and it's a pain to handle when it isn't. So, I was encouraged
to turn that instead into a UMH upcall.

But leaving that aside, this problem is a lot larger than just nfsd. We
have a *lot* of UMH upcalls in the kernel, so this problem is more
general than just "fixing" nfsd's.

Ok. So we are talking about a generic approach to UMH support in a container (and/or namespace).

Actually, as far as I can see, there is more than one aspect that is not supported.
One of them is executing the right binary. Another one is capabilities (and maybe there are more, like user namespaces), but I don't really care about them for now.
Executing the right binary is actually not about namespaces at all. It is about the lookup implementation in the VFS (do_execve_common).

It would be great to unshare the FS for the forked UMH kthread and swap it to the desired root. This would solve the problem of proper lookup. However, as far as I understand, this approach is not welcomed by the community.

This problem can probably be solved by constructing the full binary path (i.e. not in a container, but in the kernel thread's root context) in the UMH "init" callback. However, this will help only if the dentry is accessible from the "init" root, which is usually not true in the case of mount namespaces, if I understand them right.

--
Best regards,
Stanislav Kinsbursky