Re: call_usermodehelper in containers

Stanislav Kinsburskiy <skinsbursky@xxxxxxxxxxxxx> · Sat, 13 Feb 2016 17:08:49 +0100

13.02.2016 00:39, Ian Kent пишет:
On Fri, 2013-11-15 at 15:54 +0400, Stanislav Kinsbursky wrote:
15.11.2013 15:03, Eric W. Biederman пишет:
Stanislav Kinsbursky <skinsbursky@xxxxxxxxxxxxx> writes:

12.11.2013 17:30, Jeff Layton пишет:
On Tue, 12 Nov 2013 17:02:36 +0400
Stanislav Kinsbursky <skinsbursky@xxxxxxxxxxxxx> wrote:

12.11.2013 15:12, Jeff Layton пишет:
On Mon, 11 Nov 2013 16:47:03 -0800
Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:

On Mon, Nov 11, 2013 at 07:18:25AM -0500, Jeff Layton
wrote:
We have a bit of a problem wrt to upcalls that use
call_usermodehelper
with containers and I'd like to bring this to some sort
of resolution...

A particularly problematic case (though there are
others) is the
nfsdcltrack upcall. It basically uses
call_usermodehelper to run a
program in userland to track some information on stable
storage for
nfsd.
I thought the discussion at the kernel summit about this
issue was:
	- don't do this.
	- don't do it.
	- if you really need to do this, fix nfsd

Sorry, I couldn't make the kernel summit so I missed that
discussion. I
guess LWN didn't cover it?

In any case, I guess then that we'll either have to come up
with some
way to fix nfsd here, or simply ensure that nfsd can never
be started
unless root in the container has a full set of a full set of
capabilities.

One sort of Rube Goldberg possibility to fix nfsd is:

- when we start nfsd in a container, fork off an extra
kernel thread
      that just sits idle. That thread would need to be a
descendant of the
      userland process that started nfsd, so we'd need to
create it with
      kernel_thread().

- Have the kernel just start up the UMH program in the
init_ns mount
      namespace as it currently does, but also pass the pid
of the idle
      kernel thread to the UMH upcall.

- The program will then use /proc/<pid>/root and
/proc/<pid>/ns/* to set
      itself up for doing things properly.

Note that with this mechanism we can't actually run a
different binary
per container, but that's probably fine for most purposes.

Hmmm... Why we can't? We can go a bit further with userspace
idea.

We use UMH some very limited number of user programs. For 2,
actually:
1) /sbin/nfs_cache_getent
2) /sbin/nfsdcltrack

No, the kernel uses them for a lot more than that. Pretty much
all of
the keys API upcalls use it. See all of the callers of
call_usermodehelper. All of them are running user binaries out
of the
kernel, and almost all of them are certainly broken wrt
containers.

If we convert them into proxies, which use /proc/<pid>/root
and /proc/<pid>/ns/*, this will allow us to lookup the right
binary.
The only limitation here is presence of this "proxy" binaries
on "host".

Suppose I spawn my own container as a user, using all of this
spiffy
new user namespace stuff. Then I make the kernel use
call_usermodehelper to call the upcall in the init_ns, and then
trick
it into running my new "escape_from_namespace" program with
"real" root
privileges.

I don't think we can reasonably assume that having the kernel
exec an
arbitrary binary inside of a container is safe. Doing so inside
of the
init_ns is marginally more safe, but only marginally so...

And we don't need any significant changes in kernel.

BTW, Jeff, could you remind me, please, why exactly we need to
use UMH to run the binary?
What are this capabilities, which force us to do so?

Nothing _forces_ us to do so, but upcalls are very difficult to
handle,
and UMH has a lot of advantages over a long-running daemon
launched by
userland.

Originally, I created the nfsdcltrack upcall as a running daemon
called
nfsdcld, and the kernel used rpc_pipefs to communicate with it.

Everyone hated it because no one likes to have to run daemons
for
infrequently used upcalls. It's a pain for users to ensure that
it's
running and it's a pain to handle when it isn't. So, I was
encouraged
to turn that instead into a UMH upcall.

But leaving that aside, this problem is a lot larger than just
nfsd. We
have a *lot* of UMH upcalls in the kernel, so this problem is
more
general than just "fixing" nfsd's.

Ok. So we are talking about generic approach to UMH support in a
container (and/or namespace).

Actually, as far as I can see, there are more that one aspect,
which is not supported.
One one them is executing of the right binary. Another one is
capabilities (and maybe there are more, like user namespaces), but
I
don't really care about them for now.
Executing the right binary, actually, is not about namespaces at
all. This is about lookup implementation in VFS
(do_execve_common).

Would be great to unshare FS for forked UHM kthread and swap it to
desired root. This will solve the problem with proper lookup.
However,
as far as I understand, this approach is not welcome by the
community.
I don't understand that one.  Having a preforked thread with the
proper
environment that can act like kthreadd in terms of spawning user
mode
helpers works and is simple.  The only downside I can see is that
there
is extra overhead.

What do you mean by "simple" here? Simple to implement?
We already have a preforked thread, called "UMH", used exactly for
this purpose.
Is there?

Can you explain how the pre-forking happens please?

AFAICS a workqueue is used to run UMH helpers, I can't see any pre
-forking going on there and it doesn't appear to be possible to do
either.

Hi Ian,
I'm not sure, I understand your question.
But there is a generic "khelper" thread, which is responsible for 
spawning new kthreads to execute some binary, requested by user.
IOW, when you want to use UMH, you add a request to "khelper" workqueue, 
which, in turn, creates another thread. The new one call init callback 
and does the actual execve call.

And, if I'm not mistaken, we are trying to discuss, how to adapt
existent infrastructure for namespaces, don't we?

Beyond that though for the user mode helpers spawned to populate
security keys it is not clear which context they should be run in,
even if we do have kernel threads.

Regardless of the context itself, we need a way to pass it to kernel
thread and to put kernel thread in this context. Or I'm missing
something?

This problem, probably, can be solved by constructing full binary
path
(i.e. not in a container, but in kernel thread root context) in
UMH
"init" callack. However, this will help only is the dentry is
accessible from "init" root. Which is usually no true in case on
mount
namespaces, if I understand them right.
You are correct it can not be assumed that what is visible in one
mount
namespace is visible in another.  And of course in addition to
picking
the correct binary to run you have to set up a proper environment
for
that binary to run in.  It may be that it's configuration file is
only
avaiable at the expected location in the proper mount namespace,
even
if the binary is available in all of the mount namespaces.

Yes, you are right. So, this solution can help only in case of very
specific and simple "environment-less" programs.
So, I believe, that we should modify UMH itself to support our needs.
But I don't see, how to make the idea more pleasant for the community.
IOW, when I was talking about UMH in NFS implementation on Ksummit,
Linus's answer was something like "fix NFS".
And I can't object it, actually, because for now NFS is the only
corner case.

Jeff said, that there are a bunch of UMH calls in kernel, but this is
not solid enough to prove UHM changes, since nobody is trying to use
them in containers.

So, I doubt, that we can change UMH generically without additional use
-cases for 'containerized" UMH.

Eric

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html