On Tue, 12 Nov 2013 17:02:36 +0400
Stanislav Kinsbursky <skinsbursky@xxxxxxxxxxxxx> wrote:

> 12.11.2013 15:12, Jeff Layton wrote:
> > On Mon, 11 Nov 2013 16:47:03 -0800
> > Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
> >
> >> On Mon, Nov 11, 2013 at 07:18:25AM -0500, Jeff Layton wrote:
> >>> We have a bit of a problem wrt upcalls that use call_usermodehelper
> >>> with containers and I'd like to bring this to some sort of resolution...
> >>>
> >>> A particularly problematic case (though there are others) is the
> >>> nfsdcltrack upcall. It basically uses call_usermodehelper to run a
> >>> program in userland to track some information on stable storage for
> >>> nfsd.
> >>
> >> I thought the discussion at the kernel summit about this issue was:
> >> - don't do this.
> >> - don't do it.
> >> - if you really need to do this, fix nfsd
> >>
> >
> > Sorry, I couldn't make the kernel summit so I missed that discussion. I
> > guess LWN didn't cover it?
> >
> > In any case, I guess then that we'll either have to come up with some
> > way to fix nfsd here, or simply ensure that nfsd can never be started
> > unless root in the container has a full set of capabilities.
> >
> > One sort of Rube Goldberg possibility to fix nfsd is:
> >
> > - when we start nfsd in a container, fork off an extra kernel thread
> >   that just sits idle. That thread would need to be a descendant of the
> >   userland process that started nfsd, so we'd need to create it with
> >   kernel_thread().
> >
> > - Have the kernel just start up the UMH program in the init_ns mount
> >   namespace as it currently does, but also pass the pid of the idle
> >   kernel thread to the UMH upcall.
> >
> > - The program will then use /proc/<pid>/root and /proc/<pid>/ns/* to set
> >   itself up for doing things properly.
> >
> > Note that with this mechanism we can't actually run a different binary
> > per container, but that's probably fine for most purposes.
> >
>
> Hmmm... Why can't we?
> We can go a bit further with the userspace idea.
>
> We use UMH for a very limited number of user programs. For 2, actually:
> 1) /sbin/nfs_cache_getent
> 2) /sbin/nfsdcltrack
>

No, the kernel uses it for a lot more than that. Pretty much all of
the keys API upcalls use it. See all of the callers of
call_usermodehelper. All of them are running user binaries out of the
kernel, and almost all of them are certainly broken wrt containers.

> If we convert them into proxies, which use /proc/<pid>/root and
> /proc/<pid>/ns/*, this will allow us to look up the right binary.
> The only limitation here is the presence of these "proxy" binaries on
> the "host".
>

Suppose I spawn my own container as a user, using all of this spiffy
new user namespace stuff. Then I make the kernel use
call_usermodehelper to call the upcall in the init_ns, and then trick
it into running my new "escape_from_namespace" program with "real" root
privileges.

I don't think we can reasonably assume that having the kernel exec an
arbitrary binary inside of a container is safe. Doing so inside of the
init_ns is marginally more safe, but only marginally so...

> And we don't need any significant changes in the kernel.
>
> BTW, Jeff, could you remind me, please, why exactly we need to use UMH
> to run the binary?
> What are these capabilities which force us to do so?
>

Nothing _forces_ us to do so, but upcalls are very difficult to handle,
and UMH has a lot of advantages over a long-running daemon launched by
userland.

Originally, I created the nfsdcltrack upcall as a running daemon called
nfsdcld, and the kernel used rpc_pipefs to communicate with it.

Everyone hated it because no one likes to have to run daemons for
infrequently used upcalls. It's a pain for users to ensure that it's
running and it's a pain to handle when it isn't. So, I was encouraged
to turn that instead into a UMH upcall.

But leaving that aside, this problem is a lot larger than just nfsd.
We have a *lot* of UMH upcalls in the kernel, so this problem is more
general than just "fixing" nfsd's.

--
Jeff Layton <jlayton@xxxxxxxxxx>
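
For readers following the thread, the "proxy" idea discussed above (a helper
started in the init_ns that uses /proc/<pid>/root and /proc/<pid>/ns/* to
move itself into the container before running the real binary) could look
roughly like the sketch below. This is only an illustration under stated
assumptions, not code from nfs-utils or the kernel: the function names
container_paths, enter_container and run_in_container are hypothetical, only
the mount namespace is joined, and it does nothing to address the objection
raised above about exec'ing a container-supplied binary with real root
privileges.

```python
import ctypes
import os

# Flag value for the mount namespace, as defined in <sched.h>.
CLONE_NEWNS = 0x00020000


def container_paths(pid):
    """Return the procfs paths a proxy would use for the given pid (hypothetical helper)."""
    base = f"/proc/{pid}"
    return {"root": f"{base}/root", "mnt_ns": f"{base}/ns/mnt"}


def enter_container(pid):
    """Join pid's mount namespace and chroot into its root directory.

    Needs sufficient privileges (CAP_SYS_ADMIN and CAP_SYS_CHROOT).
    """
    paths = container_paths(pid)
    libc = ctypes.CDLL(None, use_errno=True)
    # Open both handles first, while /proc still resolves in our own view.
    root_fd = os.open(paths["root"], os.O_RDONLY | os.O_DIRECTORY)
    ns_fd = os.open(paths["mnt_ns"], os.O_RDONLY)
    try:
        if libc.setns(ns_fd, CLONE_NEWNS) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
        # Switch to the target's root directory and make it our root.
        os.fchdir(root_fd)
        os.chroot(".")
    finally:
        os.close(ns_fd)
        os.close(root_fd)


def run_in_container(pid, argv):
    """Enter the container identified by pid, then exec argv[0] as seen from inside it."""
    enter_container(pid)
    os.execv(argv[0], argv)
```

With this shape, the kernel would only need to pass the pid of the idle
thread as an extra argument, and the proxy would look up the real helper
(for example /sbin/nfsdcltrack) relative to the container's root. The other
entries under /proc/<pid>/ns/* could be joined the same way with their
respective CLONE_NEW* flags.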