On Wed, Oct 10, 2018 at 10:26:22AM -0700, Tycho Andersen wrote:
> On Wed, Oct 10, 2018 at 07:15:02PM +0200, Christian Brauner wrote:
> > On Wed, Oct 10, 2018 at 09:54:58AM -0700, Tycho Andersen wrote:
> > > On Wed, Oct 10, 2018 at 05:39:57PM +0200, Christian Brauner wrote:
> > > > On Wed, Oct 10, 2018 at 05:33:43PM +0200, Jann Horn wrote:
> > > > > On Wed, Oct 10, 2018 at 5:32 PM Paul Moore <paul@xxxxxxxxxxxxxx> wrote:
> > > > > > On Tue, Oct 9, 2018 at 9:36 AM Jann Horn <jannh@xxxxxxxxxx> wrote:
> > > > > > > +cc selinux people explicitly, since they probably have opinions on this
> > > > > >
> > > > > > I just spent about twenty minutes working my way through this thread,
> > > > > > and digging through the containers archive trying to get a good
> > > > > > understanding of what you guys are trying to do, and I'm not quite
> > > > > > sure I understand it all. However, from what I have seen, this
> > > > > > approach looks very ptrace-y to me (I imagine to others as well based
> > > > > > on the comments) and because of this I think ensuring the usual ptrace
> > > > > > access controls are evaluated, including the ptrace LSM hooks, is the
> > > > > > right thing to do.
> > > > >
> > > > > Basically the problem is that this new ptrace() API does something
> > > > > that doesn't just influence the target task, but also every other task
> > > > > that has the same seccomp filter. So the classic ptrace check doesn't
> > > > > work here.
> > > >
> > > > Just to throw this into the mix: then maybe ptrace() isn't the right
> > > > interface and we should just go with the native seccomp() approach for
> > > > now.
> > >
> > > Please no :).
> > >
> > > I don't buy your arguments that 3-syscalls vs. one is better. If I'm
> > > doing this setup with a new container, I have to do
> > > clone(CLONE_FILES), do this seccomp thing, so that my parent can pick
> > > it up again, then do another clone without CLONE_FILES, because in the
> > > general case I don't want to share my fd table with the container,
> > > wait on the middle task for errors, etc. So we're still doing a bunch
> > > of setup, and it feels more awkward than ptrace, with at least as many
> > > syscalls, and it only works for your children.
> >
> > You're talking about the case where you already have shot yourself in
> > the foot by blocking basically all other sensible ways of getting the fd
> > out.
>
> Ok, but these other ways involve syscalls too (sendmsg() or whatever).
> And if you're going to allow arbitrary policy from your users, you
> have to be maximally flexible.

So, I totally like the idea of being able to get an fd before the filter
is active. If this could be done in seccomp()-only it would be A+ (See
Andy's mail in the other thread.) But I really don't want to keep you
working on this forever. :)

>
> > Also, this was meant to show that parts of your initial justification
> > for implementing the ptrace() way of getting an fd doesn't really stand.
> > And it doesn't really. Even with ptrace() you can get into situations
> > where you're not able to get an fd. (see prior threads)
>
> Of course. I guess my point was that we shouldn't design an API that's
> impossible to use. I'll drop the notes about sendmsg() from the commit
> message.
>
> Tycho