Re: [PATCH v1 2/2] signal: add procfd_signal() syscall

"Serge E. Hallyn" <serge@xxxxxxxxxx> · Wed, 21 Nov 2018 15:39:46 -0600

On Tue, Nov 20, 2018 at 11:31:13AM +0100, Christian Brauner wrote:
> On Mon, Nov 19, 2018 at 10:59:12PM -0600, Eric W. Biederman wrote:
> > Daniel Colascione <dancol@xxxxxxxxxx> writes:
> > 
> > > On Mon, Nov 19, 2018 at 1:37 PM Christian Brauner <christian@xxxxxxxxxx> wrote:
> > >>
> > >> On Mon, Nov 19, 2018 at 01:26:22PM -0800, Daniel Colascione wrote:
> > >> > On Mon, Nov 19, 2018 at 1:21 PM, Christian Brauner <christian@xxxxxxxxxx> wrote:
> > >> > > That can be done without a loop by comparing the level counter for the
> > >> > > two pid namespaces.
> > >> > >
> > >> > >>
> > >> > >> And you can rewrite pidns_get_parent to use it. So you would instead be
> > >> > >> doing:
> > >> > >>
> > >> > >>     if (pidns_is_descendant(proc_pid_ns, task_active_pid_ns(current)))
> > >> > >>         return -EPERM;
> > >> > >>
> > >> > >> (Or you can just copy the 5-line loop into procfd_signal -- though I
> > >> > >> imagine we'll need this for all of the procfd_* APIs.)
> > >> >
> > >> > Why is any of this even necessary? Why does the child namespace we're
> > >> > considering even have a file descriptor to its ancestor's procfs? If
> > >>
> > >> Because you can send file descriptors between processes and container
> > >> runtimes tend to do that.
> > >
> > > Right. But why *would* a container runtime send one of these procfs
> > > FDs to a container?
> > >
> > >> > it has one of these FDs, it can already *read* all sorts of
> > >> > information it really shouldn't be able to acquire, so the additional
> > >> > ability to send a signal (subject to the usual permission checks)
> > >> > feels like sticking a finger in a dike that's already well-perforated.
> > >> > IMHO, we shouldn't bother with this check. The patch would be simpler
> > >> > without it.
> > >>
> > >> We will definitely not allow signaling processes in an ancestor pid
> > >> namespace! That is a security issue! I can imagine container runtimes
> > >> killing their monitoring process etc. pp. Not happening, unless someone
> > >> with deep expertise in signals can convince me otherwise.
> > >
> > > If parent namespace procfs FDs or mounts really can leak into child
> > > namespaces as easily as Aleksa says, then I don't mind adding the
> > > check. I was under the impression that if you find yourself in this
> > > situation, you already have a big problem.
> > 
> > There is one big reason to have the check, and I have not seen it
> > mentioned yet in this thread.
> > 
> > When SI_USER is set we report the pid of the sender of the signal in
> > si_pid.  When the signal comes from the kernel si_pid == 0.  When signal
> > is sent from an ancestor pid namespace si_pid also equals 0 (which is
> > reasonable).
> > 
> > A signal out to a process in a parent pid namespace such as SIGCHLD is
> > reasonable as we can map the pid.  I really don't see the point of
> > forbidding that.  From the perspective of the process in the parent pid
> > namespace it is just another process in it's pid namespace.  So it
> > should pose no problem from the perspective of the receiving process.
> > 
> > A signal to a process in a pid namespace that is neither a parent nor a
> > descendent pid namespace would be a problem, as there is no well defined
> > notion of what si_pid should be set to.  So for that case perhaps we
> > should have something like a noprocess pid that we can set.  Perhaps we
> > could set si_pid to 0xffffffff.  That would take a small extension to
> > pid_nr_ns.
> > 
> > File descriptors are not namespaced.  It is completely legitimate to use
> > file descriptors to get around limitations of namespaces.
> 
> Frankly, I don't see a good argument for why we would allow that even if
> safe. I have not heard a legitimate use-case or need for this.
> At this point I care about very simple semantics. Being able to signal
> into ancestor pid namespaces and cousin namespaces is interesting but
> makes the syscall more brittle and harder to understand.

Yeah, I'm with you on that.  We can always open that door later if a good
use case comes up, but I prefer simple at first.

> Changing pid_nr_ns() might be the solution but this function is called
> all over the place in the kernel and I'm not going to risk breaking
> something by changing it for a feature that no one so far has ever
> asked for.
> If you are ok with this then we should hold off on this. We can always
> add this feature later by removing the check when someone has a use-case
> for it.
> I'll send a v2 of the patch that keeps the restriction for now. If you
> insist on it being removed we can make the change in a follow-up
> iteration.
> 
> Christian
> 
> > 
> > Adding limitations to a file descriptor based api because someone else
> > can't set up their processes in such a way as to get the restrictions
> > they are looking for seems very sad.
> > 
> > Frankly I think it is one of the better features of namespaces that we
> > have to carefully handle and define these cases so that when the
> > inevitable leaks happen you are not immediately in a world of hurt.  All
> > of the other permission checks etc continue to do their job.  Plus you
> > are prepared for the case when someone wants their containers to have an
> > interesting communication primitive.
> > 
> > Eric
> > 
> > 
> > 
> >