Albert Cahalan wrote:
> On Mon, Jul 5, 2010 at 12:18 AM, Oren Laadan <orenl@xxxxxxxxxxxxxxx> wrote:
>> Matt Helsley wrote:
>>> On Sat, Jul 03, 2010 at 07:41:30PM -0400, Albert Cahalan wrote:
>>>> On Sat, Jul 3, 2010 at 4:32 PM, Sukadev Bhattiprolu
>>>> <sukadev@xxxxxxxxxxxxxxxxxx> wrote:
>
>> It follows that trying to set pid's in pid-namespaces _below_ you
>> simply doesn't make sense (beyond the CLONE_NEWPID case).
>
> I may have some wrong ideas about how process restart works,
> but I'd thought it would normally be done from above or from PID 1
> in the same pid namespace.
>
>> Finally, there have been objections before to allow pid-selection
>> by non-privileged process.
>
> Eh, I dearly hope that privileged processes are generally not
> even addressable (never mind creatable or accessible) from
> inside anything other than the top-level pid namespace.
>
> Well, at least nothing should get more privilege than PID 1.
> This would include having UID values that PID 1 can switch
> to and having capability sets that PID 1 can switch to, and
> any other (SE Linux, AppArmor, etc.) things too.
>
> Restarting a privileged process with a less privileged PID 1
> should result in privilege loss, and ought to require some sort of
> "--force" option to ensure the person accepts possible breakage.
>
>>>>> +static int do_clone(int (*child_fn)(void *), void *child_arg,
>>>>> + unsigned int flags_low, int nr_pids, pid_t *pids_list)
>>>> There needs to be a way to pass child_fn and child_arg
>>>> via the kernel. Besides being required for kernel-managed
>>>> stacks, it's normally a saner interface. Stack setup would
>>>> be much like the stack setup for signal handlers. Imagine
>>> I'm inclined to say this is a bad idea.
>>>
>>> I didn't think we had "kernel-managed stacks" in mainline. The most we
>>> have, to my knowledge, is the sigaltstack support and kernel threads.
>>>
>>> I don't see how being able to pass in child_fn and child_arg to the
>>> kernel improves the sanity of the interface. If anything it will make
>>> eclone even more exotic -- now at the end of the syscall we'll
>>> need to mess with the registers/stack of the child much like when we're
>>> invoking a signal handler. That just adds more arch-specific code than is
>>> necessary.
>>>
>>> Userspace wrappers are perfectly capable of invoking the child function
>>> and passing the arguments. Furthermore, passing those arguments requires
>>> expanding the argument structure or putting even greater pressure on
>>> registers (which, as you pointed out below, is an issue for vfork).
>
> BSD's rfork_thread has, among other things, these two arguments:
>
> int (*func)(void *arg)
> void *arg
>
>>>> using this for a vfork-like interface that didn't have painful
>>>> interactions with the compiler.
>> Pardon my ignorance - what sort of painful interactions?
>
> The child returns from vfork, via the same return address that
> the parent will later use. (on the stack for many architectures)
> The child then calls a function which might not have the same
> stack layout as vfork, scrambling whatever may be on the stack
> that the parent will be using to return from vfork. The parent may
> then end up using a return address that has been corrupted.
> To make this work, gcc actually recognizes vfork and has
> special handling for it.

I assumed that this is taken care of by libc rather than the
compiler, like it is done for clone(2).

Oren.
_______________________________________________
Containers mailing list
Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linux-foundation.org/mailman/listinfo/containers
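
For reference, the clone(2) case mentioned in the reply above can be
sketched roughly as follows. The glibc clone() wrapper already takes a
child function and an argument in userspace and calls the function on
the new stack, so the kernel never needs to know about either. This is
only an illustration, not code from the patch set: run_child(),
child_exit_code() and CHILD_STACK_SIZE are hypothetical names, and the
sketch assumes Linux with glibc and a stack that grows downward.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical stack size for the child; chosen arbitrarily. */
#define CHILD_STACK_SIZE (64 * 1024)

/* Hypothetical child function: its return value becomes the
 * child's exit status, courtesy of the libc wrapper. */
static int child_exit_code(void *arg)
{
	return *(int *)arg;
}

/* Hypothetical helper: start fn(arg) in a child via the glibc clone()
 * wrapper, wait for it, and return its exit status (-1 on error). */
static int run_child(int (*fn)(void *), void *arg)
{
	char *stack = malloc(CHILD_STACK_SIZE);
	int status;
	int ret = -1;
	pid_t pid;

	if (!stack)
		return -1;

	/* Assumption: the stack grows downward, so pass the top of the
	 * buffer. No CLONE_VM, so the child runs on its own copy. */
	pid = clone(fn, stack + CHILD_STACK_SIZE, SIGCHLD, arg);
	if (pid >= 0 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
		ret = WEXITSTATUS(status);

	free(stack);
	return ret;
}
```

A caller would do something like "int code = 42; run_child(child_exit_code, &code);".
Because the child starts in fn on a separate stack rather than returning
through the parent's frame, the return-address problem described for
vfork does not come up in this arrangement.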