On Thu, 13 January 2011 Daniel Lezcano <daniel.lezcano@xxxxxxx> wrote: > On 01/13/2011 10:50 PM, Bruno PrÃmont wrote: > > On Thu, 13 January 2011 Daniel Lezcano<daniel.lezcano@xxxxxxx> wrote: > >> On 01/13/2011 09:09 PM, Bruno PrÃmont wrote: > >>> On Thu, 13 January 2011 Daniel Lezcano<daniel.lezcano@xxxxxxx> wrote: > >>>> in the container implementation, we are facing the problem of a process > >>>> calling the sys_reboot syscall which of course makes the host to > >>>> poweroff/reboot. > >>>> > >>>> If we drop the cap_sys_reboot capability, sys_reboot fails and the > >>>> container reach a shutdown state but the init process stay there, hence > >>>> the container becomes stuck waiting indefinitely the process '1' to exit. > >>>> > >>>> The current implementation to make the shutdown / reboot of the > >>>> container to work is we watch, from a process outside of the container, > >>>> the<rootfs>/var/run/utmp file and check the runlevel each time the file > >>>> changes. When the 'reboot' or 'shutdown' level is detected, we wait for > >>>> a single remaining in the container and then we kill it. > >>>> > >>>> That works but this is not efficient in case of a large number of > >>>> containers as we will have to watch a lot of utmp files. In addition, > >>>> the /var/run directory must *not* mounted as tmpfs in the distro. > >>>> Unfortunately, it is the default setup on most of the distros and tends > >>>> to generalize. That implies, the rootfs init's scripts must be modified > >>>> for the container when we put in place its rootfs and as /var/run is > >>>> supposed to be a tmpfs, most of the applications do not cleanup the > >>>> directory, so we need to add extra services to wipeout the files. > >>>> > >>>> More problems arise when we do an upgrade of the distro inside the > >>>> container, because all the setup we made at creation time will be lost. > >>>> The upgrade overwrite the scripts, the fstab and so on. > >>>> > >>>> We did what was possible to solve the problem from userspace but we > >>>> reach always a limit because there are different implementations of the > >>>> 'init' process and the init's scripts differ from a distro to another > >>>> and the same with the versions. > >>>> > >>>> We think this problem can only be solved from the kernel. > >>>> > >>>> The idea was to send a signal SIGPWR to the parent of the pid '1' of the > >>>> pid namespace when the sys_reboot is called. Of course that won't occur > >>>> for the init pid namespace. > >>> Wouldn't sending SIGKILL to the pid '1' process of the originating PID > >>> namespace be sufficient (that would trigger a SIGCHLD for the parent > >>> process in the outer PID namespace. > >> This is already the case. The question is : when do we send this signal ? > >> We have to wait for the container system shutdown before killing it. > > I meant that sys_reboot() would kill the namespace's init if it's not > > called from boot namespace. > > > > See below > > > >>> (as far as I remember the PID namespace is killed when its 'init' exits, > >>> if this is not the case all other processes in the given namespace would > >>> have to be killed as well) > >> Yes, absolutely but this is not the point, reaping the container is not > >> a problem. > >> > >> What we are trying to achieve is to shutdown properly the container from > >> inside (from outside will be possible too with the setns syscall). > >> > >> Assuming the process '1234' creates a new process in a new namespace set > >> and wait for it. > >> > >> The new process '1' will exec /sbin/init and the system will boot up. > >> But, when the system is shutdown or rebooted, after the down scripts are > >> executed the kill -15 -1 will be invoked, killing all the processes > >> expect the process '1' and the caller. This one will then call > >> 'sys_reboot' and exit. Hence we still have the init process idle and its > >> parent '1234' waiting for it to die. > > This call to sys_reboot() would kill "new process '1'" instead of trying to > > operate on the HW box. > > This also has the advantage that a container would not require an informed > > parent "monitoring" it from outside (though it would not be restarted even if > > requested without such informed outside parent). > > Oh, ok. Sorry I misunderstood. > > Yes, that could be better than crossing the namespace boundaries. > > >> If we are able to receive the information in the process '1234' : "the > >> sys_reboot was called in the child pid namespace", we can take then kill > >> our child pid. If this information is raised via a signal sent by the > >> kernel with the proper information in the siginfo_t (eg. si_code > >> contains "LINUX_REBOOT_CMD_RESTART", "LINUX_REBOOT_CMD_HALT", ... ), the > >> solution will be generic for all the shutdown/reboot of any kind of > >> container and init version. > > Could this be passed for a SIGCHLD? (when namespace is reaped, and received > > by 1234 from above example assuming sys_reboot() kills the "new process '1'") > > Yes, that sounds a good idea. > > > Looks like yes, but with the need to define new values for si_code (reusing > > LINUX_REBOOT_CMD_* would certainly clash, no matter which signal is choosen). > > CLD_REBOOT_CMD_RESTART > CLD_REBOOT_CMD_HALT > CLD_REBOOT_CMD_POWER_OFF I would just map both to the same thing... > CLD_REBOOT_CMD_RESTART2 (what about the cmd buffer, shall we ignore it ?) The cmd buffer could be passed via si_ptr if we want it, otherwise it would be the same as for CLD_REBOOT_CMD_RESTART (which would have si_ptr set to NULL in case no si_code differentiation is needed) > CLD_REBOOT_CMD_KEXEC (?) I don't think kexec makes any sense inside a container, such a sys_reboot() call should probably fail or fallback to _RESTART > CLD_REBOOT_CMD_SW_SUSPEND (useful for the future checkpoint/restart) Looks reasonable > LINUX_REBOOT_CMD_CAD_ON and LINUX_REBOOT_CMD_CAD_OFF could be disabled > for a non-init pid namespace, no ? I haven't looked at how/when the state set by these is checked, but it could keep its meaning and a CAD shortcut would act on the container to which the active task on the given tty belongs. (so as if the process which would have gotten SIGINT had issued sys_reboot(LINUX_REBOOT_CMD_RESTART), permissions set aside) Bruno _______________________________________________ Containers mailing list Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linux-foundation.org/mailman/listinfo/containers