On 01/13/2011 10:50 PM, Bruno Prémont wrote:
> On Thu, 13 January 2011 Daniel Lezcano <daniel.lezcano@xxxxxxx> wrote:
>
>> On 01/13/2011 09:09 PM, Bruno Prémont wrote:
>>> On Thu, 13 January 2011 Daniel Lezcano <daniel.lezcano@xxxxxxx> wrote:
>>>> In the container implementation, we are facing the problem of a process
>>>> calling the sys_reboot syscall, which of course makes the host
>>>> power off or reboot.
>>>>
>>>> If we drop the CAP_SYS_BOOT capability, sys_reboot fails and the
>>>> container reaches a shutdown state but the init process stays there, hence
>>>> the container becomes stuck waiting indefinitely for process '1' to exit.
>>>>
>>>> The current implementation to make the shutdown / reboot of the
>>>> container work is that we watch, from a process outside of the container,
>>>> the <rootfs>/var/run/utmp file and check the runlevel each time the file
>>>> changes. When the 'reboot' or 'shutdown' level is detected, we wait for
>>>> a single remaining process in the container and then we kill it.
>>>>
>>>> That works, but it is not efficient with a large number of
>>>> containers as we will have to watch a lot of utmp files. In addition,
>>>> the /var/run directory must *not* be mounted as tmpfs in the distro.
>>>> Unfortunately, that is the default setup on most distros and tends
>>>> to become general. That implies the rootfs init scripts must be modified
>>>> for the container when we put its rootfs in place and, as /var/run is
>>>> supposed to be a tmpfs, most applications do not clean up the
>>>> directory, so we need to add extra services to wipe out the files.
>>>>
>>>> More problems arise when we upgrade the distro inside the
>>>> container, because all the setup we made at creation time will be lost.
>>>> The upgrade overwrites the scripts, the fstab and so on.
>>>>
>>>> We did what was possible to solve the problem from userspace, but we
>>>> always reach a limit because there are different implementations of the
>>>> 'init' process, and the init scripts differ from one distro to another
>>>> and from one version to the next.
>>>>
>>>> We think this problem can only be solved from the kernel.
>>>>
>>>> The idea was to send a SIGPWR signal to the parent of pid '1' of the
>>>> pid namespace when sys_reboot is called. Of course that won't occur
>>>> for the init pid namespace.
>>> Wouldn't sending SIGKILL to the pid '1' process of the originating PID
>>> namespace be sufficient? (That would trigger a SIGCHLD for the parent
>>> process in the outer PID namespace.)
>> This is already the case. The question is: when do we send this signal?
>> We have to wait for the container system shutdown before killing it.
> I meant that sys_reboot() would kill the namespace's init if it's not
> called from the boot namespace.
>
> See below
>
>>> (as far as I remember the PID namespace is killed when its 'init' exits;
>>> if this is not the case all other processes in the given namespace would
>>> have to be killed as well)
>> Yes, absolutely, but this is not the point; reaping the container is not
>> a problem.
>>
>> What we are trying to achieve is to shut down the container properly from
>> inside (from outside will be possible too with the setns syscall).
>>
>> Assume the process '1234' creates a new process in a new namespace set
>> and waits for it.
>>
>> The new process '1' will exec /sbin/init and the system will boot up.
>> But when the system is shut down or rebooted, after the down scripts are
>> executed, kill -15 -1 will be invoked, killing all the processes
>> except process '1' and the caller. The caller will then call
>> 'sys_reboot' and exit. Hence we still have the init process idle and its
>> parent '1234' waiting for it to die.
> This call to sys_reboot() would kill "new process '1'" instead of trying to
> operate on the HW box.
> This also has the advantage that a container would not require an informed
> parent "monitoring" it from outside (though, without such an informed outside
> parent, it would not be restarted even if a restart was requested).

Oh, ok. Sorry, I misunderstood. Yes, that could be better than crossing
the namespace boundaries.

>> If we are able to receive the information in the process '1234' that "the
>> sys_reboot was called in the child pid namespace", we can then kill
>> our child pid. If this information is raised via a signal sent by the
>> kernel with the proper information in the siginfo_t (e.g. si_code
>> contains "LINUX_REBOOT_CMD_RESTART", "LINUX_REBOOT_CMD_HALT", ...), the
>> solution will be generic for shutdown/reboot of any kind of
>> container and init version.
> Could this be passed for a SIGCHLD? (when namespace is reaped, and received
> by 1234 from above example assuming sys_reboot() kills the "new process '1'")

Yes, that sounds like a good idea.

> Looks like yes, but with the need to define new values for si_code (reusing
> LINUX_REBOOT_CMD_* would certainly clash, no matter which signal is chosen).

CLD_REBOOT_CMD_RESTART
CLD_REBOOT_CMD_HALT
CLD_REBOOT_CMD_POWER_OFF
CLD_REBOOT_CMD_RESTART2 (what about the cmd buffer, shall we ignore it?)
CLD_REBOOT_CMD_KEXEC (?)
CLD_REBOOT_CMD_SW_SUSPEND (useful for the future checkpoint/restart)

LINUX_REBOOT_CMD_CAD_ON and LINUX_REBOOT_CMD_CAD_OFF could be disabled for a
non-init pid namespace, no?

_______________________________________________
Containers mailing list
Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linux-foundation.org/mailman/listinfo/containers
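
The SIGCHLD idea from the thread can be sketched from the side of the
monitoring parent ('1234' in the example). This is only a rough illustration
in Python, not the proposed kernel change. The CLD_REBOOT_CMD_* names come
from the discussion, but the numeric values below are placeholders invented
for this sketch and exist in no kernel header, and decide(), wait_child() and
spawn_and_kill() are hypothetical helpers. The child is an ordinary forked
process killed with SIGKILL, standing in for a container init that
sys_reboot() would kill.

```python
import os
import signal

# Hypothetical si_code values: the thread only proposes the names.
# These numbers are placeholders made up for this sketch.
CLD_REBOOT_CMD_RESTART = 1001
CLD_REBOOT_CMD_HALT = 1002
CLD_REBOOT_CMD_POWER_OFF = 1003


def decide(si_code):
    """Map the si_code seen by the parent to the action it should take."""
    if si_code == CLD_REBOOT_CMD_RESTART:
        return "restart"
    if si_code in (CLD_REBOOT_CMD_HALT, CLD_REBOOT_CMD_POWER_OFF):
        return "stop"
    # Ordinary exit or kill: no reboot request is attached.
    return "unknown"


def wait_child(pid):
    """Block until pid terminates and return (si_code, si_status)."""
    info = os.waitid(os.P_PID, pid, os.WEXITED)
    return info.si_code, info.si_status


def spawn_and_kill():
    """Fork an idle child, kill it, and report what the parent observes."""
    pid = os.fork()
    if pid == 0:
        # Child: sit idle, like an init left alone after 'kill -15 -1'.
        signal.pause()
        os._exit(0)
    # Parent: this kill stands in for the kernel killing the namespace's init.
    os.kill(pid, signal.SIGKILL)
    return wait_child(pid)
```

With today's si_code values the parent can tell that the child was killed, but
decide() returns "unknown" because nothing says whether a restart or a halt was
requested. That missing piece is what the proposed CLD_REBOOT_CMD_* codes
would carry.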