Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> writes: > On Mon, Oct 9, 2023 at 11:30 AM Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote: >> >> Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> writes: >> >> > On Mon, Oct 2, 2023 at 2:18 PM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote: >> > [..] >> >> > > Such freezing is already being done if kernel supports KEXEC_JUMP and >> >> > > kexec_image->preserve_context is true. However, doing it if either of these are >> >> > > not true prevents crashes/races. >> >> > >> >> > The KEXEC_JUMP case is something else entirely. It is supposed to work >> >> > like suspend to RAM. Maybe reboot should as well, but I am >> >> > uncomfortable making a generic device fix kexec specific. >> >> >> >> I see your point of view. I think regular reboot should also be fixed >> >> to avoid similar crash possibilities. I am happy to make a change for >> >> that similar to this patch if we want to proceed that way. >> >> >> >> Thoughts? >> > >> > Just checking how we want to proceed, is the consensus that we should >> > prevent kernel crashes without relying on userspace stopping all >> > processes? Should we fix regular reboot syscall as well and not just >> > kexec reboot? >> >> It just occurred to me there is something very fishy about all of this. >> >> What userspace do you have using kexec (not kexec on panic) that doesn't >> preform the same userspace shutdown as a normal reboot? >> >> Quite frankly such a userspace is buggy, and arguably that is where you >> should start fixing things. > > It is a simple unit test that tests kexec support by kexec-rebooting > the kernel. I don't think SIGSTOP/SIGKILL'ing during kexec-reboot is > ideal because in a real panic-on-kexec type crash, that may not happen > and so does not emulate the real world that well. I think we want the > kexec-reboot to do a *reboot* without crashing the kernel while doing > so. Ricardo/Steve can chime on what they feel as well. This is confusing. You have a unit test that, that tests kexec on panic using a the full kexec reboot. The two are fundamentally similar but you aren't going to have a valid test case if you mix them. There is a whole kernel module that tests more interesting cases, for the simple case you probably just want to do: echo 'p' > /proc/sysrq-trigger At least I think it is p that causes a kernel-panic. That will ensure you are exercising the kexec on panic code path. That performs the minimal shutdown in the kernel. >> That way you can get the orderly shutdown >> of userspace daemons/services along with an orderly shutdown of >> everything the kernel is responsible for. > > Fixing in userspace is an option but people are not happy that the > kernel can crash like that. In a kexec on panic scenario the kernel needs to perform that absolute bare essential shutdown before calling kexec (basically nothing). During kexec-on-panic nothing can be relied upon because we don't know what is broken. If that is what you care about (as suggested by the unit test) you need to fix the device initialization. In a normal kexec scenario the whole normal reboot process is expected. I have no problems with fixing the kernel to handle that scenario, but in the real world the entire orderly shutdown both, kernel and userspace should be performed. >> At the kernel level a kexec reboot and a normal reboot have been >> deliberately kept as close as possible. Which is why I say we should >> fix it in reboot. > > You mean fix it in userspace? No. I mean in the kernel the orderly shutdown for a kexec reboot and an ordinary reboot are kept as close to the same as possible. It should be the case that the only differences between the two is that in once case system firmware takes over after the orderly shutdown, and in the other case a new kernel takes over after the orderly shutdown. Eric _______________________________________________ kexec mailing list kexec@xxxxxxxxxxxxxxxxxxx http://lists.infradead.org/mailman/listinfo/kexec