Hi Eric, On Fri, Sep 29, 2023 at 12:01 PM Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote: > > "Joel Fernandes (Google)" <joel@xxxxxxxxxxxxxxxxx> writes: > > > During kexec reboot, it is possible for a race to occur between > > device_shutdown() and userspace. This causes accesses to GPU after pm_runtime > > suspend has already happened. Fix this by calling freeze_processes() before > > device_shutdown(). > > Is there any reason why this same race with between sys_kexec and the > adreno_ioctl can not happen during a normal reboot? Thanks for the response. It can happen during a normal reboot. I think the reason it does not show up in the wild is because the "reboot" command implementation typically sends one of SIGSTOP or SIGKILL to all processes which effectively prevents the race. In any case, there is also a school of thought that says the kernel should be resilient to crashes and a userspace workaround involving sending signals could be looked at as papering over the real issue. I do sympathize/agree with that school of thought as well. > Is there any reason why there is not a .shutdown method to prevent the > race? > I would think the thing to do is to prevent this race in > kernel_restart_prepare or in the GPUs .shutdown method. As I don't see > anything that would prevent this during a normal reboot. What you're saying is essentially what I remember trying, the issue is not in the GPU driver but rather there the interconnect in the SoC shutdown and causes an "SError" exception if the CPU tries to access the memory locations, as also seen in the stack. I was not able to trace exactly when the interconnect becomes unavailable and perhaps there is a possibility of a more intricate fix where we can signal the GPU driver to not access the bus anymore, but my suspicion is that will add a lot of complexity and perhaps leave the door open to similar issues. > > Such freezing is already being done if kernel supports KEXEC_JUMP and > > kexec_image->preserve_context is true. However, doing it if either of these are > > not true prevents crashes/races. > > The KEXEC_JUMP case is something else entirely. It is supposed to work > like suspend to RAM. Maybe reboot should as well, but I am > uncomfortable making a generic device fix kexec specific. I see your point of view. I think regular reboot should also be fixed to avoid similar crash possibilities. I am happy to make a change for that similar to this patch if we want to proceed that way. Thoughts? thanks, - Joel _______________________________________________ kexec mailing list kexec@xxxxxxxxxxxxxxxxxxx http://lists.infradead.org/mailman/listinfo/kexec