On 03/05/2022 18:56, Evan Green wrote: > Hi Guilherme, > [...] >> Do you agree with that, or prefer really a parameter in >> gsmi_shutdown_reason() ? I'll follow your choice =) > > I'm fine with either, thanks for the link. Mostly I want to make sure > other paths to gsmi_shutdown_reason() aren't also converted to a try. Hi Evan, thanks for the prompt response! So, I'll proceed like I did in s390, for consistency. > [...] >> Reasoning: the problem with your example is that, by default, secondary >> CPUs are disabled in the panic path, through an IPI mechanism. IPIs take >> precedence and interrupt the work in these CPUs, effectively >> interrupting the "polite work" with the lock held heh > > The IPI can only interrupt a CPU with irqs disabled if the IPI is an > NMI. I haven't looked before to see if we use NMI IPIs to corral the > other CPUs on panic. On x86, I grepped my way down to > native_stop_other_cpus(), which looks like it does a normal IPI, waits > 1 second, then does an NMI IPI. So, if a secondary CPU has the lock > held, on x86 it has roughly 1s to finish what it's doing and re-enable > interrupts before smp_send_stop() brings the NMI hammer down. I think > this should be more than enough time for the secondary CPU to get out > and release the lock. > > So then it makes sense to me that you're fixing cases where we > panicked with the lock held, or hung with the lock held. Given the 1 > second grace period x86 gives us, I'm on board, as that helps mitigate > the risk that we bailed out early with the try and should have spun a > bit longer instead. Thanks. > > -Evan Well, in the old path without "crash_kexec_post_notifiers", we indeed end-up relying on native_stop_other_cpus() for x86 as you said, and the "1s rule" makes sense. But after this series (or even before, if the kernel parameter "crash_kexec_post_notifiers" was used) the function used to stop CPUs in the panic path is crash_smp_send_stop(), and the call chain is like: Main CPU: crash_smp_send_stop() --kdump_nmi_shootdown_cpus() ----nmi_shootdown_cpus() Then, in each CPU (except the main one, running panic() path), we execute kdump_nmi_callback() in NMI context. So, we seem to indeed interrupt any context (even with IRQs disabled), increasing the likelihood of the potential lockups due to stopped CPUs holding the locks heheh Thanks again for the good discussion, let me know if anything I'm saying doesn't make sense - this crash path is a bit convoluted, specially in x86, I might have understood something wrongly =) Cheers, Guilherme