Hi Petr! On Wed, 26 Feb 2025 16:49:26 +0100, Petr Mladek wrote: >On Sat 2025-02-22 14:44:05, Ryo Takakura wrote: >> On Fri, 21 Feb 2025 16:23:07 -0500, Hamza Mahfooz wrote: >> >On Fri, Feb 21, 2025 at 11:23:28AM +0900, Ryo Takakura wrote: >> >> On Thu, 20 Feb 2025 17:53:00 -0500, Hamza Mahfooz wrote: >> >> >Since, the panic handlers may require certain cpus to be online to panic >> >> >gracefully, we should call them before turning off SMP. Without this >> >> >re-ordering, on Hyper-V hv_panic_vmbus_unload() times out, because the >> >> >vmbus channel is bound to VMBUS_CONNECT_CPU and unless the crashing cpu >> >> >is the same as VMBUS_CONNECT_CPU, VMBUS_CONNECT_CPU will be offlined by >> >> >crash_smp_send_stop() before the vmbus channel can be deconstructed. >> >> > >> >> So maybe panic_other_cpus_shutdown() should be palced after >> >> atomic_notifier_call_chain() along with printk_legacy_allow_panic_sync() >> >> like below? >> >> >> >> ----- BEGIN ----- >> >> diff --git a/kernel/panic.c b/kernel/panic.c >> >> index d8635d5cecb2..7ac40e85ee27 100644 >> >> --- a/kernel/panic.c >> >> +++ b/kernel/panic.c >> >> @@ -372,16 +372,16 @@ void panic(const char *fmt, ...) >> >> if (!_crash_kexec_post_notifiers) >> >> __crash_kexec(NULL); >> >> >> >> - panic_other_cpus_shutdown(_crash_kexec_post_notifiers); >> >> - >> >> - printk_legacy_allow_panic_sync(); >> >> - >> >> /* >> >> * Run any panic handlers, including those that might need to >> >> * add information to the kmsg dump output. >> >> */ >> >> atomic_notifier_call_chain(&panic_notifier_list, 0, buf); >> >> >> >> + panic_other_cpus_shutdown(_crash_kexec_post_notifiers); >> >> + >> >> + printk_legacy_allow_panic_sync(); >> >> >> >> panic_print_sys_info(false); >> >> >> >> kmsg_dump_desc(KMSG_DUMP_PANIC, buf); >> >> ----- END ----- >> > >> >Ya, that looks fine to me, that's actually how I had it initally, but I >> >wasn't sure if it had to go before the panic handlers. So, I erred on >> >the side of caution. > >The ordering (stopping CPUs before allowing printk_legacy loop) >is important from the printk POV. So, keep it, please. Thanks for the check. >> I see, sorry that I was only speaking in relation to stored backtraces. >> It seems that printk_legacy_allow_panic_sync() is placed before >> atomic_notifier_call_chain() so that it can handle flushing before calling >> any panic handlers as described [0]. > >> [0] https://lore.kernel.org/lkml/ZeHSgZs9I3Ihvpye@alley/ > >> I'm not really familar with the problems associated with panic handlers >> so I hope maybe John and Petr can help on this matter... > >Honestly, I do not have much experience with failures of the panic >notifiers. But I saw a patchset which tried to add filtering of >some problematic ones, see >https://lore.kernel.org/lkml/20220108153451.195121-1-gpiccoli@xxxxxxxxxx/ > >I did not like the way of ad-hoc filtering. The right solution was to >fix the problematic notifiers. > >Anyway, it went out that the situation was not that easy. The notifiers >do various things. Some of them just printing extra information. Others >stopped or suspended some devices or services. Some should be called >before and some after crash_dump. > >The outcome was a monster-patchset which tried to fix some problematic >notifiers and split them into more notifier chains, see >https://lore.kernel.org/all/20220427224924.592546-1-gpiccoli@xxxxxxxxxx/ > >Some of the fixes were accepted but the split has never been done. I see. I went through some of the discussions on the thread [0] and I can see how complicated the subject is... >My opinion: > >1. The best solution would be to make the problematic notifier working > with stopped CPUs. The discussion around [v2] suggests that the author > made it working at least for x86_64, see > https://lore.kernel.org/r/20250221213055.133849-1-hamzamahfooz@xxxxxxxxxxxxxxxxxxx I agree. But I also like the below solution as well! >2. Another good solution might be to do the split of the notifier > chain, for an example, see > https://lore.kernel.org/lkml/Yn0TnsWVxCcdB2yO@alley/ > > The problematic notifier can be then added into a chain which > is called before stopping CPUs. Thanks for sharing this! Such an interesting discussion on what and when should be handled in panic path. I think I have a better picture of panic() now :). >3. In the worst case, you could change the ordering as proposed above. > I am just afraid that it might bring in new problems. There might > be notifiers which were not tested with more running CPUs... > > >In general, the system is in an unpredictable state when panic() is >called. Notifiers should not expect that non-panic CPUs will be >able to handle any requests. > >Also it looks like a good idea to stop non-panic CPUs as soon as possible. >Otherwise, they might create more harm than good. I see. I also think that 3. might introduce new problems... It goes against the general statements which you pointed out in which case its not really a problem of handler anymore. Sincerely, Ryo Takakura >Best Regards, >Petr [0] https://lore.kernel.org/lkml/20220427224924.592546-25-gpiccoli@xxxxxxxxxx/