[Added printk maintainer and kdb folks]

Hi Jocelyn,

On 2024-03-01, Jocelyn Falempe <jfalempe@xxxxxxxxxx> wrote:
> While writing a panic handler for drm devices [1], I needed a way to
> test it without crashing the machine.
> So from debugfs, I called
> atomic_notifier_call_chain(&panic_notifier_list, ...), but it has the
> side effect of calling all other panic notifiers registered.
>
> So Sima suggested to move that to the generic panic code, and test all
> panic notifiers with a dedicated debugfs interface.
>
> I can move that code to kernel/, but before doing that, I would like to
> know if you think that's the right way to test the panic code.

One major event that happens before the panic notifiers is
panic_other_cpus_shutdown(). This can cause special situations because
CPUs can be stopped while holding resources (such as raw spin
locks). And these are the situations that make it so tricky to have safe
and reliable notifiers. If triggered from debugfs, these situations will
never occur.

My concern is that the tests via debugfs will always succeed, but in the
real world panic notifiers are failing/hanging/exploding. IMHO useful
panic testing requires real panic'ing.

For my printk panic tests I trigger unknown NMIs while booting with
"unknown_nmi_panic". Particularly with Qemu this is quite easy and
amazingly effective at catching problems. In fact, a recent printk
series [0] fixed seven issues that were found through this method of
panic testing.

> The second question is how to simulate a panic context in a
> non-destructive way, so we can test the panic notifiers in CI, without
> crashing the machine.

I'm wondering if a "fake panic" can be implemented that quiesces all the
other CPUs via NMI (similar to kdb) and then calls the panic
notifiers. And finally releases everything back to normal. That might
produce a fairly realistic panic situation and should be fairly
non-destructive (depending on what the notifiers do and how long they
take).

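For reference, the Qemu-based "real panic" test mentioned earlier could
be set up roughly as in the sketch below. This is only an illustration
and not taken from the thread: the kernel image path, root filesystem
image, monitor socket path, and the helper names boot_guest and
inject_nmi are all placeholders. It relies on the "unknown_nmi_panic"
boot argument and on the "nmi" command of the Qemu human monitor.

```shell
#!/bin/sh
# Sketch only: these paths and the socket name are placeholders.
KERNEL=${KERNEL:-arch/x86/boot/bzImage}
ROOTFS=${ROOTFS:-rootfs.img}
MON_SOCK=${MON_SOCK:-/tmp/qemu-monitor.sock}

# unknown_nmi_panic makes the guest kernel panic when it receives an
# NMI that it cannot attribute to a known source.
APPEND="console=ttyS0 root=/dev/sda unknown_nmi_panic"

# Boot the guest with its serial console on stdio and the human monitor
# on a UNIX socket, so that NMIs can be injected from another terminal.
boot_guest() {
    qemu-system-x86_64 \
        -smp 4 -m 2G \
        -kernel "$KERNEL" \
        -append "$APPEND" \
        -drive "file=$ROOTFS,format=raw" \
        -nographic \
        -monitor "unix:$MON_SOCK,server,nowait"
}

# Send the monitor's "nmi" command, which injects an NMI into the guest.
inject_nmi() {
    echo nmi | socat - "UNIX-CONNECT:$MON_SOCK"
}
```

After sourcing the file, run boot_guest in one terminal and call
inject_nmi from a second one at different points during boot. Each
injection should end in a real panic, including the shutdown of the
other CPUs.
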
> The worst case for a panic notifier, is when the panic occurs in NMI
> context, but I don't know how to simulate that. The goal would be to
> find early if a panic notifier tries to sleep, or do other things that
> are not allowed in a panic context.

Maybe with a new boot argument "unknown_nmi_fake_panic" that triggers
the fake panic instead?

John Ogness

[0] https://lore.kernel.org/lkml/20240207134103.1357162-1-john.ogness@xxxxxxxxxxxxx