Re: [RFC] How to test panic handlers, without crashing the kernel

On 04/03/2024 22:12, John Ogness wrote:
[Added printk maintainer and kdb folks]

Hi Jocelyn,

On 2024-03-01, Jocelyn Falempe <jfalempe@xxxxxxxxxx> wrote:
While writing a panic handler for drm devices [1], I needed a way to
test it without crashing the machine.
So from debugfs, I called
atomic_notifier_call_chain(&panic_notifier_list, ...), but it has the
side effect of calling all other panic notifiers registered.

So Sima suggested moving that to the generic panic code, and testing all
panic notifiers with a dedicated debugfs interface.

I can move that code to kernel/, but before doing that, I would like to
know if you think that's the right way to test the panic code.

One major event that happens before the panic notifiers is
panic_other_cpus_shutdown(). This can cause special situations because
CPUs can be stopped while holding resources (such as raw spin
locks). And these are the situations that make it so tricky to have safe
and reliable notifiers. If triggered from debugfs, these situations will
never occur.

My concern is that the tests via debugfs will always succeed, but in the
real world panic notifiers are failing/hanging/exploding. IMHO useful
panic testing requires real panic'ing.

Yes, but for the drm panic, it's still useful to check that the output is working (i.e. make sure the color format and the framebuffer address are good). Also, I've reworked the debugfs patch so that I don't have to call all panic notifiers. It's now per device, so you can trigger the drm_panic handler on a specific GPU.
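
To illustrate the difference between triggering the whole chain and triggering one device, here is a minimal userspace sketch. It is only a model: notifier_chain, chain_register(), chain_call_all(), chain_call_one() and count_hits() are hypothetical names made up for this example, not kernel APIs and not the code from the debugfs patch.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only; none of these names are kernel APIs. */
#define MAX_NOTIFIERS 8

typedef int (*panic_cb)(void *data);

struct notifier {
	panic_cb cb;
	void *data;	/* stands in for the device the handler belongs to */
};

struct notifier_chain {
	struct notifier entries[MAX_NOTIFIERS];
	size_t count;
};

/* Append a callback to the chain; returns 0 on success, -1 if full. */
static int chain_register(struct notifier_chain *chain, panic_cb cb, void *data)
{
	assert(chain != NULL && cb != NULL);
	if (chain->count >= MAX_NOTIFIERS)
		return -1;
	chain->entries[chain->count].cb = cb;
	chain->entries[chain->count].data = data;
	chain->count++;
	return 0;
}

/* Chain-wide trigger: every registered callback runs. Returns how many ran. */
static size_t chain_call_all(struct notifier_chain *chain)
{
	size_t i;

	assert(chain != NULL);
	for (i = 0; i < chain->count; i++)
		chain->entries[i].cb(chain->entries[i].data);
	return chain->count;
}

/* Per-device trigger: only callbacks bound to `data` run. Returns how many ran. */
static size_t chain_call_one(struct notifier_chain *chain, void *data)
{
	size_t i, called = 0;

	assert(chain != NULL);
	for (i = 0; i < chain->count; i++) {
		if (chain->entries[i].data == data) {
			chain->entries[i].cb(data);
			called++;
		}
	}
	return called;
}

/* Example handler: counts how often it ran for its device. */
static int count_hits(void *data)
{
	int *hits = data;

	(*hits)++;
	return 0;
}
```

With three handlers registered, chain_call_one() for one GPU's counter leaves the other two counters untouched, while chain_call_all() increments all three. That is the side effect the per-device debugfs trigger is meant to avoid.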


For my printk panic tests I trigger unknown NMIs while booting with
"unknown_nmi_panic". Particularly with Qemu this is quite easy and
amazingly effective at catching problems. In fact, a recent printk
series [0] fixed seven issues that were found through this method of
panic testing.

Thanks for this tip. I used to test with "echo c > /proc/sysrq-trigger" in the guest, but that's more permissive. I'm now testing with "virsh inject-nmi", and drm_panic is still working.

The second question is how to simulate a panic context in a
non-destructive way, so we can test the panic notifiers in CI, without
crashing the machine.

I'm wondering if a "fake panic" can be implemented that quiesces all the
other CPUs via NMI (similar to kdb) and then calls the panic
notifiers. And finally releases everything back to normal. That might
produce a fairly realistic panic situation and should be fairly
non-destructive (depending on what the notifiers do and how long they
take).

The worst case for a panic notifier is when the panic occurs in NMI
context, but I don't know how to simulate that. The goal would be to
find early if a panic notifier tries to sleep, or do other things that
are not allowed in a panic context.
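
The idea of catching a notifier that tries to sleep can be sketched in userspace as follows. This is only a model, loosely patterned after the kernel's might_sleep() annotation: fake_panic_active, model_might_sleep(), run_in_fake_panic() and the two example notifiers are hypothetical names for illustration, and nothing here simulates NMI context or stopped CPUs.

```c
#include <stdbool.h>

/* Userspace model only; none of these names are kernel APIs. */
static bool fake_panic_active;		/* true while a notifier runs "in panic" */
static unsigned int sleep_violations;	/* blocking attempts seen in that state */

/* Stand-in for a might_sleep()-style check placed in blocking code paths. */
static void model_might_sleep(void)
{
	if (fake_panic_active)
		sleep_violations++;
}

typedef void (*model_notifier_fn)(void);

/*
 * Run one notifier with the fake panic flag set and return the number
 * of violations it caused.
 */
static unsigned int run_in_fake_panic(model_notifier_fn fn)
{
	unsigned int before = sleep_violations;

	fake_panic_active = true;
	fn();
	fake_panic_active = false;
	return sleep_violations - before;
}

/* A notifier that does only non-blocking work. */
static void well_behaved_notifier(void)
{
}

/* A notifier that reaches a code path which may block. */
static void sleeping_notifier(void)
{
	model_might_sleep();
}
```

Here run_in_fake_panic(well_behaved_notifier) reports no violations, while run_in_fake_panic(sleeping_notifier) reports one. A CI job built on the same principle could fail whenever the count is nonzero.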

Maybe with a new boot argument "unknown_nmi_fake_panic" that triggers
the fake panic instead?

John Ogness

[0] https://lore.kernel.org/lkml/20240207134103.1357162-1-john.ogness@xxxxxxxxxxxxx


Best regards,

--

Jocelyn



