Re: [RFC] How to test panic handlers, without crashing the kernel

Jocelyn Falempe <jfalempe@xxxxxxxxxx> · Tue, 5 Mar 2024 17:52:40 +0100

On 05/03/2024 17:23, Michael Kelley wrote:
From: Guilherme G. Piccoli <gpiccoli@xxxxxxxxxx> Sent: Monday, March 4, 2024 1:43 PM

On 04/03/2024 18:12, John Ogness wrote:
[...]
The second question is how to simulate a panic context in a
non-destructive way, so we can test the panic notifiers in CI, without
crashing the machine.

I'm wondering if a "fake panic" can be implemented that quiesces all the
other CPUs via NMI (similar to kdb) and then calls the panic
notifiers. And finally releases everything back to normal. That might
produce a fairly realistic panic situation and should be fairly
non-destructive (depending on what the notifiers do and how long they
take).

Hi Jocelyn / John,

one concern here is that the panic notifiers are kind of a no man's
land, so we can have very simple / safe ones, while others are
destructive in nature.

An example of a well-behaved notifier that is destructive is the
Hyper-V one, which destroys an essential host-guest interface (called
"vmbus connection"). What happens if we trigger this one just for
testing purposes through a debugfs interface? Likely the guest would die...

[+CCing Michael Kelley here since he seems interested in panic and is
also an expert in Hyper-V, just in case my example is bogus.]

The Hyper-V example is valid. After hv_panic_vmbus_unload()
is called, the VM won't be able to do any disk, network, or graphics
frame buffer I/O. There's no recovery short of restarting the VM.

Thanks for the confirmation.

Michael

[I have retired from Microsoft.  I'm still occasionally contributing
to Linux kernel work with email mhklinux@xxxxxxxxxxx.]

So maybe the problem could be split in two: the non-notifier portion of
the panic path, and the notifiers. Restricting which notifiers you run
may be a way to circumvent the risks; for example, being able to pass a
list of the specific notifiers you aim to test could be interesting.
Let's see what the others think, and thanks for your work on
the DRM panic notifier =)

Or maybe have two lists of panic notifiers, a safe list and a destructive
list, so that a fake panic calls only the safe notifiers.
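
To make the safe/destructive split a bit more concrete, here is a rough
userspace sketch in C. It is only a model of the idea, not kernel code:
struct notifier, notifier_register(), run_panic_notifiers() and the two
example callbacks are all hypothetical names invented for illustration,
and the real implementation would presumably build on the existing
notifier chain infrastructure instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct notifier {
	const char *name;
	void (*call)(void);
	bool registered;
	struct notifier *next;
};

struct notifier_list {
	struct notifier *head;
};

/* Two separate lists: one safe to run in a fake panic, one not. */
struct notifier_list safe_list;
struct notifier_list destructive_list;

/* Push a notifier onto a list; registering it twice is a bug. */
void notifier_register(struct notifier_list *list, struct notifier *n)
{
	assert(!n->registered);
	n->registered = true;
	n->next = list->head;
	list->head = n;
}

/* Call every notifier on one list and return how many were called. */
int notifier_call_list(const struct notifier_list *list)
{
	int called = 0;

	for (struct notifier *n = list->head; n; n = n->next) {
		n->call();
		called++;
	}
	return called;
}

/*
 * A real panic runs both lists. A fake panic runs only the safe list,
 * so destructive notifiers are never reached.
 */
int run_panic_notifiers(bool fake_panic)
{
	int called = notifier_call_list(&safe_list);

	if (!fake_panic)
		called += notifier_call_list(&destructive_list);
	return called;
}

/* Stand-ins for a harmless notifier and a destructive one. */
int screen_draw_calls;
int vmbus_unload_calls;

void fake_screen_draw(void)  { screen_draw_calls++; }
void fake_vmbus_unload(void) { vmbus_unload_calls++; }

struct notifier screen_nb = { .name = "screen-draw",  .call = fake_screen_draw };
struct notifier vmbus_nb  = { .name = "vmbus-unload", .call = fake_vmbus_unload };

void register_example_notifiers(void)
{
	notifier_register(&safe_list, &screen_nb);
	notifier_register(&destructive_list, &vmbus_nb);
}
```

After register_example_notifiers(), calling run_panic_notifiers(true)
only reaches the screen-draw stand-in, while run_panic_notifiers(false)
reaches both. The open question in such a scheme is who decides which
list a notifier belongs to; in this sketch it is simply the caller of
notifier_register().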

Cheers,

Guilherme