Hi,

I've run into an issue where crash will enter an infinite loop while decoding exception stacks if those stacks get corrupted. We've seen this on four different systems where the hardware generated multiple NMIs, and the second and subsequent NMIs caused the NMI exception stack to be overwritten.

When this condition is hit, the bottom rsp on the NMI exception stack (which would normally point you back to the kernel thread stack or possibly a different exception stack) points you back into the middle of the same NMI exception stack. This causes crash to loop infinitely when it tries to decode that exception stack.

Now, clearly the root cause of the issue is faulty hardware that generated multiple NMIs. However, a very small change in crash can detect this issue and stop the infinite loop from happening, thereby allowing you to get to a point in crash where you can actually tell that it was an NMI that caused the system to dump.

The patch is attached to this email. For x86_64, it will detect the condition of any exception stack whose saved rsp points back into that same stack.

Please feel free to ask me any questions on this.

Thanks,
-Lucas
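
P.S. For anyone reading without the attachment, the check boils down to roughly the following. This is only a minimal sketch of the idea, not the patch itself: the `struct estack` type and both function names are made up for illustration and are not crash's internals. It assumes each exception stack can be described as a half-open address range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of one exception stack: addresses [base, top). */
struct estack {
    uint64_t base;  /* lowest address belonging to the stack */
    uint64_t top;   /* first address past the end of the stack */
};

/* Return true if addr lies inside the given exception stack. */
static bool estack_contains(const struct estack *es, uint64_t addr)
{
    assert(es->base < es->top);  /* sanity check on the range itself */
    return addr >= es->base && addr < es->top;
}

/*
 * saved_rsp is the rsp recovered from the exception frame of the stack
 * currently being decoded. Normally it leads to the kernel thread stack
 * or to a different exception stack. If it lands inside the stack we are
 * already decoding, following it would revisit the same stack forever,
 * so the caller should treat the stack as corrupt and stop.
 */
static bool estack_rsp_loops(const struct estack *cur, uint64_t saved_rsp)
{
    return estack_contains(cur, saved_rsp);
}
```

A backtrace loop would call something like `estack_rsp_loops()` before switching stacks, and report the corruption instead of continuing when it returns true.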
Attachment:
crash-5.0.5-estack_loop.patch
-- Crash-utility mailing list Crash-utility@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/crash-utility