Hi,
On 2023/08/24 19:29, Laurence Oberman wrote:
On Mon, 2023-06-12 at 11:40 -0700, Bart Van Assche wrote:
On 6/9/23 00:29, Jaco Kroon wrote:
I'm attaching dmesg -T and ps axf. dmesg in particular may provide
clues as it provides a number of stack traces indicating stalling
at IO time.
Once this has triggered, even commands such as "lvs" go into
uninterruptible wait. I unfortunately didn't test "dmsetup ls" this
time and triggered a reboot already (system needs to be up).
To me the call traces suggest that an I/O request got stuck.
Unfortunately call traces are not sufficient to identify the root cause
in case I/O gets stuck. Has debugfs been mounted? If so, how about
dumping the contents of /sys/kernel/debug/block/ into a tar file after
the lockup has been reproduced and sharing that information?
tar -czf- -C /sys/kernel/debug/block . >block.tgz
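A minimal sketch of an alternative capture, in case the tar archive turns
out to contain empty files (pseudo-filesystem files often report a size of
zero, and some tar versions may then store no content). It reads every
file instead and prints "path:content" lines. The function name
dump_block_debugfs and the optional directory argument are hypothetical,
added only for illustration; the default path is the one quoted above.

```shell
#!/bin/sh
# Hypothetical helper: dump every readable file under the block debugfs
# tree as "path:content" lines. The directory argument is optional.
dump_block_debugfs() {
    dir="${1:-/sys/kernel/debug/block}"
    if [ ! -d "$dir" ]; then
        # debugfs is usually mounted with: mount -t debugfs none /sys/kernel/debug
        echo "missing $dir; is debugfs mounted?" >&2
        return 1
    fi
    # grep -r reads each file and prefixes every line with its path;
    # unreadable entries are skipped silently.
    (cd "$dir" && grep -r . . 2>/dev/null) || true
}
```

After the lockup has been reproduced, something like
"dump_block_debugfs > block-debugfs.txt" (as root) would produce a text
file that can be shared.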
Thanks,
Bart.
One I am aware of is this
commit 106397376c0369fcc01c58dd189ff925a2724a57
Author: David Jeffery <djeffery@xxxxxxxxxx>
Can we try to get a vmcore (assuming it's not a secure site)?
Certainly. Obviously on any host handling any kind of sensitive data
there is a likelihood that sensitive data may be present in the vmcore.
Even so, I'm more than happy to create a vmcore. I'm assuming this will
create a kernel version of a core dump ... with 256GB of RAM (most of
which goes towards disk caches) I'm further assuming this file can
potentially be large. Where will this get stored should the capture be
made? (I need to ensure that the filesystem has sufficient storage
available.)
Add these to /etc/sysctl.conf
kernel.panic_on_io_nmi = 1
kernel.panic_on_unrecovered_nmi = 1
kernel.unknown_nmi_panic = 1
Run sysctl -p
Ensure kdump is running and can capture a vmcore
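The steps above can be double-checked with a small sketch like the
following. It assumes the three sysctls are exposed under
/proc/sys/kernel and that the kernel reports a loaded crash kernel via
/sys/kernel/kexec_crash_loaded. The function name check_nmi_kdump and its
two optional path arguments are hypothetical, present only so the check
can be pointed at other locations.

```shell
#!/bin/sh
# Hypothetical helper: report the NMI panic sysctls and whether a crash
# (kdump) kernel is loaded. Returns non-zero if any value is not 1.
check_nmi_kdump() {
    sysctl_dir="${1:-/proc/sys/kernel}"
    crash_flag="${2:-/sys/kernel/kexec_crash_loaded}"
    rc=0
    for key in panic_on_io_nmi panic_on_unrecovered_nmi unknown_nmi_panic; do
        val=$(cat "$sysctl_dir/$key" 2>/dev/null || echo missing)
        echo "kernel.$key = $val"
        [ "$val" = "1" ] || rc=1
    done
    # 1 here means a crash kernel has been loaded via kexec
    loaded=$(cat "$crash_flag" 2>/dev/null || echo missing)
    echo "kexec_crash_loaded = $loaded"
    [ "$loaded" = "1" ] || rc=1
    return "$rc"
}
```

Running "check_nmi_kdump" after "sysctl -p" should print four lines ending
in "= 1" if everything is in place.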
Done. Had to enable a few extra kernel options to get all the other
requirements, so scheduled a reboot to activate the new kernel. This
will happen on Saturday morning very early.
When it locks up again
send an NMI via the SuperMicro Web Management interface
Is it possible to send this via sysrq from the keyboard? Otherwise I'll
just need to set up the RMI. It would be easier to do this from the
keyboard if possible, although that isn't always an option if it's left
too late.
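For reference, a minimal sketch of the sysrq route, assuming magic SysRq
support is built into the kernel. From the keyboard the crash request is
Alt+SysRq+c, which is subject to the kernel.sysrq setting; writing "c" to
/proc/sysrq-trigger as root is not restricted by that setting. The
function name sysrq_crash and its optional path arguments are
hypothetical. Calling it with the default paths crashes the machine
immediately, so it only makes sense once kdump is known to be armed.

```shell
#!/bin/sh
# Hypothetical helper: show the current kernel.sysrq value, then ask the
# kernel to crash itself so that kdump can capture a vmcore.
# WARNING: with the default paths this crashes the system immediately.
sysrq_crash() {
    enable_file="${1:-/proc/sys/kernel/sysrq}"
    trigger_file="${2:-/proc/sysrq-trigger}"
    # This value only governs keyboard-initiated SysRq combinations
    mask=$(cat "$enable_file" 2>/dev/null || echo unknown)
    echo "kernel.sysrq = $mask" >&2
    # 'c' is the SysRq command for a deliberate crash
    printf 'c' > "$trigger_file"
}
```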
Share the vmcore, or we can have you capture some specifics from it to
triage.
I'd prefer you let me know what you need ... security concerns and all
... frankly, I highly doubt there is any data that is really so
sensitive that it can be classified as "top secret" but we do have NDAs
in place prohibiting me from sharing anything that may potentially
contain customer related data ...
Kind regards,
Jaco