On Thu, 2023-08-24 at 22:01 +0200, Jaco Kroon wrote:
> Hi,
>
> On 2023/08/24 19:29, Laurence Oberman wrote:
>
> > On Mon, 2023-06-12 at 11:40 -0700, Bart Van Assche wrote:
> > > On 6/9/23 00:29, Jaco Kroon wrote:
> > > > I'm attaching dmesg -T and ps axf. dmesg in particular may provide
> > > > clues as it provides a number of stack traces indicating stalling
> > > > at IO time.
> > > >
> > > > Once this has triggered, even commands such as "lvs" goes into
> > > > uninterruptable wait, I unfortunately didn't test "dmsetup ls" now
> > > > and triggered a reboot already (system needs to be up).
> > > To me the call traces suggest that an I/O request got stuck.
> > > Unfortunately call traces are not sufficient to identify the root
> > > cause in case I/O gets stuck. Has debugfs been mounted? If so, how
> > > about dumping the contents of /sys/kernel/debug/block/ into a tar
> > > file after the lockup has been reproduced and sharing that
> > > information?
> > >
> > > tar -czf- -C /sys/kernel/debug/block . >block.tgz
> > >
> > > Thanks,
> > >
> > > Bart.
> > >
> > One I am aware of is this
> > commit 106397376c0369fcc01c58dd189ff925a2724a57
> > Author: David Jeffery <djeffery@xxxxxxxxxx>
> >
> > Can we try get a vmcore (assuming its not a secure site)
>
> Certainly. Obviously on any host handling any kind of sensitive data
> there is a likelihood that sensitive data may be present in the vmcore,
> as such I more than happy to create a vmcore, I'm assuming this will
> create a kernel version of a core dump ... with 256GB of RAM (most of
> which goes towards disk caches) I'm further assuming this file can be
> potentially large. Where will this get stored should the capture be
> made?
> (I need to ensure that the filesystem has sufficient storage
> available)
>
> >
> > Add these to /etc/sysctl.conf
> >
> > kernel.panic_on_io_nmi = 1
> > kernel.panic_on_unrecovered_nmi = 1
> > kernel.unknown_nmi_panic = 1
> >
> > Run sysctl -p
> > Ensure kdump is running and can capture a vmcore
> Done. Had to enable a few extra kernel options to get all the other
> requirements, so scheduled a reboot to activate the new kernel. This
> will happen on Saturday morning very early.
>
> > When it locks up again
> > send an NMI via the SuperMicro Web Management interface
>
> Possible to send from sysrq at the keyboard? Otherwise I'll just need
> to set up the RMI, will just be easier to do this from the keyboard if
> possible, it's not always if it's left too late.
>
> >
> > Share the vmcore, or we can have you capture some specifics from it
> > to triage.
>
> I'd prefer you let me know what you need ... security concerns and all
> ... frankly, I highly doubt there is any data that is really so
> sensitive that it can be classified as "top secret" but we do have NDAs
> in place prohibiting me from sharing anything that may potentially
> contain customer related data ...
>
> Kind regards,
> Jaco
>

Hello,

This would usually need an NMI sent from a management interface,
because with the system locked up there is no guarantee that a sysrq-c
from the keyboard will get through. You could try, though.

As long as you have the following in /etc/kdump.conf:

path /var/crash
core_collector makedumpfile -l --message-level 7 -d 31

the vmcore will be written under /var/crash. It will contain
kernel-only pages, so it should not be very big.

I could work with you privately to get what we need out of the vmcore,
and we would avoid transferring it.

Thanks
Laurence
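
P.S. A rough pre-flight sketch, in case it helps to confirm the settings above before the next lockup. It only reads the config files. The function names missing_nmi_sysctls and kdump_directive are made up for illustration and are not part of kdump or any distribution tooling; it also assumes the sysctls live in a single file such as /etc/sysctl.conf rather than in drop-in files.

```shell
#!/bin/sh
# Sketch only: read-only checks for the settings discussed in this thread.
# Both function names are hypothetical helpers, not existing tools.

# Print each of the three NMI panic sysctls that is NOT set to 1 in file $1.
missing_nmi_sysctls() {
    for key in kernel.panic_on_io_nmi \
               kernel.panic_on_unrecovered_nmi \
               kernel.unknown_nmi_panic; do
        # Split on '=', strip blanks, and compare key and value exactly.
        awk -F= -v k="$key" '
            { gsub(/[ \t]/, "", $1); gsub(/[ \t]/, "", $2)
              if ($1 == k && $2 == "1") found = 1 }
            END { exit !found }' "$1" || echo "$key"
    done
}

# Print the value of directive $2 (e.g. "path") from kdump.conf-style file $1.
# Commented-out lines do not match because their first field starts with '#'.
kdump_directive() {
    awk -v k="$2" '$1 == k { $1 = ""; sub(/^[ \t]+/, ""); print; exit }' "$1"
}

# Example use against the live files (paths as given in this thread):
#   missing_nmi_sysctls /etc/sysctl.conf        # no output means all three are set
#   kdump_directive /etc/kdump.conf path        # expected here: /var/crash
#   df -Pk "$(kdump_directive /etc/kdump.conf path)"   # free space for the vmcore
```

If missing_nmi_sysctls prints nothing and kdump_directive returns the path and core_collector lines shown above, the file-level part of the setup matches what was described. It does not prove that the kdump service is running or that a crash kernel is loaded.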