Re: LVM kernel lockup scenario during lvcreate

Laurence Oberman <loberman@xxxxxxxxxx> · Thu, 24 Aug 2023 16:19:51 -0400

On Thu, 2023-08-24 at 22:01 +0200, Jaco Kroon wrote:
> Hi,
> 
> On 2023/08/24 19:29, Laurence Oberman wrote:
> 
> > On Mon, 2023-06-12 at 11:40 -0700, Bart Van Assche wrote:
> > > On 6/9/23 00:29, Jaco Kroon wrote:
> > > > I'm attaching dmesg -T and ps axf.  dmesg in particular may provide
> > > > clues as it provides a number of stack traces indicating stalling at
> > > > IO time.
> > > > 
> > > > Once this has triggered, even commands such as "lvs" go into
> > > > uninterruptible wait. I unfortunately didn't test "dmsetup ls" now
> > > > and triggered a reboot already (system needs to be up).
> > > To me the call traces suggest that an I/O request got stuck.
> > > Unfortunately call traces are not sufficient to identify the root cause
> > > in case I/O gets stuck. Has debugfs been mounted? If so, how about
> > > dumping the contents of /sys/kernel/debug/block/ into a tar file after
> > > the lockup has been reproduced and sharing that information?
> > > 
> > > tar -czf- -C /sys/kernel/debug/block . >block.tgz
> > > 
> > > Thanks,
> > > 
> > > Bart.
> > > 
> > One I am aware of is this
> > commit 106397376c0369fcc01c58dd189ff925a2724a57
> > Author: David Jeffery <djeffery@xxxxxxxxxx>
> > 
> > Can we try to get a vmcore (assuming it's not a secure site)?
> 
> Certainly.  Obviously on any host handling any kind of sensitive data
> there is a likelihood that sensitive data may be present in the vmcore.
> That said, I'm more than happy to create a vmcore. I'm assuming this will
> create a kernel version of a core dump ... with 256GB of RAM (most of
> which goes towards disk caches) I'm further assuming this file can
> potentially be large.  Where will this get stored should the capture be
> made?  (I need to ensure that the filesystem has sufficient storage
> available.)
> 
> > 
> > Add these to /etc/sysctl.conf
> > 
> > kernel.panic_on_io_nmi = 1
> > kernel.panic_on_unrecovered_nmi = 1
> > kernel.unknown_nmi_panic = 1
> > 
> > Run sysctl -p
> > Ensure kdump is running and can capture a vmcore
> Done.  Had to enable a few extra kernel options to get all the other 
> requirements, so scheduled a reboot to activate the new kernel. This 
> will happen on Saturday morning very early.
> > 
> > When it locks up again
> > send an NMI via the SuperMicro Web Management interface
> 
> Is it possible to send it via sysrq from the keyboard?  Otherwise I'll
> just need to set up the RMI. It would be easier to do this from the
> keyboard if possible, though it's not always possible if it's left too
> late.
> 
> > 
> > Share the vmcore, or we can have you capture some specifics from it to
> > triage.
> 
> I'd prefer you let me know what you need ... security concerns and all
> ... frankly, I highly doubt there is any data that is really so
> sensitive that it can be classified as "top secret" but we do have NDAs
> in place prohibiting me from sharing anything that may potentially
> contain customer related data ...
> 
> Kind regards,
> Jaco
> 

Hello, this would usually need an NMI sent from a management interface,
because once the system is locked up there is no guarantee that a sysrq-c
from the keyboard will get through.
You could try it, though.
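
If you want to check beforehand that a panic would actually land in kdump,
something along these lines may help. It is only a rough sketch: the
function name crash_kernel_loaded and its optional path argument are made
up for illustration, and it assumes your kernel exposes
/sys/kernel/kexec_crash_loaded. It only reads state; the line that would
really crash the box is left as a comment.

```shell
#!/bin/sh
# Read-only pre-flight check: nothing here panics the machine.

# Succeeds if a crash (kdump) kernel is loaded, i.e. the file contains "1".
# The path argument is optional and exists only to make the check testable.
crash_kernel_loaded() {
    f=${1:-/sys/kernel/kexec_crash_loaded}
    [ -r "$f" ] && [ "$(cat "$f")" = "1" ]
}

if crash_kernel_loaded; then
    echo "crash kernel loaded"
else
    echo "crash kernel NOT loaded; a panic would not produce a vmcore"
fi

# To force the panic from a shell that still responds (this crashes the box):
#   echo c > /proc/sysrq-trigger
```

If the shell is already hung, that commented-out write will not be
reachable either, which is why the NMI route is the safer bet.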

As long as you have the following in /etc/kdump.conf:

path /var/crash
core_collector makedumpfile -l --message-level 7 -d 31

This will capture kernel pages only, so the vmcore should not be very big.
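
For reference, the -d value is a bitmask of page types that makedumpfile
leaves out, which is why 31 ends up with kernel pages only. A small sketch
that decodes it follows. The function name decode_dump_level is just for
illustration, and the bit meanings are as I recall them from the
makedumpfile man page, so please check them against your version.

```shell
#!/bin/sh
# Print the page types a given makedumpfile dump level (-d) excludes.
# Bit meanings assumed from the makedumpfile man page:
#   1 zero pages, 2 non-private cache, 4 private cache,
#   8 user data, 16 free pages
decode_dump_level() {
    level=$1
    if [ $((level & 1)) -ne 0 ];  then echo "zero pages"; fi
    if [ $((level & 2)) -ne 0 ];  then echo "non-private cache pages"; fi
    if [ $((level & 4)) -ne 0 ];  then echo "private cache pages"; fi
    if [ $((level & 8)) -ne 0 ];  then echo "user data pages"; fi
    if [ $((level & 16)) -ne 0 ]; then echo "free pages"; fi
}

# 31 = 1 + 2 + 4 + 8 + 16, so every excludable page type is dropped.
decode_dump_level 31
```

With most of the 256GB going to disk cache, excluding the cache and free
pages is what should keep the file small; -l additionally compresses it
with LZO.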

I could work with you privately to get what we need out of the vmcore
and we would avoid transferring it.

Thanks
Laurence