Hi Laurence,
One fix I am aware of is this:
commit 106397376c0369fcc01c58dd189ff925a2724a57
Author: David Jeffery <djeffery@xxxxxxxxxx>
I should have held off on replying until I finished looking into this.
This looks very interesting indeed. That said, this is my first serious
venture into the block layers of the kernel :), so the essay below is
more for my own understanding than anything else. I would love to get a
better grasp of the underlying principles here, and your feedback on my
understanding thereof would be much appreciated.
If I understand this correctly (and that's a BIG IF), then it's possible
that a bunch of IO requests go into a wait queue for whatever reason
(pending some other event?). It's then possible that some of them
should get woken up, and previously (prior to the above commit) it could
happen that only a single request got woken up, after which that request
could go straight back onto the wait queue. With the patch, isn't it
still possible that all the woken-up requests just go straight back onto
the wait queue (albeit less likely)?
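
To check my own understanding I tried to reduce this to a userspace
analogy. This is emphatically not the kernel's sbitmap code, just a toy
model with a condition variable standing in for the wait queue and a
counter standing in for the tags:

/* Toy model of the wake-one vs. wake-batch problem; NOT kernel code. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define WAITERS 4

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int free_tags;      /* "tags" available to hand out */
static int wake_batch = 1; /* 1 = wake enough waiters for the whole batch */

static void *waiter(void *arg)
{
    long id = (long)arg;

    pthread_mutex_lock(&lock);
    while (free_tags == 0)             /* woken waiters re-check and may sleep again */
        pthread_cond_wait(&cond, &lock);
    free_tags--;                       /* claim one tag */
    pthread_mutex_unlock(&lock);

    printf("waiter %ld got a tag\n", id);
    return NULL;
}

static void release_batch(int batch)
{
    pthread_mutex_lock(&lock);
    free_tags += batch;                /* a whole batch completes at once */
    if (wake_batch)
        pthread_cond_broadcast(&cond); /* wake enough waiters to drain the batch */
    else
        pthread_cond_signal(&cond);    /* one wakeup for the whole batch: the
                                          others sleep on despite free tags */
    pthread_mutex_unlock(&lock);
}

int main(void)
{
    pthread_t t[WAITERS];

    for (long i = 0; i < WAITERS; i++)
        pthread_create(&t[i], NULL, waiter, (void *)i);

    sleep(1);                          /* let all waiters block first */
    release_batch(WAITERS);            /* free a batch of WAITERS tags */

    for (int i = 0; i < WAITERS; i++)
        pthread_join(t[i], NULL);      /* hangs here if wake_batch == 0 */
    return 0;
}

Built with gcc -pthread. With wake_batch set to 0 the program deadlocks
in pthread_join: one waiter claims one tag and the other three sleep
forever despite three tags being free, which is roughly the shape of the
hang I imagine the commit addresses. Does that analogy hold, or am I off
base?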
Could the creation of a snapshot (which, based on my understanding of
what a snapshot is, should block writes whilst the snapshot is being
created, i.e. make them go onto the wait queue) mean that the process of
setting up the snapshot (which itself involves writes) could then block
because of this? I.e. the write request that needs to get into the next
batch to allow other writes to proceed gets blocked?
And as I write that it stops making sense to me, because most likely the
IO for creating a snapshot would only result in blocking writes to the
LV, not to the underlying PVs which contain the metadata for the VG
being updated.
But still ... if we think about this, the probability of that "bug"
hitting would increase as the number of outstanding IO requests
increases? With iostat reporting r_await values upwards of 100 ms and
w_await values periodically going up to 5000 ms (both generally in the
20-50 ms range for the last few minutes that I've been watching them),
it would make sense to me that the number of requests blocked in-kernel
could be substantial, so it seems entirely plausible that it could be
related to this. On the other hand, IIRC iostat -dmx 1 usually showed
minimal, if any, activity in either r_await or w_await during the
lockups.
Consider the AHCI controller on the other hand, where we've got 7200 RPM
SATA drives which are slow to begin with. Now we've also got traditional
snapshots, which cause an IO bottleneck of their own and artificially
raise IO demand (much more so than thin snapshots; I really wish I could
figure out the migration process to convert this whole host to thin
pools, but lvconvert scares me something crazy). Having that first
snapshot already bottlenecks IO: ignoring the associated metadata
updates, every write to a not-yet-duplicated segment becomes a read +
write + write, to clone the written-to segment into the snapshot, where
thin pools need just a read + write for the same. So IO is already more
demanding, and now we try to create another snapshot.
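To put rough numbers on it (assuming I have the COW mechanics right):
with two classic snapshots active, the first write to any given chunk
costs a read of the original chunk plus a copy into each snapshot's
exception store plus the actual write, i.e. 1 read + 3 writes, so 4
physical IOs for 1 logical write before counting the exception-table
metadata updates, and it gets worse with every additional snapshot. On
slow spinning SATA that amplification alone could plausibly keep the
queues saturated.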
What if some IO fails to finish (due to continually being put back into
the wait queue), thus blocking the process of creating the snapshot to
begin with?
I know there are numerous other people using snapshots, but I've often
wondered how many use them quite as heavily as we do on this specific
host. Given the massive amount of virtual machine infrastructure out
there, on the one hand I think there must be quite a lot, but then I
also suspect many of them use "enterprise" storage (for whatever your
definition of that is) or something like Ceph, so not based on LVM. And
more and more of it is either SSD/flash or even NVMe, where the faster
response times would also lower the risk of IO-related problems showing
themselves.
The risk window seems to be during snapshot creation, so IO not making
progress there makes sense.
I've back-ported the referenced patch onto 6.4.12 now, which will go
live Saturday morning. Perhaps we'll be sorted now. I will also revert
to mq-deadline, which has been shown to trigger this more regularly, so
let's see.
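(For the record, the switch itself is just a sysfs write, i.e. echoing
mq-deadline into /sys/block/<dev>/queue/scheduler for each device, with
<dev> as a placeholder; reading the same file back shows the available
schedulers with the active one in brackets.)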
Hello, this would usually need an NMI sent from a management interface,
as with the machine locked up there is no guarantee a sysrq-c will get
through from the keyboard.
You could try, though.
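(If a shell is still responsive, echo c > /proc/sysrq-trigger as root
should also force the crash; it's the keyboard combination that is gated
by the kernel.sysrq sysctl. In a hard lockup neither may get through,
hence the NMI route.)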
As long as you have the following in /etc/kdump.conf:
path /var/crash
core_collector makedumpfile -l --message-level 7 -d 31
This will capture kernel pages only and will not be very big.
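(-d 31 is the makedumpfile dump level that filters out zero, page-cache,
user-process and free pages, and -l compresses the rest with LZO, hence
the small size.)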
I could work with you privately to get what we need out of the vmcore
and we would avoid transferring it.
Thanks. This helps. Let's get a core first (if it's going to happen
again) and then take it from there.
Kind regards,
Jaco