On Fri, 2023-08-25 at 01:40 +0200, Jaco Kroon wrote:
> Hi Laurence,
>
> > One I am aware of is this
> >
> > commit 106397376c0369fcc01c58dd189ff925a2724a57
> > Author: David Jeffery <djeffery@xxxxxxxxxx>
>
> I should have held off on replying until I finished looking into this.
> This looks very interesting indeed. That said, this is my first
> serious venture into the block layers of the kernel :), so the essay
> below is more for my own understanding than anything else. It would be
> great to get a better understanding of the underlying principles here,
> and your feedback on my understanding thereof would be much
> appreciated.
>
> If I understand this correctly (and that's a BIG IF), then it's
> possible that a bunch of IO requests goes into a wait queue for
> whatever reason (pending some other event?). It's then possible that
> some of them should get woken up, and previously (prior to the above)
> it could happen that only a single request got woken up, and that
> request would then go straight back to the wait queue. With the patch,
> isn't it still possible that all the woken-up requests could just go
> straight back to the wait queue (albeit less likely)?
>
> Could the creation of a snapshot (which, based on my understanding of
> what a snapshot is, should block writes whilst the snapshot is being
> created, i.e. make them go to the wait queue) itself then potentially
> block due to this, given that setting up the snapshot also involves
> writes? I.e. the write request that needs to get into the next batch
> to allow other writes to proceed gets blocked?
>
> And as I write that it stops making sense to me, because most likely
> the IO for creating a snapshot would only result in blocking writes to
> the LV, not to the underlying PVs which contain the metadata for the
> VG being updated.
>
> But still ... if we think about this, the probability of that "bug"
> hitting would increase as the number of outstanding IO requests
> increases? With iostat reporting r_await values upwards of 100 and
> w_await values periodically going up to 5000 (both generally in the
> 20-50 range for the last few minutes that I've been watching them), it
> would make sense to me that the number of requests blocking in-kernel
> could be much higher than that, so it makes perfect sense to me that
> this could be related. On the other hand, IIRC iostat -dmx 1 usually
> showed only minimal if any requests in either [rw]_await during
> lockups.
>
> Consider the AHCI controller on the other hand, where we've got 7200
> RPM SATA drives which are slow to begin with. Now we've got
> traditional snapshots, which are also causing an IO bottleneck and
> artificially raising IO demand (much more so than thin snaps; I really
> wish I could figure out the migration process to convert this whole
> host to thin pools, but lvconvert scares me something crazy). So that
> first snapshot already causes an IO bottleneck (ignoring relevant
> metadata updates, every write to a not-yet-duplicated segment becomes
> a read + write + write to clone the written-to segment to the
> snapshot; thin pools need just a read + write for the same), so IO is
> already more demanding, and now we try to create another snapshot.
>
> What if some IO fails to finish (due to continually being put back
> into the wait queue), thus blocking the process of creating the
> snapshot to begin with?
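One way to check whether requests really are stuck in-kernel during a
lockup, rather than merely completing slowly, is to watch the
per-device in-flight counters directly. A minimal sketch, assuming
standard sysfs paths and that sd* covers the affected disks:

    # /sys/block/<dev>/inflight holds two counters: requests in
    # flight for reads, then for writes.
    while sleep 1; do
        for d in /sys/block/sd*; do
            printf '%s inflight: %s\n' "${d##*/}" "$(cat "$d"/inflight)"
        done
    done

Sustained non-zero counters that never drain during a lockup would
point at requests parked on a wait queue rather than at slow media.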
> I know there are numerous other people using snapshots, but I've
> often wondered how many use them quite as heavily as we do on this
> specific host? Given the massive amount of virtual machine
> infrastructure, on the one hand I think there must be quite a lot, but
> then I also think many of them use "enterprise" (for whatever your
> definition of that is) storage or something like ceph, so not based on
> LVM. And more and more use either SSD/flash or even NVMe, which given
> the faster response times would also lower the risk of IO-related
> problems showing themselves.
>
> The risk seems to be during the creation of snapshots, so IO not
> making progress makes sense.
>
> I've back-ported the referenced patch onto 6.4.12 now, which will go
> live Saturday morning. Perhaps we'll be sorted now. I will also revert
> to mq-deadline, which has been shown to trigger this more regularly,
> so let's see.
>
> > Hello, this would usually need an NMI sent from a management
> > interface, as with it locked up there is no guarantee a sysrq c will
> > get there from the keyboard. You could try though.
> >
> > As long as you have in /etc/kdump.conf
> >
> > path /var/crash
> > core_collector makedumpfile -l --message-level 7 -d 31
> >
> > This will get kernel-only pages and would not be very big.
> >
> > I could work with you privately to get what we need out of the
> > vmcore and we would avoid transferring it.
>
> Thanks. This helps. Let's get a core first (if it's going to happen
> again) and then take it from there.
>
> Kind regards,
> Jaco

Hello Jaco

These hangs usually require the stacks to see where and why we are
blocked. The vmcore will definitely help in that regard.

Regards
Laurence
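As a footnote on the kdump side: once /etc/kdump.conf is set up as
above, the whole capture path can be exercised deliberately before the
next real hang. A minimal sketch using the standard sysrq interface
(note this panics the machine on purpose, so only in a maintenance
window):

    # Enable all sysrq functions, then force a crash; after the
    # reboot the vmcore should appear under the configured path
    # (/var/crash here).
    echo 1 > /proc/sys/kernel/sysrq
    echo c > /proc/sysrq-trigger

If the box is too wedged for sysrq from the keyboard, the same trigger
can come in over a serial console, or an NMI can be raised from the
management interface as mentioned above.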