Re: LVM kernel lockup scenario during lvcreate

Hi Laurence,
One I am aware of is this
commit 106397376c0369fcc01c58dd189ff925a2724a57
Author: David Jeffery <djeffery@xxxxxxxxxx>

I should have held off on replying until I finished looking into this.  This looks very interesting indeed.  That said, this is my first serious venture into the block layer of the kernel :), so the essay below is more for my own understanding than anything else.  It would be great to get a better grasp of the underlying principles here, and your feedback on my understanding of them would be much appreciated.

If I understand this correctly (and that's a BIG IF), it's possible for a bunch of IO requests to go onto a wait queue for whatever reason (pending some other event?).  It's then possible that some of them should get woken up, and previously (prior to the above commit) it could happen that only a single request was woken, and that request could go straight back onto the wait queue ... with the patch, isn't it still possible that all of the woken requests just go straight back onto the wait queue (albeit less likely)?
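
To get my head around this I knocked up a tiny userspace toy that just contrasts waking a single waiter against waking a batch when a few "tags" have been freed.  All names and numbers are made up for illustration; this is not the real sbitmap/blk-mq code, just my mental model of it:

/*
 * Toy model of my (possibly wrong) reading of the wake-up issue:
 * blocked requests park on a wait queue, a completion frees a few
 * "tags", and the question is whether we wake one waiter or a batch.
 */
#include <stdio.h>

#define PARKED 8   /* requests sitting on the wait queue */
#define FREED  4   /* tags released by one completion    */

/* Wake up to 'nr' waiters; each retries and either grabs a tag or
 * goes straight back onto the wait queue. */
static void wake_and_retry(int nr, const char *label)
{
    int tags = FREED, parked = PARKED, progressed = 0;

    for (int i = 0; i < nr && parked > 0; i++) {
        if (tags > 0) {
            tags--;            /* this waiter made progress */
            parked--;
            progressed++;
        }
        /* else: back onto the wait queue it goes */
    }
    printf("%s: %d progressed, %d tags left idle, %d still parked\n",
           label, progressed, tags, parked);
}

int main(void)
{
    wake_and_retry(1,     "wake one  ");  /* old behaviour, as I read it */
    wake_and_retry(FREED, "wake batch");  /* behaviour with the patch    */
    return 0;
}

In the wake-one case the remaining freed tags sit idle while the other requests stay parked until something else completes, which is roughly the kind of stall I'm imagining above.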

Could the creation of a snapshot be affected by this?  Based on my understanding, creating a snapshot blocks writes whilst the snapshot is being set up, i.e. sends them to the wait queue, yet the setup process itself involves writes.  Could the write request that needs to get into the next batch, in order to allow other writes to proceed, be the one that ends up blocked?

And as I write that it stops making sense to me, because most likely the IO involved in creating a snapshot would only block writes to the LV, not to the underlying PVs which contain the metadata for the VG being updated.

But still ... if we think about this, the probability of that "bug" hitting would increase as the number of outstanding IO requests increases, right?  With iostat reporting r_await values upwards of 100 and w_await values periodically going up to 5000 (both generally in the 20-50 range for the last few minutes that I've been watching them), it would make sense to me that the number of requests blocked in-kernel could be much higher than that, so it makes perfect sense to me that this could be related.  On the other hand, IIRC iostat -dmx 1 usually showed only minimal, if any, requests in either [rw]_await during lockups.

Consider the AHCI controller on the other hand, where we've got 7200 RPM SATA drives which are slow to begin with.  Add traditional snapshots, which cause an IO bottleneck of their own and artificially raise IO demand (much more so than thin snapshots; I really wish I could figure out the migration process to convert this whole host to thin pools, but lvconvert scares me something crazy).  With that first snapshot in place, and ignoring the relevant metadata updates, every write to a not-yet-duplicated segment becomes a read + write + write in order to clone the written-to segment into the snapshot (thin pools need just a read + write for the same), so IO is already more demanding, and now we try to create another snapshot.
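
To put rough numbers on that, here's a back-of-the-envelope calculation.  The per-write op counts (3 ops for a classic snapshot, 4 with a second one, 2 for thin) are my assumptions for illustration, the write count is arbitrary, and metadata updates are ignored as above:

/*
 * Rough arithmetic for the write amplification described above.
 * Op counts per write are assumptions, not measurements.
 */
#include <stdio.h>

int main(void)
{
    long first_writes = 10000; /* writes landing on not-yet-copied chunks */

    /* classic snapshot: read origin chunk, write it to the COW area,
     * then write the new data to the origin = 3 ops per write        */
    long classic_one_snap = first_writes * 3;

    /* assuming each classic snapshot keeps its own copy of the chunk,
     * a second snapshot adds another COW write = 4 ops per write      */
    long classic_two_snaps = first_writes * 4;

    /* thin snapshot, as I understand it: read the shared chunk and
     * write the new copy into the pool = 2 ops per write             */
    long thin = first_writes * 2;

    printf("classic, 1 snapshot : %ld ops\n", classic_one_snap);
    printf("classic, 2 snapshots: %ld ops\n", classic_two_snaps);
    printf("thin                : %ld ops\n", thin);
    return 0;
}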

What if some IO fails to finish (because it keeps being put back onto the wait queue), thus blocking the creation of the snapshot in the first place?

I know there are numerous other people using snapshots, but I've often wondered how many use them quite as heavily as we do on this specific host.  Given the massive amount of virtual machine infrastructure out there, on the one hand I think there must be quite a lot, but then I also think many of them use "enterprise" storage (whatever your definition of that is) or something like ceph, so not based on LVM.  And more and more of it is SSD/flash or even NVMe, which, given the faster response times, would also lower the risk of IO-related problems showing themselves.

The risk seems to be during the creation of snapshots, so IO not making progress makes sense.

I've back-ported the referenced patch onto 6.4.12 now, which will go live Saturday morning.  Perhaps we'll be sorted now.  I will also revert to mq-deadline, which has been shown to trigger this more regularly, so let's see.
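
(For the record, the scheduler switch itself is just a write to the queue/scheduler attribute in sysfs; normally I'd echo from a shell, but spelled out in C it's the equivalent of the below, where "sda" is just a placeholder for whichever device is in use.)

#include <stdio.h>

int main(void)
{
    /* "sda" is a placeholder; pick the device actually in use */
    FILE *f = fopen("/sys/block/sda/queue/scheduler", "w");

    if (!f) {
        perror("open scheduler attribute");
        return 1;
    }
    /* writing the scheduler name switches the queue over */
    fprintf(f, "mq-deadline\n");
    if (fclose(f) != 0) {
        perror("write scheduler");
        return 1;
    }
    return 0;
}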


Hello, this would usually need an NMI sent from a management interface,
as with the host locked up there is no guarantee a sysrq-c will get through
from the keyboard.
You could try, though.

As long as you have the following in /etc/kdump.conf:

path /var/crash
core_collector makedumpfile -l --message-level 7 -d 31

This will capture kernel-only pages, so the dump would not be very big.

I could work with you privately to get what we need out of the vmcore
and we would avoid transferring it.
Thanks.  This helps.  Let's get a core first (if it's going to happen again) and then take it from there.
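
Out of interest I also decoded the -d 31 above for myself.  If I'm reading the makedumpfile(8) man page right, the dump level is a bitmask of page types to exclude, which is why the result is essentially kernel-only pages:

/*
 * Quick decode of the "-d 31" dump level above.  Bit meanings are from
 * my reading of the makedumpfile(8) man page, so treat this as my notes.
 */
#include <stdio.h>

int main(void)
{
    const struct { int bit; const char *excludes; } levels[] = {
        { 1,  "zero pages" },
        { 2,  "non-private cache pages" },
        { 4,  "private cache pages" },
        { 8,  "user process data pages" },
        { 16, "free pages" },
    };
    int dump_level = 31;

    for (unsigned i = 0; i < sizeof(levels) / sizeof(levels[0]); i++)
        if (dump_level & levels[i].bit)
            printf("excludes %s\n", levels[i].excludes);
    return 0;
}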

Kind regards,
Jaco


