On Fri, 2023-08-25 at 01:40 +0200, Jaco Kroon wrote:
> Hi Laurence,
>
> > One I am aware of is this
> >
> > commit 106397376c0369fcc01c58dd189ff925a2724a57
> > Author: David Jeffery <djeffery@xxxxxxxxxx>
>
> I should have held off on replying until I finished looking into this.
> This looks very interesting indeed. That said, this is my first
> serious venture into the block layers of the kernel :), so the essay
> below is more for my own understanding than anything else. It would be
> great to get a better understanding of the underlying principles here,
> and your feedback on my understanding thereof would be much
> appreciated.
>
> If I understand this correctly (and that's a BIG IF), then it's
> possible that a bunch of IO requests goes into a wait queue for
> whatever reason (pending some other event?). It's then possible that
> some of them should get woken up, and previously (prior to the above)
> it could happen that only a single request got woken up, and that
> request would then go straight back to the wait queue. With the patch,
> isn't it still possible that all the woken-up requests could just go
> straight back to the wait queue (albeit less likely)?
>
> Could the creation of a snapshot (which, based on my understanding of
> what a snapshot is, should block writes whilst the snapshot is being
> created, i.e. make them go to the wait queue) itself then potentially
> block due to this, given that setting up the snapshot also involves
> writes? I.e. the write request that needs to get into the next batch
> to allow other writes to proceed gets blocked?
>
> And as I write that it stops making sense to me, because most likely
> the IO for creating a snapshot would only result in blocking writes to
> the LV, not to the underlying PVs which contain the metadata for the
> VG being updated.
>
> But still ... if we think about this, the probability of that "bug"
> hitting would increase as the number of outstanding IO requests
> increases? With iostat reporting r_await values upwards of 100 and
> w_await values periodically going up to 5000 (both generally in the
> 20-50 range for the last few minutes that I've been watching them), it
> would make sense to me that the number of requests blocking in-kernel
> could be much higher than that, so it makes perfect sense to me that
> this could be related. On the other hand, IIRC iostat -dmx 1 usually
> showed only minimal if any requests in either [rw]_await during
> lockups.
>
> Consider the AHCI controller on the other hand, where we've got 7200
> RPM SATA drives which are slow to begin with. Now we've got
> traditional snapshots, which are also causing an IO bottleneck and
> artificially raising IO demand (much more so than thin snaps; I really
> wish I could figure out the migration process to convert this whole
> host to thin pools, but lvconvert scares me something crazy). So that
> first snapshot already causes an IO bottleneck (ignoring relevant
> metadata updates, every write to a not-yet-duplicated segment becomes
> a read + write + write to clone the written-to segment to the
> snapshot; thin pools need just a read + write for the same), so IO is
> already more demanding, and now we try to create another snapshot.
>
> What if some IO fails to finish (due to continually being put back
> into the wait queue), thus blocking the process of creating the
> snapshot to begin with?
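One way to check whether requests really are stuck in-kernel during a
lockup, rather than merely completing slowly, is to watch the
per-device in-flight counters directly. A minimal sketch, assuming
standard sysfs paths and that sd* covers the affected disks:

    # /sys/block/<dev>/inflight holds two counters: requests in
    # flight for reads, then for writes.
    while sleep 1; do
        for d in /sys/block/sd*; do
            printf '%s inflight: %s\n' "${d##*/}" "$(cat "$d"/inflight)"
        done
    done

Sustained non-zero counters that never drain during a lockup would
point at requests parked on a wait queue rather than at slow media.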
> I know there are numerous other people using snapshots, but I've
> often wondered how many use them quite as heavily as we do on this
> specific host? Given the massive amount of virtual machine
> infrastructure, on the one hand I think there must be quite a lot, but
> then I also think many of them use "enterprise" (for whatever your
> definition of that is) storage or something like ceph, so not based on
> LVM. And more and more use either SSD/flash or even NVMe, which given
> the faster response times would also lower the risk of IO-related
> problems showing themselves.
>
> The risk seems to be during the creation of snapshots, so IO not
> making progress makes sense.
>
> I've back-ported the referenced patch onto 6.4.12 now, which will go
> live Saturday morning. Perhaps we'll be sorted now. I will also revert
> to mq-deadline, which has been shown to trigger this more regularly,
> so let's see.
>
> > Hello, this would usually need an NMI sent from a management
> > interface, as with it locked up there is no guarantee a sysrq c will
> > get there from the keyboard. You could try though.
> >
> > As long as you have in /etc/kdump.conf
> >
> > path /var/crash
> > core_collector makedumpfile -l --message-level 7 -d 31
> >
> > This will get kernel-only pages and would not be very big.
> >
> > I could work with you privately to get what we need out of the
> > vmcore and we would avoid transferring it.
>
> Thanks. This helps. Let's get a core first (if it's going to happen
> again) and then take it from there.
>
> Kind regards,
> Jaco

Hello Jaco

These hangs usually require the stacks to see where and why we are
blocked. The vmcore will definitely help in that regard.

Regards
Laurence
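As a footnote on the kdump side: once /etc/kdump.conf is set up as
above, the whole capture path can be exercised deliberately before the
next real hang. A minimal sketch using the standard sysrq interface
(note this panics the machine on purpose, so only in a maintenance
window):

    # Enable all sysrq functions, then force a crash; after the
    # reboot the vmcore should appear under the configured path
    # (/var/crash here).
    echo 1 > /proc/sys/kernel/sysrq
    echo c > /proc/sysrq-trigger

If the box is too wedged for sysrq from the keyboard, the same trigger
can come in over a serial console, or an NMI can be raised from the
management interface as mentioned above.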