Hi,
On 2023/08/24 19:13, Bart Van Assche wrote:
> On 8/24/23 00:29, Jaco Kroon wrote:
>> We're definitely seeing the same thing on another host using an ahci
>> controller. This seems to hint that it's not a firmware issue, as
>> does the fact that this happens much less frequently with the none
>> scheduler.
>
> That is unexpected. I don't think there is enough data available yet to
> conclude whether these issues are identical or not?
It's hard for me to conclude that even two consecutive crashes are
exactly the same issue ... however, there's a strong correlation in that
there are generally lvcreate commands in D state, which to me hints that
it has something to do with LVM snapshot creation (both traditional
snapshots on the AHCI controller host, and thin snapshots on the Super
Micro host).
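
(Aside: for anyone wanting to reproduce the D-state observation above, a
minimal sketch that just walks /proc would be something like the
following - assuming root and a kernel that exposes /proc/<pid>/stack;
"echo w > /proc/sysrq-trigger" gets the kernel to dump the same
blocked-task information to dmesg.)

#!/usr/bin/env python3
# Illustrative sketch: list tasks stuck in uninterruptible sleep (D state)
# and print their kernel stacks from /proc/<pid>/stack.  Needs root.
import os

def d_state_tasks():
    for pid in filter(str.isdigit, os.listdir('/proc')):
        try:
            with open(f'/proc/{pid}/stat') as f:
                # state is the first field after the ")" that closes comm
                state = f.read().rsplit(')', 1)[1].split()[0]
            if state == 'D':
                with open(f'/proc/{pid}/comm') as f:
                    yield int(pid), f.read().strip()
        except OSError:
            continue  # task exited while we were looking at it

for pid, comm in d_state_tasks():
    print(f'{pid} {comm}')
    try:
        with open(f'/proc/{pid}/stack') as f:
            print(f.read())
    except OSError:
        pass

# Alternatively: "echo w > /proc/sysrq-trigger" dumps all blocked tasks
# (including their stacks) to the kernel log.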
>> I will make a plan to action the firmware updates on the raid
>> controller over the weekend regardless, just in order to eliminate
>> that. I will then revert to mq-deadline. Assuming this does NOT fix
>> it, how would I go about assessing if this is a controller firmware
>> issue or a Linux kernel issue?
>
> If the root cause would be an issue in the mq-deadline scheduler or in
> the core block layer then there would be many more reports about I/O
> lockups. For this case I think that it's very likely that the root
> cause is either the I/O controller driver or the I/O controller firmware.
I tend to agree with that. Given that we probably have in excess of 50
hosts and it generally seems to be just these two hosts that run into
this ... I agree with your assessment. Except that at least the AHCI
host never *used* to do this and only fairly recently started exhibiting
this behaviour.
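
(For clarity, when I talk about the none vs mq-deadline scheduler, the
runtime toggle is the per-device sysfs attribute; a minimal sketch, with
the device name being a placeholder:)

#!/usr/bin/env python3
# Illustrative only: print the current I/O scheduler for a block device
# and switch it via sysfs.  Run as root; "sda" is a placeholder.
import sys

dev = sys.argv[1] if len(sys.argv) > 1 else 'sda'
target = sys.argv[2] if len(sys.argv) > 2 else 'none'
path = f'/sys/block/{dev}/queue/scheduler'

with open(path) as f:
    print('before:', f.read().strip())  # e.g. "[mq-deadline] kyber bfq none"

with open(path, 'w') as f:
    f.write(target)                     # e.g. "none" or "mq-deadline"

with open(path) as f:
    print('after: ', f.read().strip())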
So here's what I personally *think* makes these two hosts unique:
1. The AHCI controller host was unfortunately set up ~15 years back
with "thick" volumes and uses traditional snapshots (the hardware has
been replaced piecemeal over the years, so none of the original hardware
is still in use). It started exhibiting this behaviour when, for reasons
I can't go into, we started making multiple snapshots of the same origin
LV simultaneously - this is unfortunate, as thin snaps would be way more
performant during the few hours where these two snaps are required.
2. The LSI controller on the SM host uses a thin pool of 125TB which
contains 27 "origins", 26 of which follow this pattern on a daily basis
(a rough sketch of the sequence follows after this list):
2.1 Create a thin snap of ${name} as fsck_${name}.
2.2 Run fsck on the snapshot to ensure consistency. If this fails, bail
out and report an error to the management systems.
2.3 If save_${name} exists, remove it.
2.4 Rename fsck_${name} to save_${name}.
3. IO on the SM host often goes in excess of 1GB/s and "idles" around
400MB/s, which I'm sure in the bigger scheme of things isn't really that
heavy a load, but considering most of our other hosts barely peak at
150MB/s and generally don't do more than 10MB/s, it's significant for
us. Right now, as I'm typing this, we're doing between 1500 and 3000
reads/s (I saw it peak just over 6000 now) and 500-1000 writes/s
(peaking just over 3000). I'm well aware there are systems with much
higher IOPS figures - even a few years back I saw statistics on systems
doing 10k+ IOPS - but for us this is fairly high.
4. The majority of our hosts with RAID controllers use megaraid; I can't
think of any other hosts off the top of my head also using mpt3sas, but
we do have a number with AHCI. This again supports the theory that it's
the firmware on the controller, so I'll be sure to do that update on
Saturday morning too when I've got a reboot slot. Hopefully that'll just
make the problem go away.
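
To make point 2 above concrete, the daily rotation is roughly the sketch
below. It is deliberately simplified: the volume group name and origin
LV name are placeholders, the activation and fsck invocations are
illustrative rather than our exact scripts, and reporting to the
management systems is reduced to a raised exception.

#!/usr/bin/env python3
# Simplified sketch of the daily rotation in point 2 (names are
# placeholders; error reporting is reduced to an exception).
import subprocess

VG = 'vg0'  # placeholder volume group name

def run(*cmd):
    subprocess.run(cmd, check=True)

def lv_exists(lv):
    return subprocess.run(['lvs', f'{VG}/{lv}'], stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL).returncode == 0

def rotate(name):
    fsck_lv, save_lv = f'fsck_{name}', f'save_{name}'
    # 2.1 create a thin snapshot of the origin
    run('lvcreate', '-s', '-n', fsck_lv, f'{VG}/{name}')
    # thin snapshots carry the activation-skip flag by default, hence -K
    run('lvchange', '-ay', '-K', f'{VG}/{fsck_lv}')
    # 2.2 check the snapshot; check=True raises if fsck reports a problem
    run('fsck', '-n', f'/dev/{VG}/{fsck_lv}')
    # 2.3 drop the previous save_${name} if it exists
    if lv_exists(save_lv):
        run('lvremove', '-y', f'{VG}/{save_lv}')
    # 2.4 keep the freshly checked snapshot as save_${name}
    run('lvrename', VG, fsck_lv, save_lv)

if __name__ == '__main__':
    rotate('example')  # placeholder origin LV name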
Thanks for all the help with this, really appreciated. I know we seem to
be going in circles, but I believe we are making progress, even if
slowly; at a minimum I'm learning quite a bit, which in and of itself
puts us in a better position to figure this out. I do think that it
could be the controller, but as I've stated before, we've been seeing
issues with snapshot creation for many years now; killing dmeventd
sorted that out except on these two hosts. And they are special in that
they create multiple snapshots of the same origin. Perhaps that's the
clue, since frankly that's the one thing they share, and the one thing
that makes them distinct from the other hosts we run.
Kind regards,
Jaco