Hi,
On 2023/08/24 19:13, Bart Van Assche wrote:
> On 8/24/23 00:29, Jaco Kroon wrote:
>> We're definitely seeing the same thing on another host using an ahci
>> controller. This seems to hint that it's not a firmware issue, as
>> does the fact that this happens much less frequently with the none
>> scheduler.
>
> That is unexpected. I don't think there is enough data available yet to
> conclude whether these issues are identical or not?
It's hard for me to conclude that even two consecutive crashes are
exactly the same issue ... however, there's a strong correlation in that
there are generally lvcreate commands in D state, which to me hints that
it has something to do with LVM snapshot creation (both traditional
snapshots on the AHCI controller host, and thin snapshots on the Super
Micro host).
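
(Aside: for anyone wanting to reproduce the D-state observation above, a
minimal sketch that just walks /proc would be something like the
following - assuming root and a kernel that exposes /proc/<pid>/stack;
"echo w > /proc/sysrq-trigger" gets the kernel to dump the same
blocked-task information to dmesg.)

#!/usr/bin/env python3
# Illustrative sketch: list tasks stuck in uninterruptible sleep (D state)
# and print their kernel stacks from /proc/<pid>/stack.  Needs root.
import os

def d_state_tasks():
    for pid in filter(str.isdigit, os.listdir('/proc')):
        try:
            with open(f'/proc/{pid}/stat') as f:
                # state is the first field after the ")" that closes comm
                state = f.read().rsplit(')', 1)[1].split()[0]
            if state == 'D':
                with open(f'/proc/{pid}/comm') as f:
                    yield int(pid), f.read().strip()
        except OSError:
            continue  # task exited while we were looking at it

for pid, comm in d_state_tasks():
    print(f'{pid} {comm}')
    try:
        with open(f'/proc/{pid}/stack') as f:
            print(f.read())
    except OSError:
        pass

# Alternatively: "echo w > /proc/sysrq-trigger" dumps all blocked tasks
# (including their stacks) to the kernel log.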
>> I will make a plan to action the firmware updates on the raid
>> controller over the weekend regardless, just in order to eliminate
>> that. I will then revert to mq-deadline. Assuming this does NOT fix
>> it, how would I go about assessing if this is a controller firmware
>> issue or a Linux kernel issue?
>
> If the root cause would be an issue in the mq-deadline scheduler or in
> the core block layer then there would be many more reports about I/O
> lockups. For this case I think that it's very likely that the root
> cause is either the I/O controller driver or the I/O controller firmware.
I tend to agree with that. Given that we probably have in excess of 50
hosts and it generally seems to be just these two hosts that run into
this ... I agree with your assessment. Except that at least the AHCI
host never *used* to do this and only fairly recently started exhibiting
this behaviour.
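
(For clarity, when I talk about the none vs mq-deadline scheduler, the
runtime toggle is the per-device sysfs attribute; a minimal sketch, with
the device name being a placeholder:)

#!/usr/bin/env python3
# Illustrative only: print the current I/O scheduler for a block device
# and switch it via sysfs.  Run as root; "sda" is a placeholder.
import sys

dev = sys.argv[1] if len(sys.argv) > 1 else 'sda'
target = sys.argv[2] if len(sys.argv) > 2 else 'none'
path = f'/sys/block/{dev}/queue/scheduler'

with open(path) as f:
    print('before:', f.read().strip())  # e.g. "[mq-deadline] kyber bfq none"

with open(path, 'w') as f:
    f.write(target)                     # e.g. "none" or "mq-deadline"

with open(path) as f:
    print('after: ', f.read().strip())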
So here's what I personally *think* makes these two hosts unique:
1. The AHCI controller host was unfortunately set up ~15 years back
with "thick" volumes and uses traditional snapshots (the hardware has
been replaced piecemeal over the years, so none of the original hardware
is still in use). It started exhibiting this behaviour when, for reasons
I can't go into, we started making multiple snapshots of the same origin
LV simultaneously - this is unfortunate, as thin snaps would be way more
performant during the few hours where these two snaps are required.
2. The LSI controller on the SM host uses a thin pool of 125TB which
contains 27 "origins", 26 of which follow this pattern on a daily basis
(a rough sketch of the sequence follows after this list):
2.1 Create a thin snap of ${name} as fsck_${name}.
2.2 Run fsck on the snapshot to ensure consistency. If this fails, bail
out and report an error to the management systems.
2.3 If save_${name} exists, remove it.
2.4 Rename fsck_${name} to save_${name}.
3. IO on the SM host often goes in excess of 1GB/s and "idles" around
400MB/s, which I'm sure in the bigger scheme of things isn't really that
heavy a load, but considering most of our other hosts barely peak at
150MB/s and generally don't do more than 10MB/s, it's significant for
us. Right now, as I'm typing this, we're doing between 1500 and 3000
reads/s (I saw it peak just over 6000 now) and 500-1000 writes/s
(peaking just over 3000). I'm well aware there are systems with much
higher IOPS figures - even a few years back I saw statistics on systems
doing 10k+ IOPS - but for us this is fairly high.
4. The majority of our hosts with RAID controllers use megaraid; I can't
think of any other hosts off the top of my head also using mpt3sas, but
we do have a number with AHCI. This again supports the theory that it's
the firmware on the controller, so I'll be sure to do that update on
Saturday morning too when I've got a reboot slot. Hopefully that'll just
make the problem go away.
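
To make point 2 above concrete, the daily rotation is roughly the sketch
below. It is deliberately simplified: the volume group name and origin
LV name are placeholders, the activation and fsck invocations are
illustrative rather than our exact scripts, and reporting to the
management systems is reduced to a raised exception.

#!/usr/bin/env python3
# Simplified sketch of the daily rotation in point 2 (names are
# placeholders; error reporting is reduced to an exception).
import subprocess

VG = 'vg0'  # placeholder volume group name

def run(*cmd):
    subprocess.run(cmd, check=True)

def lv_exists(lv):
    return subprocess.run(['lvs', f'{VG}/{lv}'], stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL).returncode == 0

def rotate(name):
    fsck_lv, save_lv = f'fsck_{name}', f'save_{name}'
    # 2.1 create a thin snapshot of the origin
    run('lvcreate', '-s', '-n', fsck_lv, f'{VG}/{name}')
    # thin snapshots carry the activation-skip flag by default, hence -K
    run('lvchange', '-ay', '-K', f'{VG}/{fsck_lv}')
    # 2.2 check the snapshot; check=True raises if fsck reports a problem
    run('fsck', '-n', f'/dev/{VG}/{fsck_lv}')
    # 2.3 drop the previous save_${name} if it exists
    if lv_exists(save_lv):
        run('lvremove', '-y', f'{VG}/{save_lv}')
    # 2.4 keep the freshly checked snapshot as save_${name}
    run('lvrename', VG, fsck_lv, save_lv)

if __name__ == '__main__':
    rotate('example')  # placeholder origin LV name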
Thanks for all the help with this, really appreciated. I know we seem to
be going in circles, but I believe we are making progress, even if
slowly; at a minimum I'm learning quite a bit, which in and of itself
puts us in a better position to figure this out. I do think that it
could be the controller, but as I've stated before, we've been seeing
issues with snapshot creation for many years now; killing dmeventd
sorted that out except on these two hosts. And they are special in that
they create multiple snapshots of the same origin. Perhaps that's the
clue, since frankly that's the one thing they share, and the one thing
that makes them distinct from the other hosts we run.
Kind regards,
Jaco