Re: LVM kernel lockup scenario during lvcreate

Hi Bart,


Just a follow-up on this.


It seems that even with the "none" scheduler we have now had an occurrence of this.  Unfortunately I could not get to the host quickly enough to confirm ongoing IO, although based on the activity LEDs there were disks with IO.  I believe the disk controller drives these LEDs, but I'm not sure of the pattern used to switch them on and off, and this could vary from controller to controller (i.e., do they go off only once the host has confirmed receipt of the data, or as soon as the data has been sent to the host?).  This does seem to support your theory of a controller firmware issue.
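For next time, a quicker way to confirm whether the kernel still has IO outstanding, without trusting the LEDs, might be to read the per-device "inflight" counters from sysfs.  A minimal sketch; the `show_inflight` helper name and the overridable base directory are my own additions for illustration:

```shell
#!/bin/sh
# Print per-device in-flight request counts from sysfs.
# Each "inflight" file holds two numbers: reads and writes
# currently outstanding for that device.
# The base directory is parameterised so the helper can be
# exercised off-box; it defaults to the real sysfs path.
show_inflight() {
    base="${1:-/sys/block}"
    for dev in "$base"/*/; do
        [ -r "${dev}inflight" ] || continue
        printf '%s %s\n' "$(basename "$dev")" "$(cat "${dev}inflight")"
    done
}

show_inflight
```

Non-zero counters that never drain while the host is wedged would point at requests stuck below the block layer.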


It definitely happens more often with mq-deadline compared to none.


We're definitely seeing the same thing on another host using an AHCI controller.  This seems to hint that it's not a firmware issue, as does the fact that it happens much less frequently with the none scheduler.


I will plan to apply the firmware updates on the RAID controller over the weekend regardless, just to eliminate that.  I will then revert to mq-deadline.  Assuming this does NOT fix it, how would I go about determining whether this is a controller firmware issue or a Linux kernel issue?
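One common way to split the two, when the hang recurs, is to capture the kernel stacks of the blocked tasks: stacks parked in the block layer or HBA driver waiting for a completion that never arrives tend to point at the controller/firmware, while stacks stuck inside dm/LVM code on kernel locks point at the kernel.  A rough sketch using the standard sysrq and procfs interfaces (needs root on the affected host; the `dump_blocked_tasks` wrapper is mine):

```shell
#!/bin/sh
# Dump kernel stacks of uninterruptible (D-state) tasks via
# sysrq 'w', then show the tail of the kernel log where the
# traces land.  Falls back to a message when not root or when
# /proc/sysrq-trigger is not writable (e.g. in a container).
dump_blocked_tasks() {
    if [ "$(id -u)" -eq 0 ] && [ -w /proc/sysrq-trigger ]; then
        echo 1 > /proc/sys/kernel/sysrq   # ensure sysrq is enabled
        echo w > /proc/sysrq-trigger      # log blocked-task stacks
        dmesg | tail -n 200
    else
        echo "need root with a writable /proc/sysrq-trigger"
    fi
}

dump_blocked_tasks
# For a single stuck process its current kernel stack can also be
# read directly, e.g.:  cat /proc/<pid-of-lvcreate>/stack
```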


Come to think of it, this may or may not be related: we switched off dmeventd long ago, since running dmeventd causes this to happen on all hosts the moment any form of snapshot is involved.  With dmeventd combined with "heavy" use of the lv commands we could pretty much guarantee some level of lockup within a couple of days.


Kind regards,
Jaco


On 2023/07/13 17:07, Jaco Kroon wrote:

Hi Bart,


I'm not familiar at all with fio, so I'm hoping this was OK.


On 2023/07/12 15:43, Bart Van Assche wrote:
On 7/12/23 03:12, Jaco Kroon wrote:
Ideas/Suggestions?

How about manually increasing the workload, e.g. by using fio to randomly read 4 KiB fragments with a high queue depth?

Bart.


[global]
kb_base=1024
unit_base=8
loops=10000
runtime=7200
time_based=1
directory=/home/fio
nrfiles=1
size=4194304
iodepth=256
ioengine=io_uring
numjobs=512
create_fsync=1

[reader]



crowsnest [17:01:35] ~ # fio --alloc-size=$(( 32 * 1024 )) fio.ini

Load average went up to 1200+, IO was consistently 1 GB/s read throughput, and IOPS were anywhere between 100k and 500k, mostly around the 150k region.


Guessing the next step would be to restore mq-deadline as the scheduler and re-run?
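For switching the scheduler back without a reboot, the per-device sysfs knob should be enough (the setting is per block device and not persistent across reboots).  A sketch; the `set_scheduler` helper and its overridable base path are mine, for illustration:

```shell
#!/bin/sh
# Set the IO scheduler for one block device via sysfs and echo
# back the resulting setting.  On a live host the scheduler file
# lists all available schedulers with the active one in brackets,
# e.g. "[mq-deadline] none".
set_scheduler() {
    dev="$1"; sched="$2"; base="${3:-/sys/block}"
    echo "$sched" > "$base/$dev/queue/scheduler"
    cat "$base/$dev/queue/scheduler"
}

# on the affected host (as root), e.g.:
#   set_scheduler sda mq-deadline
```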


I neglected to capture the output, unfortunately; I will use --output on the next run if needed.  I can definitely initiate another run around 6:00 am GMT.


Kind Regards,
Jaco



