Hi Konstantin,

This sounds a lot like something that I encountered with md, back in
2019, on the old vdo-devel mailing list:

https://listman.redhat.com/archives/vdo-devel/2019-August/000171.html

Basically, I had a RAID-5 md array that was in the process of recovery:

$ cat /proc/mdstat
Personalities : [raid0] [raid6] [raid5] [raid4]
md0 : active raid5 sde[4] sdd[2] sdc[1] sdb[0]
      2929890816 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
      [=>...................] recovery = 9.1% (89227836/976630272) finish=85.1min speed=173727K/sec
      bitmap: 0/8 pages [0KB], 65536KB chunk

Note that the speed of the recovery is 173,727 KB/sec, which is less
than the sync_speed_max value:

$ grep . /sys/block/md0/md/sync_speed*
/sys/block/md0/md/sync_speed:171052
/sys/block/md0/md/sync_speed_max:200000 (system)
/sys/block/md0/md/sync_speed_min:1000 (system)

...And when I decreased "sync_speed_max" to "65536", I stopped seeing
hung task timeouts.

There's a similar setting in dm-raid: the "--maxrecoveryrate" option of
lvchange. So, to set the maximum recovery rate to 64 MiB per second per
device, this would be the command, for an example VG/LV of
"p_r5/testdmraid5":

# lvchange --maxrecoveryrate 64M p_r5/testdmraid5

(Older hard disk drives may not have a sequential read / write speed of
more than 100 MiB/sec; this meant that md's default of 200 MiB/sec was
"too fast", and would result in the recovery I/O starving the VDO
volume, leaving it unable to service I/O.)

The current value of max_recovery_rate for dm-raid can be displayed
with "lvs -a -o +raid_max_recovery_rate".

If you reduce the maximum recovery rate for the dm-raid RAID-5 logical
volume, do the hung task timeouts for "dm-vdo0-bioQ*" stop appearing,
and does the fio job continue writing?

Thanks,

Bryan

On Wed, Jul 31, 2024 at 10:21 AM Konstantin Kharlamov <Hi-Angel@xxxxxxxxx> wrote:
>
> CC'ing VDO maintainers, because the problem is only reproducible with
> VDO, so potentially they might have some ideas.
>
> On Mon, 2024-07-22 at 20:56 +0300, Konstantin Kharlamov wrote:
> > Hi, sorry for the delay, I had to give away the nodes and we had a
> > week of teambuilding and company party, so for the past week I only
> > managed to hack away stripping debug symbols, get another node and
> > set it up.
> >
> > Experiments below are based off of vanilla 6.9.8 kernel *without*
> > your patch.
> >
> > On Mon, 2024-07-15 at 09:56 +0800, Yu Kuai wrote:
> > > Line number will be helpful.
> >
> > So, after tinkering with building scripts I managed to build modules
> > with debug symbols (not the kernel itself but should be good enough),
> > but for some reason kernel doesn't show line numbers in stacktraces.
> > No idea what could be causing it, so I had to decode line numbers
> > manually, below is an output where I inserted line numbers for
> > raid456 manually after decoding them with `gdb`.
> >
> > […]
> > [ 1677.293366] <TASK>
> > [ 1677.293661] ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > [ 1677.293972] ? _raw_spin_unlock_irq+0x10/0x30
> > [ 1677.294276] ? _raw_spin_unlock_irq+0xa/0x30
> > [ 1677.294586] raid5d at drivers/md/raid5.c:6572
> > [ 1677.294910] md_thread+0xc1/0x170
> > [ 1677.295228] ? __pfx_autoremove_wake_function+0x10/0x10
> > [ 1677.295545] ? __pfx_md_thread+0x10/0x10
> > [ 1677.295870] kthread+0xff/0x130
> > [ 1677.296189] ? __pfx_kthread+0x10/0x10
> > [ 1677.296498] ret_from_fork+0x30/0x50
> > [ 1677.296810] ? __pfx_kthread+0x10/0x10
> > [ 1677.297112] ret_from_fork_asm+0x1a/0x30
> > [ 1677.297424] </TASK>
> > […]
> > [ 1705.296253] <TASK>
> > [ 1705.296554] ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > [ 1705.296864] ? _raw_spin_unlock_irq+0x10/0x30
> > [ 1705.297172] ? _raw_spin_unlock_irq+0xa/0x30
> > [ 1677.294586] raid5d at drivers/md/raid5.c:6597
> > [ 1705.297794] md_thread+0xc1/0x170
> > [ 1705.298099] ? __pfx_autoremove_wake_function+0x10/0x10
> > [ 1705.298409] ? __pfx_md_thread+0x10/0x10
> > [ 1705.298714] kthread+0xff/0x130
> > [ 1705.299022] ? __pfx_kthread+0x10/0x10
> > [ 1705.299333] ret_from_fork+0x30/0x50
> > [ 1705.299641] ? __pfx_kthread+0x10/0x10
> > [ 1705.299947] ret_from_fork_asm+0x1a/0x30
> > [ 1705.300257] </TASK>
> > […]
> > [ 1733.296255] <TASK>
> > [ 1733.296556] ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > [ 1733.296862] ? _raw_spin_unlock_irq+0x10/0x30
> > [ 1733.297170] ? _raw_spin_unlock_irq+0xa/0x30
> > [ 1677.294586] raid5d at drivers/md/raid5.c:6572
> > [ 1733.297792] md_thread+0xc1/0x170
> > [ 1733.298096] ? __pfx_autoremove_wake_function+0x10/0x10
> > [ 1733.298403] ? __pfx_md_thread+0x10/0x10
> > [ 1733.298711] kthread+0xff/0x130
> > [ 1733.299018] ? __pfx_kthread+0x10/0x10
> > [ 1733.299330] ret_from_fork+0x30/0x50
> > [ 1733.299637] ? __pfx_kthread+0x10/0x10
> > [ 1733.299943] ret_from_fork_asm+0x1a/0x30
> > [ 1733.300251] </TASK>
> >
> > > Meanwhile, can you check if the underlying
> > > disks has IO while raid5 stuck, by /sys/block/[device]/inflight.
> >
> > The two devices that are left after the 3rd one is removed has these
> > numbers that don't change with time:
> >
> > [Mon Jul 22 20:18:06 @ ~]:> for d in dm-19 dm-17; do echo -n $d; cat /sys/block/$d/inflight; done
> > dm-19 9 1
> > dm-17 11 2
> > [Mon Jul 22 20:18:11 @ ~]:> for d in dm-19 dm-17; do echo -n $d; cat /sys/block/$d/inflight; done
> > dm-19 9 1
> > dm-17 11 2
> >
> > They also don't change after I return the disk back (which is to be
> > expected I guess, given that the lockup doesn't go away).
> >
> > > > > At first, can the problem reporduce with raid1/raid10? If not,
> > > > > this is probably a raid5 bug.
> > > >
> > > > This is not reproducible with raid1 (i.e. no lockups for raid1),
> > > > I tested that. I didn't test raid10, if you want I can try (but
> > > > probably only after the weekend, because today I was asked to
> > > > give the nodes away, for the weekend at least, to someone else).
> > >
> > > Yes, please try raid10 as well. For now I'll say this is a raid5
> > > problem.
> >
> > Tested: raid10 works just fine, i.e. no lockup and fio continues
> > having non-zero IOPS.
> >
> > > > > The best will be that if I can reporduce this problem myself.
> > > > > The problem is that I don't understand the step 4: turning off
> > > > > jbod slot's power, is this only possible for a real machine, or
> > > > > can I do this in my VM?
> > > >
> > > > Well, let's say that if it is possible, I don't know a way to do
> > > > that. The `sg_ses` commands that I used
> > > >
> > > >     sg_ses --dev-slot-num=9 --set=3:4:1 /dev/sg26 # turning off
> > > >     sg_ses --dev-slot-num=9 --clear=3:4:1 /dev/sg26 # turning on
> > > >
> > > > …sets and clears the value of the 3:4:1 bit, where the bit is
> > > > defined by the JBOD's manufacturer datasheet. The 3:4:1
> > > > specifically is defined by "AIC" manufacturer. That means the
> > > > command as is unlikely to work on a different hardware.
> > >
> > > I never do this before, I'll try.
> > > >
> > > > Well, while on it, do you have any thoughts why just using a
> > > > `echo 1 > /sys/block/sdX/device/delete` doesn't reproduce it?
> > > > Does perhaps kernel not emulate device disappearance too well?
> > >
> > > echo 1 > delete just delete the disk from kernel, and scsi/dm-raid
> > > will know that this disk is deleted. However, the disk will stay in
> > > kernel for the other way, dm-raid does not aware that underlying
> > > disks are problematic and IO will still be generated and issued.
> > >
> > > Thanks,
> > > Kuai
> >
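
For reference, the md-side check described at the top of this message
(compare the recovery speed reported in /proc/mdstat against
sync_speed_max, then pick a lower cap) can be sketched in shell roughly
as below. This is only an illustration: the helper names mdstat_speed,
sysfs_number and k_to_mib are made up for this example and are not part
of md or LVM, "K" is treated as KiB, and the commented-out echo at the
end is an assumption about how the lower value would be written on the
example array md0.

```shell
#!/bin/sh
# Sketch only: the helper names below are hypothetical, not md/LVM tools.

# Print the recovery speed (the number before "K/sec") found in
# /proc/mdstat-style text read from stdin.
mdstat_speed() {
    sed -n 's/.*speed=\([0-9][0-9]*\)K\/sec.*/\1/p'
}

# Strip a trailing marker such as " (system)" or " (local)" from a
# sysfs value like "200000 (system)".
sysfs_number() {
    printf '%s\n' "${1%% *}"
}

# Convert a K/sec figure to whole MiB/sec, treating K as KiB.
k_to_mib() {
    echo $(( $1 / 1024 ))
}

# Example using the values quoted above (no md array or root needed).
line='[=>...] recovery = 9.1% (89227836/976630272) finish=85.1min speed=173727K/sec'
speed=$(printf '%s\n' "$line" | mdstat_speed)
max=$(sysfs_number '200000 (system)')
echo "recovery speed: ${speed} K/sec, limit: ${max} K/sec"
echo "proposed limit: 65536 K/sec = $(k_to_mib 65536) MiB/sec"

# Assumed way to apply the lower limit on a real array (as root):
#   echo 65536 > /sys/block/md0/md/sync_speed_max
```

The conversion helper is there only to show that the md value of 65536
and the "64M" passed to lvchange --maxrecoveryrate describe the same
per-device rate.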