Re: Lockup of (raid5 or raid6) + vdo after taking out a disk under load

Bryan Gurney <bgurney@xxxxxxxxxx> · Wed, 31 Jul 2024 16:41:16 -0400

Hi Konstantin,

This sounds a lot like something that I encountered with md, back in
2019, on the old vdo-devel mailing list:

https://listman.redhat.com/archives/vdo-devel/2019-August/000171.html

Basically, I had a RAID-5 md array that was in the process of recovery:

$ cat /proc/mdstat
Personalities : [raid0] [raid6] [raid5] [raid4]
md0 : active raid5 sde[4] sdd[2] sdc[1] sdb[0]
      2929890816 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
      [=>...................]  recovery =  9.1% (89227836/976630272) finish=85.1min speed=173727K/sec
      bitmap: 0/8 pages [0KB], 65536KB chunk

Note that the speed of the recovery is 173,727 KB/sec, which is less
than the sync_speed_max value:

$ grep . /sys/block/md0/md/sync_speed*
/sys/block/md0/md/sync_speed:171052
/sys/block/md0/md/sync_speed_max:200000 (system)
/sys/block/md0/md/sync_speed_min:1000 (system)

...And when I decreased "sync_speed_max" to "65536", I stopped seeing
hung task timeouts.
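
As a rough sketch of how that change can be made: the helper name
"set_md_sync_max" and its optional sysfs-root argument are mine, purely
for illustration. The only real interface here is the sync_speed_max
file shown above. I'm assuming its value is in K/sec with K = 1024
bytes, so 64 MiB/sec is 64 * 1024 = 65536.

```shell
#!/bin/sh
# Sketch only: set_md_sync_max is a hypothetical helper, not an md tool.
# Usage: set_md_sync_max <md-device> <MiB-per-sec> [sysfs-root]
set_md_sync_max() {
    dev=$1
    mib=$2
    root=${3:-/sys}                 # overridable so it can be tried on a scratch dir
    knob="$root/block/$dev/md/sync_speed_max"
    [ -w "$knob" ] || { echo "cannot write $knob" >&2; return 1; }
    kib=$((mib * 1024))             # assumption: the file takes KiB/sec
    echo "$kib" > "$knob"
    cat "$knob"                     # read back what the file now holds
}
```

Run as root, "set_md_sync_max md0 64" would write 65536 to
/sys/block/md0/md/sync_speed_max. As far as I know, writing the word
"system" to the same file returns the array to the system-wide default.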

There's a similar setting in dm-raid: the "--maxrecoveryrate" option
of lvchange.  To set the maximum recovery rate to 64 MiB per second
per device for an example VG/LV of "p_r5/testdmraid5", the command
would be:

# lvchange --maxrecoveryrate 64M p_r5/testdmraid5

(Older hard disk drives may not have a sequential read / write speed
of more than 100 MiB/sec; this meant that md's default of 200 MiB/sec
was "too fast", and the recovery I/O would starve the VDO volume,
leaving it unable to service I/O.)

The current value of max_recovery_rate for dm-raid can be displayed
with "lvs -a -o +raid_max_recovery_rate".
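
Putting the two LVM commands together, a minimal sketch could look like
this. The wrapper name "cap_raid_recovery" and the DRY_RUN variable are
made up for illustration; only the lvchange and lvs invocations
themselves come from the commands above.

```shell
#!/bin/sh
# Sketch only: cap_raid_recovery is a hypothetical wrapper, not an LVM tool.
# Usage: cap_raid_recovery <VG/LV> <rate>     e.g. p_r5/testdmraid5 64M
# Set DRY_RUN=1 to print the commands instead of running them.
cap_raid_recovery() {
    lv=$1
    rate=$2
    run=""
    [ "${DRY_RUN:-0}" = 1 ] && run=echo
    # Lower the per-device recovery rate; stop here if that fails.
    $run lvchange --maxrecoveryrate "$rate" "$lv" || return 1
    # Show the value that is now in effect for this LV.
    $run lvs -a -o +raid_max_recovery_rate "$lv"
}
```

With DRY_RUN=1 the function only prints the two commands, which is a
cheap way to check the VG/LV name before touching the volume.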

If you reduce the maximum recovery rate for the dm-raid RAID-5 logical
volume, do the hung task timeouts for "dm-vdo0-bioQ*" stop appearing,
and does the fio job continue writing?

Thanks,

Bryan

On Wed, Jul 31, 2024 at 10:21 AM Konstantin Kharlamov
<Hi-Angel@xxxxxxxxx> wrote:
>
> CC'ing VDO maintainers, because the problem is only reproducible with
> VDO, so potentially they might have some ideas.
>
> On Mon, 2024-07-22 at 20:56 +0300, Konstantin Kharlamov wrote:
> > Hi, sorry for the delay, I had to give away the nodes and we had a week
> > of teambuilding and company party, so for the past week I only managed
> > to hack away stripping debug symbols, get another node and set it up.
> >
> > Experiments below are based off of vanilla 6.9.8 kernel *without* your
> > patch.
> >
> > On Mon, 2024-07-15 at 09:56 +0800, Yu Kuai wrote:
> > > Line number will be helpful.
> >
> > So, after tinkering with building scripts I managed to build modules
> > with debug symbols (not the kernel itself but should be good enough),
> > but for some reason the kernel doesn't show line numbers in stacktraces.
> > No idea what could be causing it, so I had to decode line numbers
> > manually. Below is an output where I inserted line numbers for raid456
> > by hand after decoding them with `gdb`.
> >
> >     […]
> >     [ 1677.293366]  <TASK>
> >     [ 1677.293661]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> >     [ 1677.293972]  ? _raw_spin_unlock_irq+0x10/0x30
> >     [ 1677.294276]  ? _raw_spin_unlock_irq+0xa/0x30
> >     [ 1677.294586]  raid5d at drivers/md/raid5.c:6572
> >     [ 1677.294910]  md_thread+0xc1/0x170
> >     [ 1677.295228]  ? __pfx_autoremove_wake_function+0x10/0x10
> >     [ 1677.295545]  ? __pfx_md_thread+0x10/0x10
> >     [ 1677.295870]  kthread+0xff/0x130
> >     [ 1677.296189]  ? __pfx_kthread+0x10/0x10
> >     [ 1677.296498]  ret_from_fork+0x30/0x50
> >     [ 1677.296810]  ? __pfx_kthread+0x10/0x10
> >     [ 1677.297112]  ret_from_fork_asm+0x1a/0x30
> >     [ 1677.297424]  </TASK>
> >     […]
> >     [ 1705.296253]  <TASK>
> >     [ 1705.296554]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> >     [ 1705.296864]  ? _raw_spin_unlock_irq+0x10/0x30
> >     [ 1705.297172]  ? _raw_spin_unlock_irq+0xa/0x30
> >     [ 1677.294586]  raid5d at drivers/md/raid5.c:6597
> >     [ 1705.297794]  md_thread+0xc1/0x170
> >     [ 1705.298099]  ? __pfx_autoremove_wake_function+0x10/0x10
> >     [ 1705.298409]  ? __pfx_md_thread+0x10/0x10
> >     [ 1705.298714]  kthread+0xff/0x130
> >     [ 1705.299022]  ? __pfx_kthread+0x10/0x10
> >     [ 1705.299333]  ret_from_fork+0x30/0x50
> >     [ 1705.299641]  ? __pfx_kthread+0x10/0x10
> >     [ 1705.299947]  ret_from_fork_asm+0x1a/0x30
> >     [ 1705.300257]  </TASK>
> >     […]
> >     [ 1733.296255]  <TASK>
> >     [ 1733.296556]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> >     [ 1733.296862]  ? _raw_spin_unlock_irq+0x10/0x30
> >     [ 1733.297170]  ? _raw_spin_unlock_irq+0xa/0x30
> >     [ 1677.294586]  raid5d at drivers/md/raid5.c:6572
> >     [ 1733.297792]  md_thread+0xc1/0x170
> >     [ 1733.298096]  ? __pfx_autoremove_wake_function+0x10/0x10
> >     [ 1733.298403]  ? __pfx_md_thread+0x10/0x10
> >     [ 1733.298711]  kthread+0xff/0x130
> >     [ 1733.299018]  ? __pfx_kthread+0x10/0x10
> >     [ 1733.299330]  ret_from_fork+0x30/0x50
> >     [ 1733.299637]  ? __pfx_kthread+0x10/0x10
> >     [ 1733.299943]  ret_from_fork_asm+0x1a/0x30
> >     [ 1733.300251]  </TASK>
> >
> > > Meanwhile, can you check if the underlying
> > > disks has IO while raid5 stuck, by /sys/block/[device]/inflight.
> >
> > The two devices that are left after the 3rd one is removed have these
> > numbers that don't change with time:
> >
> >     [Mon Jul 22 20:18:06 @ ~]:> for d in dm-19 dm-17; do echo -n $d; cat /sys/block/$d/inflight; done
> >     dm-19       9        1
> >     dm-17      11        2
> >     [Mon Jul 22 20:18:11 @ ~]:> for d in dm-19 dm-17; do echo -n $d; cat /sys/block/$d/inflight; done
> >     dm-19       9        1
> >     dm-17      11        2
> >
> > They also don't change after I return the disk back (which is to be
> > expected I guess, given that the lockup doesn't go away).
> >
> > > >
> > > > > First, can the problem be reproduced with raid1/raid10? If not,
> > > > > this is probably a raid5 bug.
> > > >
> > > > This is not reproducible with raid1 (i.e. no lockups for raid1), I
> > > > tested that. I didn't test raid10, if you want I can try (but probably
> > > > only after the weekend, because today I was asked to give the nodes
> > > > away, for the weekend at least, to someone else).
> > >
> > > Yes, please try raid10 as well. For now I'll say this is a raid5
> > > problem.
> >
> > Tested: raid10 works just fine, i.e. no lockup and fio continues
> > having non-zero IOPS.
> >
> > > > > It would be best if I could reproduce this problem myself.
> > > > > The problem is that I don't understand step 4: turning off the
> > > > > jbod slot's power. Is this only possible on a real machine, or
> > > > > can I do this in my VM?
> > > >
> > > > Well, let's say that if it is possible, I don't know a way to do that.
> > > > The `sg_ses` commands that I used
> > > >
> > > >   sg_ses --dev-slot-num=9 --set=3:4:1   /dev/sg26 # turning off
> > > >   sg_ses --dev-slot-num=9 --clear=3:4:1 /dev/sg26 # turning on
> > > >
> > > > …set and clear the value of the 3:4:1 bit, where the bit is defined
> > > > by the JBOD's manufacturer datasheet. The 3:4:1 specifically is defined
> > > > by "AIC" manufacturer. That means the command, as is, is unlikely to
> > > > work on different hardware.
> > >
> > > I've never done this before, I'll try.
> > > >
> > > > Well, while on it, do you have any thoughts why just using a
> > > > `echo 1 > /sys/block/sdX/device/delete` doesn't reproduce it? Does
> > > > perhaps kernel not emulate device disappearance too well?
> > >
> > > `echo 1 > delete` just deletes the disk from the kernel, and
> > > scsi/dm-raid will know that this disk is deleted. However, with the
> > > other way the disk stays in the kernel: dm-raid is not aware that the
> > > underlying disks are problematic, and IO will still be generated and
> > > issued.
> > >
> > > Thanks,
> > > Kuai
>
>