Re: Lockup of (raid5 or raid6) + vdo after taking out a disk under load

Matthew Sakai <msakai@xxxxxxxxxx> · Wed, 31 Jul 2024 17:33:41 -0400

On 7/31/24 10:14, Konstantin Kharlamov wrote:
CC'ing VDO maintainers, because the problem is only reproducible with
VDO, so potentially they might have some ideas.

I don't see anything that implicates VDO directly. The blocked VDO 
threads (with the test patch) seem to be stuck in raid5_make_request() 
so it seems like the raid itself is not handling requests in a timely 
manner.

There is one potentially useful detail, however: VDO mostly submits 4K 
bios. The large number of smaller bios may be exacerbating an issue in 
the raid5.

Matt

On Mon, 2024-07-22 at 20:56 +0300, Konstantin Kharlamov wrote:
Hi, sorry for the delay, I had to give away the nodes and we had a week
of teambuilding and company party, so for the past week I only managed
to hack away stripping debug symbols, get another node and set it up.

Experiments below are based off of vanilla 6.9.8 kernel *without* your
patch.

On Mon, 2024-07-15 at 09:56 +0800, Yu Kuai wrote:
Line number will be helpful.

So, after tinkering with the build scripts I managed to build modules
with debug symbols (not the kernel itself, but that should be good
enough), but for some reason the kernel doesn't show line numbers in
stacktraces. No idea what could be causing it, so I had to decode the
line numbers manually. Below is an output where I inserted the line
numbers for raid456 by hand after decoding them with `gdb`.

     […]
     [ 1677.293366]  <TASK>
     [ 1677.293661]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
     [ 1677.293972]  ? _raw_spin_unlock_irq+0x10/0x30
     [ 1677.294276]  ? _raw_spin_unlock_irq+0xa/0x30
     [ 1677.294586]  raid5d at drivers/md/raid5.c:6572
     [ 1677.294910]  md_thread+0xc1/0x170
     [ 1677.295228]  ? __pfx_autoremove_wake_function+0x10/0x10
     [ 1677.295545]  ? __pfx_md_thread+0x10/0x10
     [ 1677.295870]  kthread+0xff/0x130
     [ 1677.296189]  ? __pfx_kthread+0x10/0x10
     [ 1677.296498]  ret_from_fork+0x30/0x50
     [ 1677.296810]  ? __pfx_kthread+0x10/0x10
     [ 1677.297112]  ret_from_fork_asm+0x1a/0x30
     [ 1677.297424]  </TASK>
     […]
     [ 1705.296253]  <TASK>
     [ 1705.296554]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
     [ 1705.296864]  ? _raw_spin_unlock_irq+0x10/0x30
     [ 1705.297172]  ? _raw_spin_unlock_irq+0xa/0x30
     [ 1677.294586]  raid5d at drivers/md/raid5.c:6597
     [ 1705.297794]  md_thread+0xc1/0x170
     [ 1705.298099]  ? __pfx_autoremove_wake_function+0x10/0x10
     [ 1705.298409]  ? __pfx_md_thread+0x10/0x10
     [ 1705.298714]  kthread+0xff/0x130
     [ 1705.299022]  ? __pfx_kthread+0x10/0x10
     [ 1705.299333]  ret_from_fork+0x30/0x50
     [ 1705.299641]  ? __pfx_kthread+0x10/0x10
     [ 1705.299947]  ret_from_fork_asm+0x1a/0x30
     [ 1705.300257]  </TASK>
     […]
     [ 1733.296255]  <TASK>
     [ 1733.296556]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
     [ 1733.296862]  ? _raw_spin_unlock_irq+0x10/0x30
     [ 1733.297170]  ? _raw_spin_unlock_irq+0xa/0x30
     [ 1677.294586]  raid5d at drivers/md/raid5.c:6572
     [ 1733.297792]  md_thread+0xc1/0x170
     [ 1733.298096]  ? __pfx_autoremove_wake_function+0x10/0x10
     [ 1733.298403]  ? __pfx_md_thread+0x10/0x10
     [ 1733.298711]  kthread+0xff/0x130
     [ 1733.299018]  ? __pfx_kthread+0x10/0x10
     [ 1733.299330]  ret_from_fork+0x30/0x50
     [ 1733.299637]  ? __pfx_kthread+0x10/0x10
     [ 1733.299943]  ret_from_fork_asm+0x1a/0x30
     [ 1733.300251]  </TASK>
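
For reference, the manual decoding mentioned above can be sketched
roughly as follows. This is only an illustration, not necessarily the
exact procedure used here: the module path is a placeholder, and
`md_thread+0xc1/0x170` is taken from the trace purely as an example of a
symbol+offset frame (the raid5d offsets are not shown above). It assumes
gdb's `list *(symbol+offset)` form, run against an object file that was
built with debug info. The helper name is made up for this sketch.

```shell
# Turn a stack-trace frame such as "md_thread+0xc1/0x170" into a gdb
# command that prints the matching source line.
# frame_to_gdb_cmd is a hypothetical helper; the module path is a placeholder.
frame_to_gdb_cmd() {
    module=$1            # path to a .ko (or vmlinux) built with debug info
    frame=$2             # "symbol+0xoffset/0xsize" as printed by the kernel
    addr=${frame%%/*}    # drop the "/0xsize" suffix, leaving "symbol+0xoffset"
    printf 'gdb -batch -ex "list *(%s)" %s\n' "$addr" "$module"
}

# Print (not run) the command for one frame from the trace above.
frame_to_gdb_cmd ./md-mod.ko 'md_thread+0xc1/0x170'
```

The kernel tree's scripts/decode_stacktrace.sh does a similar lookup for
a whole log, which may be more convenient when it works for the build in
question.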

Meanwhile, can you check whether the underlying disks have IO while
raid5 is stuck, via /sys/block/[device]/inflight?

The two devices that are left after the 3rd one is removed have these
numbers, which don't change over time:

     [Mon Jul 22 20:18:06 @ ~]:> for d in dm-19 dm-17; do echo -n $d; cat /sys/block/$d/inflight; done
     dm-19       9        1
     dm-17      11        2
     [Mon Jul 22 20:18:11 @ ~]:> for d in dm-19 dm-17; do echo -n $d; cat /sys/block/$d/inflight; done
     dm-19       9        1
     dm-17      11        2

They also don't change after I return the disk back (which is to be
expected I guess, given that the lockup doesn't go away).
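
The same check can be sketched as a small helper that samples a device's
inflight file twice and reports whether the counters moved. This is only
an illustration: the function name, the 5-second default interval and
the optional sysfs root argument are assumptions made for this sketch.
The two columns of the file are the read and write requests currently in
flight.

```shell
# Sample <root>/<dev>/inflight twice and report whether it changed.
# inflight_unchanged is a hypothetical helper; the interval and the root
# override are arbitrary choices, not something taken from this thread.
inflight_unchanged() {
    dev=$1
    interval=${2:-5}          # seconds between the two samples
    root=${3:-/sys/block}     # can be overridden, e.g. for a dry run
    read -r r1 w1 < "$root/$dev/inflight" || return 2
    sleep "$interval"
    read -r r2 w2 < "$root/$dev/inflight" || return 2
    if [ "$r1 $w1" = "$r2 $w2" ]; then
        echo "$dev: inflight unchanged (reads=$r1 writes=$w1)"
    else
        echo "$dev: inflight changed ($r1 $w1 -> $r2 $w2)"
        return 1
    fi
}

# Possible usage on the affected node:
#   for d in dm-19 dm-17; do inflight_unchanged "$d"; done
```

Identical non-zero counters across samples only suggest that the
requests are not completing; they don't say where the requests are stuck.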

First, can the problem be reproduced with raid1/raid10? If not, this is
probably a raid5 bug.

This is not reproducible with raid1 (i.e. no lockups for raid1), I
tested that. I didn't test raid10, if you want I can try (but probably
only after the weekend, because today I was asked to give the nodes
away, for the weekend at least, to someone else).

Yes, please try raid10 as well. For now I'll say this is a raid5
problem.

Tested: raid10 works just fine, i.e. no lockup and fio continues
having non-zero IOPS.

It would be best if I could reproduce this problem myself. The problem
is that I don't understand step 4: turning off the JBOD slot's power.
Is this only possible on a real machine, or can I do this in my VM?

Well, let's say that if it is possible, I don't know a way to do that.
The `sg_ses` commands that I used

	sg_ses --dev-slot-num=9 --set=3:4:1   /dev/sg26 # turning off
	sg_ses --dev-slot-num=9 --clear=3:4:1 /dev/sg26 # turning on

…set and clear the value of the 3:4:1 bit, where the bit is defined by
the JBOD manufacturer's datasheet. The 3:4:1 specifically is defined by
the manufacturer "AIC". That means the commands as they are are unlikely
to work on different hardware.

I have never done this before, I'll try.

Well, while at it, do you have any thoughts on why just using
`echo 1 > /sys/block/sdX/device/delete` doesn't reproduce it? Does the
kernel perhaps not emulate device disappearance too well?

echo 1 > delete just deletes the disk from the kernel, and scsi/dm-raid
will know that this disk is deleted. With the other way, however, the
disk stays in the kernel: dm-raid is not aware that the underlying disks
are problematic, and IO will still be generated and issued.

Thanks,
Kuai