Re: Lockup of (raid5 or raid6) + vdo after taking out a disk under load

Hi,

On 2024/07/12 14:53, Konstantin Kharlamov wrote:
(this is a duplicate of yesterday's Bugzilla bug 219030; some maintainers
prefer bug reports via Bugzilla and some via the ML, and it is hard to
tell which applies here)

At least I'll keep an eye on the ML, not Bugzilla.

We're seeing this on both LTS `6.1.97` and the newest stable
`6.9.8` (note though that VDO was only merged in 6.9, so for upstream
the latter is probably the only one that matters; traces below are from
6.9.8).

Creating a raid5 (alternatively raid6) + VDO, starting a load, and then
after a minute physically taking out a disk of the VG results in a
soft-lockup stack trace getting spammed over and over.

It seems unreproducible by removing a disk through software means
(`/sys/block/sdX/device/delete`); i.e. it seems to only happen when
either removing the disk physically or disabling power on the JBOD slot
with a command like `sg_ses --dev-slot-num=9 --set=3:4:1 /dev/sg26`.
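
For the power-off variant, a minimal sketch of the two directions, assuming
the enclosure's SES node is `/dev/sg26` and the drive sits in slot 9 as in
the command above (the right node can be found with `lsscsi -g`); if I read
SES-3 correctly, `3:4:1` addresses the DEVICE OFF bit of the Device Slot
control element:

     sg_ses --dev-slot-num=9 --set=3:4:1 /dev/sg26    # power the slot off
     sg_ses --dev-slot-num=9 --clear=3:4:1 /dev/sg26  # should power it back on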

Also, when the lockup happens, `reboot -ff` no longer does anything.

Judging by the repeating stack traces, execution seems to loop
around `raid5_get_active_stripe`, `raid5_compute_sector`, and
`raid5_release_stripe`.

# Steps to reproduce

1. Create the raid5 LV + VDO by executing `./mk_lvm_raid5.sh /dev/sdX1
/dev/sdY1 /dev/sdZ1`, where `mk_lvm_raid5.sh` has the following content
(a sketch for sanity-checking the result follows this list):

     #!/bin/bash

     set -exu

     if [ "$#" -ne 3 ]; then
         echo "Wrong number of parameters.
     Usage: $(basename $0) disk1 disk2 disk3"
         exit 1
     fi

     # create the VG
     pvcreate -f "$1" "$2" "$3"
     vgcreate p_r5 "$1" "$2" "$3"

     # create the LV and convert it into a VDO pool
     lvcreate --type raid5 -i 2 -L 21474836480b -I 64K \
         -n vdo_internal_deco_vol p_r5 -y
     lvconvert -y --type vdo-pool --virtualsize 107374182400B \
         -n deco_vol p_r5/vdo_internal_deco_vol

2. Start the load by executing `fio ./fio-30%write.fio`, where
`fio-30%write.fio` has the following content:

     [test IOPS]
     blocksize=8k
     filename=/dev/p_r5/deco_vol
     filesize=100G
     direct=1
     buffered=0
     ioengine=libaio
     iodepth=32
     rw=randrw
     rwmixwrite=30
     numjobs=4
     group_reporting
     time_based
     runtime=99h
     clat_percentiles=0
     unlink=1

3. Wait for about a minute
4. Remove a disk from the volume group, either physically or by turning
off the JBOD slot's power (DO NOT use /…/device/delete).
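
As an aside, a minimal sketch for sanity-checking the layout from step 1
and for watching LV health while fio runs; the VG/LV names are the ones
created by `mk_lvm_raid5.sh`, the rest is stock lvm2/dmsetup usage:

     # verify the raid5 + vdo-pool stack created in step 1
     lvs -a -o name,segtype,sync_percent,lv_health_status,devices p_r5

     # LVM raid volumes do not show up in /proc/mdstat, so poll the
     # dm-raid status lines instead while the load is running
     watch -n1 'dmsetup status | grep -w raid'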

Looks like this is because IO is failed from the raid level, and then the
dm level keeps retrying this IO (this will be related to step 4), hence
raid5d is stuck in the loop handling new IO.

Can you give the following patch a test to confirm this?

Thanks!
Kuai

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c14cf2410365..a0f784cd664c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6776,7 +6776,9 @@ static void raid5d(struct md_thread *thread)

                while ((bio = remove_bio_from_retry(conf, &offset))) {
                        int ok;
+
                        spin_unlock_irq(&conf->device_lock);
+                       cond_resched();
                        ok = retry_aligned_read(conf, bio, offset);
                        spin_lock_irq(&conf->device_lock);
                        if (!ok)
@@ -6790,11 +6792,11 @@ static void raid5d(struct md_thread *thread)
                        break;
                handled += batch_size;

-               if (mddev->sb_flags & ~(1 << MD_SB_CHANGE_PENDING)) {
-                       spin_unlock_irq(&conf->device_lock);
+               spin_unlock_irq(&conf->device_lock);
+               if (mddev->sb_flags & ~(1 << MD_SB_CHANGE_PENDING))
                        md_check_recovery(mddev);
-                       spin_lock_irq(&conf->device_lock);
-               }
+               cond_resched();
+               spin_lock_irq(&conf->device_lock);
        }
        pr_debug("%d stripes handled\n", handled);


5. Wait 30 seconds

## Expected

`dmesg -w` won't show anything interesting

## Actual

`dmesg -w` starts showing traces like the following (not sure why the
line numbers are missing; I have debug symbols locally and can decode
them manually with `gdb` if needed; a decoding sketch follows the trace):

     […]
     [  869.086048] mpt3sas_cm0: log_info(0x31110e05): originator(PL), code(0x11), sub_code(0x0e05)
     [  869.139959] mpt3sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x5002538a67b01303)
     [  869.140085] mpt3sas_cm0: removing handle(0x0022), sas_addr(0x5002538a67b01303)
     [  869.140191] mpt3sas_cm0: enclosure logical id(0x50015b2140128f7f), slot(12)
     [  869.140293] mpt3sas_cm0: enclosure level(0x0000), connector name(     )
     [  893.262506] watchdog: BUG: soft lockup - CPU#5 stuck for 26s! [mdX_raid5:24608]
     [  893.262641] Modules linked in: dm_raid mptctl mptbase bonding fcoe libfcoe 8021q garp mrp libfc stp llc scsi_transport_fc sch_fq_codel softdog dm_round_robin scsi_dh_alua intel_rapl_msr coretemp intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp ast kvm_intel drm_shmem_helper drm_kms_helper kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd rapl intel_cstate wdat_wdt intel_pch_thermal ipmi_ssif dm_multipath acpi_ipmi sg ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter ramoops reed_solomon efi_pstore drm ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq libcrc32c raid0 ses enclosure mlx5_ib ib_core raid1 ixgbe igb mlx5_core mdio_devres xfrm_algo i2c_algo_bit e1000e mpt3sas hwmon dca mdio i2c_i801 ahci lpc_ich ptp i2c_smbus mfd_core raid_class libahci pps_core libphy scsi_transport_sas
     [  893.264033] CPU: 5 PID: 24608 Comm: mdX_raid5 Not tainted 6.9.8-bstrg #3
     [  893.264254] Hardware name: AIC HA401-LB2/LIBRA, BIOS LIBBV071 04/19/2017
     [  893.264479] RIP: 0010:_raw_spin_lock_irqsave+0xe/0x30
     [  893.264714] Code: 48 89 ef 5d e9 13 33 47 ff 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 41 54 55 48 89 fd 9c 41 5c <fa> bf 01 00 00 00 e8 07 96 43 ff 48 89 ef e8 df 32 47 ff 4c 89 e0
     [  893.265229] RSP: 0018:ffffbc306091fd70 EFLAGS: 00000246
     [  893.265502] RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000
     [  893.265783] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff9fbacf48c430
     [  893.266069] RBP: ffff9fbacf48c430 R08: ffff9fcac1254d58 R09: 00000000ffffffff
     [  893.266366] R10: 0000000000000000 R11: ffffbc306091fe20 R12: 0000000000000246
     [  893.266664] R13: ffff9fcac1254dc0 R14: 0000000000000000 R15: 0000000000739d18
     [  893.266968] FS:  0000000000000000(0000) GS:ffff9fc9fff40000(0000) knlGS:0000000000000000
     [  893.267282] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
     [  893.267603] CR2: 000055f91452ff88 CR3: 0000000124c74004 CR4: 00000000003706f0
     [  893.267933] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
     [  893.268268] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
     [  893.268605] Call Trace:
     [  893.268945]  <IRQ>
     [  893.269285]  ? watchdog_timer_fn+0x24b/0x2d0
     [  893.269638]  ? __pfx_watchdog_timer_fn+0x10/0x10
     [  893.269990]  ? __hrtimer_run_queues+0x112/0x2b0
     [  893.270347]  ? hrtimer_interrupt+0x101/0x240
     [  893.270708]  ? __sysvec_apic_timer_interrupt+0x6e/0x180
     [  893.271075]  ? sysvec_apic_timer_interrupt+0x9d/0xd0
     [  893.271448]  </IRQ>
     [  893.271816]  <TASK>
     [  893.272184]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
     [  893.272568]  ? _raw_spin_lock_irqsave+0xe/0x30
     [  893.272956]  __wake_up+0x1d/0x60
     [  893.273345]  md_wakeup_thread+0x38/0x70
     [  893.273738]  raid5d+0x32a/0x5d0 [raid456]
     [  893.274147]  md_thread+0xc1/0x170
     [  893.274545]  ? __pfx_autoremove_wake_function+0x10/0x10
     [  893.274951]  ? __pfx_md_thread+0x10/0x10
     [  893.275358]  kthread+0xff/0x130
     [  893.275760]  ? __pfx_kthread+0x10/0x10
     [  893.276172]  ret_from_fork+0x30/0x50
     [  893.276568]  ? __pfx_kthread+0x10/0x10
     [  893.276961]  ret_from_fork_asm+0x1a/0x30
     [  893.277351]  </TASK>
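
For reference, missing line numbers like the above can usually be
recovered with the kernel's `scripts/decode_stacktrace.sh` rather than
manual `gdb` work; a minimal sketch, assuming it is run from the root of
the built source tree with a debug-info `vmlinux` present:

     dmesg | ./scripts/decode_stacktrace.sh vmlinux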
