Forgive me if this is the wrong place to ask for help on this issue. I've scoured the internet for more tools to debug these arrays and came up short. I'm afraid I may have done something wrong and gotten my Linux md RAID array into a weird state (I can't re-add a failed device). It started with a disk failure, but first some context: the array is a non-root 5-device RAID6 array in a VM, with the drives passed into the VM via virtio. The host drives are a JBOD connected to an LSI HBA; for simplicity's sake I do not pass the LSI controller through to the VM as a PCI device. I'm not sure whether that's relevant, but it adds another layer of indirection, in case it matters.

I observed the disk failure as kernel tasks hanging while reading/writing to disk, despite every disk being reported healthy by smartmontools on both the host and the guest. The SMART self-tests were fine as well (as far as I can tell; a few slow reads, but nothing crazy). I couldn't unmount the filesystem or stop the array cleanly with `mdadm --stop /dev/md127`: both commands hung indefinitely, and the md_raid6 kernel task was likewise in the `D` (uninterruptible sleep) state. No threads on the host machine were blocked on I/O during this time. I tried a normal shutdown of the VM, but when that also hung indefinitely (I waited an hour), I forced the machine off. I brought it back up briefly to verify that there was still an issue, with the same results: I could start and mount the array (it has an ext4 filesystem on top), but I could not read from it or unmount it. I then forcefully brought the machine down again.

I took a wild guess and replaced the first drive in the array, which also showed the highest count of read errors corrected by ECC according to SMART (approximately 300k over 470 TiB processed). Fortunately, the array came back online and was mountable/readable again. I swapped in a spare drive by adding a new disk in KVM at the same device node, `/dev/vdb`. I copied the partition table from a good member with `sfdisk -d /dev/vdc | sfdisk /dev/vdb`, then added the new partition to the array with `mdadm --manage /dev/md127 --add /dev/vdb1` (the full sequence is collected in the sketch below). After the sync finished, that drive worked for a few days, until I rebooted the machine for updates and the system hung at shutdown again, failing to unmount the array. So I went through the same process to replace the drive with yet another one, and this time I rebooted immediately after the sync/recovery of the newly added drive finished. Lo and behold, this one hung too. What are the chances that three drives failed at almost exactly the same time?

Assembling the array from the four disks outside that first slot works, at least for mounting read-only (I have not tried read/write yet). How do I go about debugging this? `mdadm --examine` reports `State: clean` for every drive, the checksums are correct, and everything looks fine from every tool I have. There were no useful md logs in dmesg aside from the hung-task backtraces (below). I know that detecting disk failures can be tricky (particularly when the disks aren't throwing errors), but I also don't expect md arrays to hang completely when I try to safely stop using them.

Below are the details of my setup. How can I go about debugging this and recovering the full array? What should I do from here? It feels like a software issue, since I can't see any errors through other tools and the "faulty" disks read fine from the host (I haven't tried writing to them). Please let me know if there's any more information I can provide.
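For completeness, this is the replacement sequence I ran each time on the guest (device names as in my setup: `/dev/vdb` is the freshly attached disk, `/dev/vdc` a known-good member):

```
# Clone the partition table from a known-good member onto the new disk.
sfdisk -d /dev/vdc | sfdisk /dev/vdb

# Add the new partition; md rebuilds it into the missing slot.
mdadm --manage /dev/md127 --add /dev/vdb1

# How I monitored the recovery until it finished.
watch cat /proc/mdstat
```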
Thanks,
- Alan

## Host Details

Hardware details:

```
BIOS Information
    Vendor: American Megatrends Inc.
    Version: 3.3
    Release Date: 07/19/2018
    Address: 0xF0000
    Runtime Size: 64 kB
    ROM Size: 12 MB
    BIOS Revision: 4.6

Handle 0x0001, DMI type 1, 27 bytes
System Information
    Manufacturer: Supermicro
    Product Name: X9DRT
    Version: 0123456789
    Serial Number: 0123456789
    UUID: 00000000-0000-0000-0000-002590a29e98
    Wake-up Type: Power Switch
    SKU Number: To be filled by O.E.M.
    Family: To be filled by O.E.M.

Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
    Manufacturer: Supermicro
    Product Name: X9DRT
    Version: 1.21
    Serial Number: ZM2AU38792
    Asset Tag: To be filled by O.E.M.
    Features:
        Board is a hosting board
        Board is replaceable
    Location In Chassis: To be filled by O.E.M.
    Chassis Handle: 0x0003
    Type: Motherboard
    Contained Object Handles: 0

Handle 0x0004, DMI type 4, 42 bytes
Processor Information
    Socket Designation: SOCKET 0
    Type: Central Processor
    Family: Xeon
    Manufacturer: Intel
    ID: E4 06 03 00 FF FB EB BF
    Signature: Type 0, Family 6, Model 62, Stepping 4
```

Host HBA device:

```
LSI 9200-8e 6Gbps 8-lane external SAS HBA P20 IT Mode
```

Arch Linux on the host:

```
[abraithwaite@ceres ~]$ uname -a
Linux ceres 5.17.9-arch1-1 #1 SMP PREEMPT Wed, 18 May 2022 17:30:11 +0000 x86_64 GNU/Linux
[abraithwaite@ceres ~]$ pacman -Q | grep -i libvirt
libvirt 1:8.3.0-1
[abraithwaite@ceres ~]$ pacman -Q | grep -i qemu-common
qemu-common 7.0.0-10
```

Host storage details:

```
sudo dmesg | grep -i lsi
[ 1.862556] mpt2sas_cm0: LSISAS2008: FWVersion(20.00.07.00), ChipRevision(0x03), BiosVersion(00.00.00.00)
[ 3.419509] scsi 7:0:9:0: Enclosure LSI CORP SAS2X36 0718 PQ: 0 ANSI: 5
[abraithwaite@ceres ~]$ sudo dmesg | grep -E '(lsi|sas|scsi)'
[ 3.379175] mpt2sas_cm0: hba_port entry: 00000000e975d737, port: 255 is added to hba_port list
[ 3.384029] mpt2sas_cm0: host_add: handle(0x0001), sas_addr(0x500605b0044145d0), phys(8)
[ 3.386017] mpt2sas_cm0: expander_add: handle(0x0009), parent(0x0001), sas_addr(0x50030480003954bf), phys(38)
[ 3.386589] expander-7:0: add: handle(0x0009), sas_addr(0x50030480003954bf)
[ 3.397307] mpt2sas_cm0: handle(0xa) sas_address(0x5000cca06d2807be) port_type(0x1)
[ 3.398416] scsi 7:0:0:0: Direct-Access HGST HUS724020ALS640 A1C4 PQ: 0 ANSI: 6
[ 3.398427] scsi 7:0:0:0: SSP: handle(0x000a), sas_addr(0x5000cca06d2807be), phy(8), device_name(0x5000cca06d2807be)
[ 3.398431] scsi 7:0:0:0: enclosure logical id (0x500304800039543f), slot(0)
[ 3.398438] scsi 7:0:0:0: qdepth(254), tagged(1), scsi_level(7), cmd_que(1)
[ 3.398676] scsi 7:0:0:0: Power-on or device reset occurred
[ 3.399879] end_device-7:0:0: add: handle(0x000a), sas_addr(0x5000cca06d2807be)
[ 3.400117] mpt2sas_cm0: handle(0xb) sas_address(0x5000cca06d2927fe) port_type(0x1)
[ 3.400901] scsi 7:0:1:0: Direct-Access HGST HUS724020ALS640 A1C4 PQ: 0 ANSI: 6
[ 3.400906] scsi 7:0:1:0: SSP: handle(0x000b), sas_addr(0x5000cca06d2927fe), phy(9), device_name(0x5000cca06d2927fe)
[ 3.400907] scsi 7:0:1:0: enclosure logical id (0x500304800039543f), slot(1)
[ 3.400911] scsi 7:0:1:0: qdepth(254), tagged(1), scsi_level(7), cmd_que(1)
[ 3.401085] scsi 7:0:1:0: Power-on or device reset occurred
[ 3.402223] end_device-7:0:1: add: handle(0x000b), sas_addr(0x5000cca06d2927fe)
[ 3.402454] mpt2sas_cm0: handle(0xc) sas_address(0x5000cca06d2924f2) port_type(0x1)
[ 3.403399] scsi 7:0:2:0: Direct-Access HGST HUS724020ALS640 A1C4 PQ: 0 ANSI: 6
[ 3.403404] scsi 7:0:2:0: SSP: handle(0x000c), sas_addr(0x5000cca06d2924f2), phy(10), device_name(0x5000cca06d2924f2)
[ 3.403406] scsi 7:0:2:0: enclosure logical id (0x500304800039543f), slot(2)
[ 3.403410] scsi 7:0:2:0: qdepth(254), tagged(1), scsi_level(7), cmd_que(1)
[ 3.403586] scsi 7:0:2:0: Power-on or device reset occurred
[ 3.404672] end_device-7:0:2: add: handle(0x000c), sas_addr(0x5000cca06d2924f2)
[ 3.404915] mpt2sas_cm0: handle(0xd) sas_address(0x5000cca06d283b0a) port_type(0x1)
[ 3.405667] scsi 7:0:3:0: Direct-Access HGST HUS724020ALS640 A1C4 PQ: 0 ANSI: 6
[ 3.405672] scsi 7:0:3:0: SSP: handle(0x000d), sas_addr(0x5000cca06d283b0a), phy(11), device_name(0x5000cca06d283b0a)
[ 3.405674] scsi 7:0:3:0: enclosure logical id (0x500304800039543f), slot(3)
[ 3.405678] scsi 7:0:3:0: qdepth(254), tagged(1), scsi_level(7), cmd_que(1)
[ 3.405855] scsi 7:0:3:0: Power-on or device reset occurred
[ 3.406944] end_device-7:0:3: add: handle(0x000d), sas_addr(0x5000cca06d283b0a)
[ 3.407182] mpt2sas_cm0: handle(0xe) sas_address(0x5000cca0284dc06a) port_type(0x1)
[ 3.407901] scsi 7:0:4:0: Direct-Access HGST HUS724020ALS640 A2C0 PQ: 0 ANSI: 6
[ 3.407906] scsi 7:0:4:0: SSP: handle(0x000e), sas_addr(0x5000cca0284dc06a), phy(12), device_name(0x5000cca0284dc06a)
[ 3.407908] scsi 7:0:4:0: enclosure logical id (0x500304800039543f), slot(4)
[ 3.407912] scsi 7:0:4:0: qdepth(254), tagged(1), scsi_level(7), cmd_que(1)
[ 3.408088] scsi 7:0:4:0: Power-on or device reset occurred
[ 3.409221] end_device-7:0:4: add: handle(0x000e), sas_addr(0x5000cca0284dc06a)
[ 3.409465] mpt2sas_cm0: handle(0xf) sas_address(0x5000cca06d287592) port_type(0x1)
[ 3.410156] scsi 7:0:5:0: Direct-Access HGST HUS724020ALS640 A1C4 PQ: 0 ANSI: 6
[ 3.410161] scsi 7:0:5:0: SSP: handle(0x000f), sas_addr(0x5000cca06d287592), phy(14), device_name(0x5000cca06d287592)
[ 3.410162] scsi 7:0:5:0: enclosure logical id (0x500304800039543f), slot(6)
[ 3.410166] scsi 7:0:5:0: qdepth(254), tagged(1), scsi_level(7), cmd_que(1)
[ 3.410342] scsi 7:0:5:0: Power-on or device reset occurred
[ 3.411402] end_device-7:0:5: add: handle(0x000f), sas_addr(0x5000cca06d287592)
[ 3.411643] mpt2sas_cm0: handle(0x10) sas_address(0x5000cca06d3da1be) port_type(0x1)
[ 3.412464] scsi 7:0:6:0: Direct-Access HGST HUS724020ALS640 A2C0 PQ: 0 ANSI: 6
[ 3.412469] scsi 7:0:6:0: SSP: handle(0x0010), sas_addr(0x5000cca06d3da1be), phy(15), device_name(0x5000cca06d3da1be)
[ 3.412471] scsi 7:0:6:0: enclosure logical id (0x500304800039543f), slot(7)
[ 3.412474] scsi 7:0:6:0: qdepth(254), tagged(1), scsi_level(7), cmd_que(1)
[ 3.412669] scsi 7:0:6:0: Power-on or device reset occurred
[ 3.413767] end_device-7:0:6: add: handle(0x0010), sas_addr(0x5000cca06d3da1be)
[ 3.413997] mpt2sas_cm0: handle(0x11) sas_address(0x5000cca06d2919ee) port_type(0x1)
[ 3.414669] scsi 7:0:7:0: Direct-Access HGST HUS724020ALS640 A1C4 PQ: 0 ANSI: 6
[ 3.414674] scsi 7:0:7:0: SSP: handle(0x0011), sas_addr(0x5000cca06d2919ee), phy(18), device_name(0x5000cca06d2919ee)
[ 3.414676] scsi 7:0:7:0: enclosure logical id (0x500304800039543f), slot(10)
[ 3.414679] scsi 7:0:7:0: qdepth(254), tagged(1), scsi_level(7), cmd_que(1)
[ 3.414855] scsi 7:0:7:0: Power-on or device reset occurred
[ 3.415859] end_device-7:0:7: add: handle(0x0011), sas_addr(0x5000cca06d2919ee)
[ 3.416099] mpt2sas_cm0: handle(0x12) sas_address(0x5000cca06d28092a) port_type(0x1)
[ 3.416734] scsi 7:0:8:0: Direct-Access HGST HUS724020ALS640 A1C4 PQ: 0 ANSI: 6
[ 3.416739] scsi 7:0:8:0: SSP: handle(0x0012), sas_addr(0x5000cca06d28092a), phy(19), device_name(0x5000cca06d28092a)
[ 3.416741] scsi 7:0:8:0: enclosure logical id (0x500304800039543f), slot(11)
[ 3.416744] scsi 7:0:8:0: qdepth(254), tagged(1), scsi_level(7), cmd_que(1)
[ 3.416921] scsi 7:0:8:0: Power-on or device reset occurred
[ 3.418104] end_device-7:0:8: add: handle(0x0012), sas_addr(0x5000cca06d28092a)
[ 3.418734] mpt2sas_cm0: handle(0x13) sas_address(0x50030480003954bd) port_type(0x1)
[ 3.419509] scsi 7:0:9:0: Enclosure LSI CORP SAS2X36 0718 PQ: 0 ANSI: 5
[ 3.419514] scsi 7:0:9:0: set ignore_delay_remove for handle(0x0013)
[ 3.419516] scsi 7:0:9:0: SES: handle(0x0013), sas_addr(0x50030480003954bd), phy(36), device_name(0x50030480003954bd)
[ 3.419518] scsi 7:0:9:0: enclosure logical id (0x500304800039543f), slot(28)
[ 3.419521] scsi 7:0:9:0: qdepth(254), tagged(1), scsi_level(6), cmd_que(1)
[ 3.419642] scsi 7:0:9:0: Power-on or device reset occurred
[ 3.421412] end_device-7:0:9: add: handle(0x0013), sas_addr(0x50030480003954bd)
[ 9.644938] mpt2sas_cm0: port enable: SUCCESS
```

Each SAS drive is passed through to the guest in KVM via this kind of libvirt configuration:

```
<disk type="block" device="disk">
  <driver name="qemu" type="raw" cache="none" io="native"/>
  <source dev="/dev/sdh1" index="7"/>
  <backingStore/>
  <target dev="vdb" bus="virtio"/>
  <alias name="virtio-disk1"/>
  <address type="pci" domain="0x0000" bus="0x08" slot="0x00" function="0x0"/>
</disk>
```
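When swapping in a spare, I attach the new host disk to the running guest at the same target. A minimal sketch using virsh (the domain name `arch` and the source partition `/dev/sdj1` are placeholders for the real names):

```
# Attach a raw host partition to the guest as virtio disk vdb and
# persist it in the domain XML. "arch" and /dev/sdj1 are illustrative
# placeholders, not my actual domain/device names.
virsh attach-disk arch /dev/sdj1 vdb \
    --targetbus virtio --driver qemu --subdriver raw --cache none \
    --persistent
```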
Host disks are formatted as such:

```
sudo fdisk -l /dev/sdb
Disk /dev/sdb: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: HUS724020ALS640
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xaa654211

Device     Boot Start        End    Sectors  Size Id Type
/dev/sdb1        2048 3907029167 3907027120  1.8T 83 Linux
```

## Guest Details

Arch Linux guest, with a recently updated kernel (I saw some patches in 5.18.2 that made me hopeful, but alas, I'm still having the issue):

```
[root@arch arch]# uname -a
Linux arch 5.18.2-arch1-1 #1 SMP PREEMPT_DYNAMIC Mon, 06 Jun 2022 19:58:58 +0000 x86_64 GNU/Linux
```

Bad disk (currently out of the array):

```
mdadm --examine /dev/vdb1
/dev/vdb1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 238d6359:a2f569cb:7cc48d36:43ba9632
           Name : arch:jbod  (local to host arch)
  Creation Time : Sat Dec  4 18:34:57 2021
     Raid Level : raid6
   Raid Devices : 5

 Avail Dev Size : 3906760847 sectors (1862.89 GiB 2000.26 GB)
     Array Size : 5860141056 KiB (5.46 TiB 6.00 TB)
  Used Dev Size : 3906760704 sectors (1862.89 GiB 2000.26 GB)
    Data Offset : 264192 sectors
   Super Offset : 8 sectors
   Unused Space : before=264112 sectors, after=143 sectors
          State : clean
    Device UUID : 5a1d61a1:a95d4a05:0e5a9337:d6f76dbb

Internal Bitmap : 8 sectors from superblock
    Update Time : Wed Jun  8 12:36:41 2022
  Bad Block Log : 512 entries available at offset 16 sectors
       Checksum : 3eab5336 - correct
         Events : 19946

         Layout : left-symmetric
     Chunk Size : 256K

    Device Role : Active device 0
    Array State : AAAAA ('A' == active, '.' == missing, 'R' == replacing)
```

Good disk:

```
mdadm --examine /dev/vdc1
/dev/vdc1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 238d6359:a2f569cb:7cc48d36:43ba9632
           Name : arch:jbod  (local to host arch)
  Creation Time : Sat Dec  4 18:34:57 2021
     Raid Level : raid6
   Raid Devices : 5

 Avail Dev Size : 3906760847 sectors (1862.89 GiB 2000.26 GB)
     Array Size : 5860141056 KiB (5.46 TiB 6.00 TB)
  Used Dev Size : 3906760704 sectors (1862.89 GiB 2000.26 GB)
    Data Offset : 264192 sectors
   Super Offset : 8 sectors
   Unused Space : before=264112 sectors, after=143 sectors
          State : clean
    Device UUID : d2857c0f:dcd8b710:101dfb14:6b3f7ff5

Internal Bitmap : 8 sectors from superblock
    Update Time : Wed Jun  8 14:08:50 2022
  Bad Block Log : 512 entries available at offset 16 sectors
       Checksum : cfae5712 - correct
         Events : 19954

         Layout : left-symmetric
     Chunk Size : 256K

    Device Role : Active device 1
    Array State : .AAAA ('A' == active, '.' == missing, 'R' == replacing)
```
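One thing that does stand out when comparing the superblocks: the kicked-out disk's event counter (19946) lags the rest of the array (19954), presumably just the updates it missed after the hang. A quick way to compare all members at once (a sketch; I'm assuming the five member partitions are `/dev/vdb1` through `/dev/vdf1`, as in my setup):

```
# Compare event counters, update times, and roles across all members.
for d in /dev/vd{b,c,d,e,f}1; do
    echo "== $d"
    mdadm --examine "$d" | grep -E 'Events|Update Time|Device Role|Array State'
done
```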
Dmesg hang (this first trace is from 5.18.1):

```
Jun 05 13:24:17 arch kernel: INFO: task kworker/0:2:258 blocked for more than 122 seconds.
Jun 05 13:24:17 arch kernel:       Not tainted 5.18.1-arch1-1 #1
Jun 05 13:24:17 arch kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 05 13:24:17 arch kernel: task:kworker/0:2 state:D stack: 0 pid: 258 ppid: 2 flags:0x00004000
Jun 05 13:24:17 arch kernel: Workqueue: md submit_flushes [md_mod]
Jun 05 13:24:17 arch kernel: Call Trace:
Jun 05 13:24:17 arch kernel:  <TASK>
Jun 05 13:24:17 arch kernel:  ? wbt_done+0xb0/0xb0
Jun 05 13:24:17 arch kernel:  __schedule+0x37c/0x11f0
Jun 05 13:24:17 arch kernel:  ? check_preempt_wakeup+0x28b/0x2a0
Jun 05 13:24:17 arch kernel:  ? rwb_trace_step+0x80/0x80
Jun 05 13:24:17 arch kernel:  ? wbt_done+0xb0/0xb0
Jun 05 13:24:17 arch kernel:  schedule+0x4f/0xb0
Jun 05 13:24:17 arch kernel:  io_schedule+0x46/0x70
Jun 05 13:24:17 arch kernel:  rq_qos_wait+0xc0/0x130
Jun 05 13:24:17 arch kernel:  ? karma_partition+0x280/0x280
Jun 05 13:24:17 arch kernel:  ? rwb_trace_step+0x80/0x80
Jun 05 13:24:17 arch kernel:  wbt_wait+0xa6/0x110
Jun 05 13:24:17 arch kernel:  __rq_qos_throttle+0x27/0x40
Jun 05 13:24:17 arch kernel:  blk_mq_submit_bio+0x3c4/0x640
Jun 05 13:24:17 arch kernel:  __submit_bio+0xf2/0x180
Jun 05 13:24:17 arch kernel:  submit_bio_noacct_nocheck+0x20b/0x2c0
Jun 05 13:24:17 arch kernel:  submit_flushes+0xd8/0x150 [md_mod 728f525e20ac2cfceb893dc85f8d68d92d31c960]
Jun 05 13:24:17 arch kernel:  process_one_work+0x1c7/0x380
Jun 05 13:24:17 arch kernel:  worker_thread+0x51/0x380
Jun 05 13:24:17 arch kernel:  ? rescuer_thread+0x3a0/0x3a0
Jun 05 13:24:17 arch kernel:  kthread+0xde/0x110
Jun 05 13:24:17 arch kernel:  ? kthread_complete_and_exit+0x20/0x20
Jun 05 13:24:17 arch kernel:  ret_from_fork+0x22/0x30
Jun 05 13:24:17 arch kernel:  </TASK>
```

On 5.18.2 the dmesg changes slightly:

```
Jun 08 12:48:50 arch kernel: INFO: task md127_raid6:420 blocked for more than 614 seconds.
Jun 08 12:48:50 arch kernel:       Tainted: P           OE     5.18.2-arch1-1 #1
Jun 08 12:48:50 arch kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 08 12:48:50 arch kernel: task:md127_raid6 state:D stack: 0 pid: 420 ppid: 2 flags:0x00004000
Jun 08 12:48:50 arch kernel: Call Trace:
Jun 08 12:48:50 arch kernel:  <TASK>
Jun 08 12:48:50 arch kernel:  __schedule+0x37c/0x11f0
Jun 08 12:48:50 arch kernel:  ? __slab_free+0xe0/0x310
Jun 08 12:48:50 arch kernel:  schedule+0x4f/0xb0
Jun 08 12:48:50 arch kernel:  md_super_wait+0x9f/0xd0 [md_mod afcab5f485650b4a150ec4b265d57c09e8217b2a]
Jun 08 12:48:50 arch kernel:  ? cpuacct_percpu_seq_show+0x20/0x20
Jun 08 12:48:50 arch kernel:  md_bitmap_daemon_work+0x213/0x3a0 [md_mod afcab5f485650b4a150ec4b265d57c09e8217b2a]
Jun 08 12:48:50 arch kernel:  md_check_recovery+0x47/0x5a0 [md_mod afcab5f485650b4a150ec4b265d57c09e8217b2a]
Jun 08 12:48:50 arch kernel:  raid5d+0x8d/0x680 [raid456 a9c4e3a091d6fc6eff1c917206c669e086d59fa9]
Jun 08 12:48:50 arch kernel:  ? lock_timer_base+0x61/0x80
Jun 08 12:48:50 arch kernel:  ? md_set_read_only+0x90/0x90 [md_mod afcab5f485650b4a150ec4b265d57c09e8217b2a]
Jun 08 12:48:50 arch kernel:  ? del_timer_sync+0x73/0xb0
Jun 08 12:48:50 arch kernel:  ? md_set_read_only+0x90/0x90 [md_mod afcab5f485650b4a150ec4b265d57c09e8217b2a]
Jun 08 12:48:50 arch kernel:  md_thread+0xad/0x190 [md_mod afcab5f485650b4a150ec4b265d57c09e8217b2a]
Jun 08 12:48:50 arch kernel:  ? cpuacct_percpu_seq_show+0x20/0x20
Jun 08 12:48:50 arch kernel:  kthread+0xde/0x110
Jun 08 12:48:50 arch kernel:  ? kthread_complete_and_exit+0x20/0x20
Jun 08 12:48:50 arch kernel:  ret_from_fork+0x22/0x30
Jun 08 12:48:50 arch kernel:  </TASK>
```

The kernel is now tainted because I installed the ZFS drivers, since that's my next option.
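The next time it hangs I can also capture kernel stacks for all blocked tasks. Roughly what I plan to run in the guest (a sketch, assuming the magic SysRq interface is available):

```
# Enable all SysRq functions for this boot.
echo 1 > /proc/sys/kernel/sysrq

# Dump stack traces of every task in uninterruptible (D) sleep
# to the kernel log, then read them back.
echo w > /proc/sysrq-trigger
dmesg | tail -n 200

# Per-device in-flight request counts might also be telling
# (grep . prints each file's name alongside its contents).
grep . /sys/block/vd*/inflight
```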