Hi, I have a single mdadm RAID6 member of a 56-drive RAID60 (7x8) with a kernel thread stuck at 100% CPU. The stuck thread typically shows up during array checks, but it is not the resync thread: md122_raid6 sits at 100% CPU while md122_resync sits at ~0%. When this happens, the reported sync speed keeps dropping until it reaches 4K/sec. Writing "idle" to sync_action also gets stuck (there's a sketch of exactly what I'm running at the end of this message). iostat shows the backing devices aren't doing anything I/O-wise, SMART is clean for all member drives, and dmesg doesn't say anything useful (only after the thread has been hung for a long time does it tell me as much - I'll post that message when the current incident times out). A reboot typically clears the issue, but it takes quite a long time, as the RAID60 is the backing device for a bcache device (with an Optane cache) that has a large mounted XFS file system on top. I figured I could strace the process, but I learned that's impossible with kernel threads :)
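Since strace is out, I've been sampling the thread's kernel stack through /proc instead. A minimal sketch of the loop I use (assuming the thread name is unique on the box, which it is here; reading /proc/<pid>/stack needs root):

    #!/bin/sh
    # Sample the md122_raid6 kernel thread's stack once a second.
    # Kernel threads have no user-space side to strace, but
    # /proc/<pid>/stack shows where they currently are in the kernel.
    PID=$(pgrep -x md122_raid6)
    while :; do
        date
        cat "/proc/$PID/stack"
        echo
        sleep 1
    done

Most samples come back empty; the rest show the handle_stripe path quoted below.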
Output of various things - please let me know what else I can run to help track this down:

/proc/mdstat:

md118 : active raid0 md120[4] md119[5] md123[6] md125[3] md121[0] md124[1] md122[2]
      410183875584 blocks super 1.2 3072k chunks

md119 : active raid6 sdbh[1] sdbi[2] sdan[4] sdbc[0] sdar[7] sdaq[6] sdbe[8] sdao[5]
      58597828608 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/8] [UUUUUUUU]

md120 : active raid6 sdbd[7] sdat[1] sdaz[4] sday[3] sdau[2] sdba[5] sdbb[6] sdas[0]
      58597828608 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/8] [UUUUUUUU]

md121 : active raid6 sdaj[5] sdag[2] sdal[7] sdai[4] sdae[0] sdak[6] sdaf[1] sdah[3]
      58597828608 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/8] [UUUUUUUU]

md122 : active raid6 sdu[7] sdq[3] sdr[4] sdp[2] sdn[0] sdt[6] sds[5] sdo[1]
      58597828608 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/8] [UUUUUUUU]
      [================>....]  check = 81.5% (7963280396/9766304768) finish=147106.8min speed=204K/sec

md123 : active raid6 sdax[7] sdaw[6] sdav[5] sdap[4] sdy[3] sdc[0] sdd[1] sdh[2]
      58597828608 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/8] [UUUUUUUU]

md124 : active raid6 sdab[5] sdaa[4] sdad[7] sdz[3] sdv[0] sdx[2] sdac[6] sdw[1]
      58597828608 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/8] [UUUUUUUU]

md125 : active raid6 sde[0] sdam[7] sdg[2] sdbg[8] sdf[1] sdi[3] sdk[5] sdj[4]
      58597828608 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/8] [UUUUUUUU]

/proc/{PID of md122_raid6}/stack alternates between nothing and:

[<0>] ops_run_io+0x3e/0xdb0 [raid456]
[<0>] handle_stripe+0x144/0x1260 [raid456]
[<0>] handle_active_stripes.isra.0+0x3c5/0x5a0 [raid456]
[<0>] raid5d+0x35c/0x550 [raid456]
[<0>] md_thread+0x97/0x160
[<0>] kthread+0x114/0x150
[<0>] ret_from_fork+0x22/0x30

/proc/{PID of md122_raid6}/status:

Name:   md122_raid6
Umask:  0000
State:  R (running)
Tgid:   2167
Ngid:   0
Pid:    2167
PPid:   2
TracerPid:      0
Uid:    0       0       0       0
Gid:    0       0       0       0
FDSize: 64
Groups:
NStgid: 2167
NSpid:  2167
NSpgid: 0
NSsid:  0
Threads:        1
SigQ:   0/1031010
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: fffffffffffffeff
SigCgt: 0000000000000100
CapInh: 0000000000000000
CapPrm: 000000ffffffffff
CapEff: 000000ffffffffff
CapBnd: 000000ffffffffff
CapAmb: 0000000000000000
NoNewPrivs:     0
Seccomp:        0
Speculation_Store_Bypass:       thread vulnerable
Cpus_allowed:   ffffff
Cpus_allowed_list:      0-23
Mems_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
Mems_allowed_list:      0-1
voluntary_ctxt_switches:        73369830
nonvoluntary_ctxt_switches:     29419786

/proc/{PID of md122_raid6}/stat:

2167 (md122_raid6) R 2 0 0 0 -1 2129984 0 0 0 0 0 5079064 0 0 20 0 1 0 1724 0 0 18446744073709551615 0 0 0 0 0 0 0 2147483391 256 0 0 0 17 21 0 0 390998 0 0 0 0 0 0 0 0 0 0

mdadm -D {raid_60_device}:

/dev/md118:
           Version : 1.2
     Creation Time : Sun Apr 5 13:43:11 2020
        Raid Level : raid0
        Array Size : 410183875584 (391181.83 GiB 420028.29 GB)
      Raid Devices : 7
     Total Devices : 7
       Persistence : Superblock is persistent

       Update Time : Sun Apr 5 13:43:11 2020
             State : clean
    Active Devices : 7
   Working Devices : 7
    Failed Devices : 0
     Spare Devices : 0

            Layout : -unknown-
        Chunk Size : 3072K

Consistency Policy : none

              Name : host:all_spinners
              UUID : 74727e9d:8d3cd62a:48369430:dea1e4eb
            Events : 0

    Number   Major   Minor   RaidDevice State
       0       9      121        0      active sync   /dev/md/host:spinners_1
       1       9      124        1      active sync   /dev/md/host:spinners_2
       2       9      122        2      active sync   /dev/md/host:spinners_3
       3       9      125        3      active sync   /dev/md/host:spinners_4
       4       9      120        4      active sync   /dev/md/host:spinners_5
       5       9      119        5      active sync   /dev/md/host:spinners_6
       6       9      123        6      active sync   /dev/md/host:spinners_7

mdadm -D {md122, frozen device}:

/dev/md122:
           Version : 1.2
     Creation Time : Sat Apr 4 10:12:53 2020
        Raid Level : raid6
        Array Size : 58597828608 (55883.24 GiB 60004.18 GB)
     Used Dev Size : 9766304768 (9313.87 GiB 10000.70 GB)
      Raid Devices : 8
     Total Devices : 8
       Persistence : Superblock is persistent

       Update Time : Mon Feb 15 12:02:41 2021
             State : active, checking
    Active Devices : 8
   Working Devices : 8
    Failed Devices : 0
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : resync

      Check Status : 81% complete

              Name : host:spinners_3
              UUID : 331bc2af:3207b40c:983b923f:14fe1762
            Events : 5869

    Number   Major   Minor   RaidDevice State
       0       8      208        0      active sync   /dev/sdn
       1       8      224        1      active sync   /dev/sdo
       2       8      240        2      active sync   /dev/sdp
       3      65        0        3      active sync   /dev/sdq
       4      65       16        4      active sync   /dev/sdr
       5      65       32        5      active sync   /dev/sds
       6      65       48        6      active sync   /dev/sdt
       7      65       64        7      active sync   /dev/sdu
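And for reference, since "writing idle to sync_action gets stuck" is central here, this is roughly what I'm running (a sketch; md122 is the affected member array on my box):

    # The documented way to cancel a running check; on a healthy
    # array the write returns immediately. Here the echo blocks
    # and never comes back.
    echo idle > /sys/block/md122/md/sync_action

    # From another shell - the action never changes:
    cat /sys/block/md122/md/sync_action    # still prints "check"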