Re: Linux RAID with btrfs stuck and consume 100 % CPU

Manuel Riel <manu@xxxxxxxxxxxxx> · Sun, 28 Feb 2021 16:34:57 +0800

Hit another mdadm hang today. Reads no longer complete, and md4_raid6 is stuck at 100% CPU.

I've now moved the write journal off the RAID1 device, so it's no longer a "nested" RAID. I hope this helps.

With only a single hardware device serving as the write journal, I suppose write-through mode[1] is the only sensible choice now, since in write-back mode a failure of that one device could lose data.

1: https://www.kernel.org/doc/Documentation/md/raid5-cache.txt
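
For anyone who wants to check which mode a journal is currently in, here is a minimal sketch. It assumes the array is md4 as above and that the kernel exposes the journal_mode sysfs attribute described in [1], with the active mode shown in square brackets. The helper name is my own, not part of mdadm.

```shell
#!/bin/sh
# Print the active journal mode of an md array.
# The sysfs attribute lists both modes and marks the active one with
# square brackets, e.g. "[write-through] write-back".
# active_journal_mode is a hypothetical helper name.
active_journal_mode() {
    # $1: path to the journal_mode attribute
    sed -n 's/.*\[\([a-z-]*\)\].*/\1/p' "$1"
}

# Usage (assumes the array is md4):
#   active_journal_mode /sys/block/md4/md/journal_mode
# Switching to write-through, as root:
#   echo write-through > /sys/block/md4/md/journal_mode
```

The parsing is kept in a function that takes a path so it can be tried against a plain file before pointing it at sysfs.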

> On Feb 11, 2021, at 11:14, Manuel Riel <manu@xxxxxxxxxxxxx> wrote:
> 
> I'm also hitting the exact same problem with XFS on RAID6 using a RAID1 
> write journal on two NVMes. CentOS 8, 4.18.0-240.10.1.el8_3.x86_64.
> 
> Symptoms:
> 
> - high CPU usage of md4_raid6 process
> - IO wait goes up
> - IO on that file system locks up for tens of minutes and the kernel reports:
> 
> [Wed Feb 10 23:23:05 2021] INFO: task md4_reclaim:1070 blocked for more than 20 seconds.
> [Wed Feb 10 23:23:05 2021]       Not tainted 4.18.0-240.10.1.el8_3.x86_64 #1
> [Wed Feb 10 23:23:05 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [Wed Feb 10 23:23:05 2021] md4_reclaim     D    0  1070      2 0x80004000
> 
> Already confirmed it's not a timeout mismatch. No drive errors reported. SCT Error Recovery
> Control is set to 7 seconds.
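
For reference, the timeout-mismatch check mentioned above comes down to comparing each drive's SCT ERC value with the kernel's command timeout for that drive. A rough sketch, assuming smartctl is installed and the array members are plain /dev/sdX devices; erc_below_timeout is a made-up helper name and sdX is a placeholder.

```shell
#!/bin/sh
# Succeed if the drive gives up on a bad sector before the kernel
# gives up on the drive (i.e. no timeout mismatch).
# $1: SCT ERC value in deciseconds (smartctl prints 70 for 7.0 seconds;
#     0 is treated here as "ERC disabled", which counts as a mismatch)
# $2: kernel command timeout in seconds (/sys/block/sdX/device/timeout)
erc_below_timeout() {
    [ "$1" -gt 0 ] && [ "$1" -lt $(( $2 * 10 )) ]
}

# Usage per member disk:
#   smartctl -l scterc /dev/sdX          # shows the read/write ERC values
#   cat /sys/block/sdX/device/timeout    # kernel timeout, commonly 30
#   erc_below_timeout 70 30 && echo "no mismatch"
```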