Re: Linux RAID with btrfs stuck and consume 100 % CPU

Vojtech Myslivec <vojtech@xxxxxxxxxxxx> · Wed, 17 Mar 2021 16:55:36 +0100

Thanks a lot Manuel for your findings and information.

It's good to know that btrfs is not causing this issue and that the 
common factor is an MD journal placed on another RAID device.

I have moved the journal from a logical volume on RAID1 to a plain 
partition on an SSD, and I will monitor the state.

Vojtech

On 17. 03. 21 5:35, Manuel Riel wrote:
Final update on this issue for anyone who encounters a similar problem in the future:

I didn't observe any "hanging" RAID devices after switching to an ordinary NVMe partition as the journal. So using e.g. another md-RAID1 array as the journal doesn't seem to be supported.
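
For anyone who wants to check whether an array is set up this way, here is a minimal Python sketch. It assumes that /proc/mdstat marks journal members with a "(J)" suffix and that a nested journal shows up under an "md..." name. The function name nested_journals and the sample text are made up for illustration. A journal on an LVM volume would appear as a "dm-..." device instead and would need a further look at its underlying devices, which this sketch does not do.

```python
import re

# One member entry on an mdstat array line, e.g. "sda1[0]" or "md1[4](J)".
MEMBER_RE = re.compile(r"(\S+?)\[\d+\]((?:\([A-Z]\))*)")

# Made-up sample in the usual /proc/mdstat layout (not taken from this thread).
SAMPLE = """\
Personalities : [raid1] [raid6] [raid5] [raid4]
md4 : active raid6 md1[4](J) sdd1[3] sdc1[2] sdb1[1] sda1[0]
      1953260544 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/4] [UUUU]

md1 : active raid1 nvme1n1p2[1] nvme0n1p2[0]
      104790016 blocks super 1.2 [2/2] [UU]

unused devices: <none>
"""


def nested_journals(mdstat_text):
    """Return {array: [journal members that are themselves md devices]}."""
    findings = {}
    for line in mdstat_text.splitlines():
        head, sep, rest = line.partition(" : ")
        # Only lines of the form "mdN : ..." describe an array and its members.
        if not sep or not head.startswith("md"):
            continue
        for name, flags in MEMBER_RE.findall(rest):
            # Assumption: "(J)" marks the journal device of the array.
            if "(J)" in flags and name.startswith("md"):
                findings.setdefault(head, []).append(name)
    return findings


if __name__ == "__main__":
    print(nested_journals(SAMPLE))
```

On a live system the text could be read from /proc/mdstat instead of SAMPLE; an empty result only means that no journal sits directly on an md device.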

The docs[1] say "This means the cache disk must be ... sustainable." The "sustainable" part motivated me to use an md-RAID1 array. I think the docs should mention that the journal can't be on another RAID array.

I'm sending in a patch to emphasize this in the docs.

1: https://www.kernel.org/doc/html/latest/driver-api/md/raid5-cache.html

On Feb 28, 2021, at 4:34 PM, Manuel Riel <manu@xxxxxxxxxxxxx> wrote:

Hit another mdadm "hanger" today. No more reading possible and md4_raid6 stuck at 100% CPU.

I've now moved the write journal off the RAID1 device. So it's not a "nested" RAID any more. Hope this will help.

With only one hardware device used as write cache, I suppose only write-through mode[1] is suggested now.

1: https://www.kernel.org/doc/Documentation/md/raid5-cache.txt
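
To confirm which journal mode an array is actually running in, a small Python sketch follows. It assumes the array exposes the journal_mode attribute under /sys/block/<md device>/md/ described in the raid5-cache document, and that the active mode is the one shown in square brackets (e.g. "[write-through] write-back"). The function names are hypothetical.

```python
from pathlib import Path


def parse_journal_mode(raw):
    """Return the bracketed (active) mode from the journal_mode text."""
    for token in raw.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active mode marked in {raw!r}")


def read_journal_mode(md_name, sysfs_root="/sys/block"):
    """Read and parse journal_mode for an array such as 'md4'."""
    # Assumption: the attribute is only present on arrays that have a journal.
    path = Path(sysfs_root) / md_name / "md" / "journal_mode"
    return parse_journal_mode(path.read_text())


if __name__ == "__main__":
    # Parsing a literal string, so this runs without any md array present.
    print(parse_journal_mode("[write-through] write-back\n"))
```

Calling read_journal_mode("md4") on the affected machine would then show whether the array is in write-through or write-back mode.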