My impression is that the write-journal feature isn't fully stable yet, as was already reported in 2019 [1]. Vojtech and I are seeing the same errors as mentioned there, regardless of whether the journal sits on a raw block device or on another RAID array.

[1] https://www.spinics.net/lists/raid/msg62646.html

> On Mar 20, 2021, at 9:12 AM, Manuel Riel <manu@xxxxxxxxxxxxx> wrote:
> 
> On Mar 20, 2021, at 7:16 AM, Song Liu <song@xxxxxxxxxx> wrote:
>> 
>> Sorry for being late on this issue.
>> 
>> Manuel and Vojtech, are we confident that this issue only happens when we use
>> another md array as the journal device?
>> 
>> Thanks,
>> Song
> 
> Hi Song,
> 
> thanks for getting back.
> 
> Unfortunately it's still happening, even when using an NVMe partition directly. It just took a long three weeks to happen, so please discard my patch. Here's how it went down yesterday:
> 
> - The md4_raid6 process runs at 100% CPU utilization and all I/O to the array is blocked.
> - There is no disk activity on the physical drives.
> - A soft reboot doesn't work, as md4_raid6 blocks, so a hard reset is needed.
> - When booting into rescue mode, the system tries to assemble the array and shows the same 100% CPU utilization. It can't reboot either.
> - When assembling the array manually *with* the journal drive, it reads a few GB from the journal device and then gets stuck at 100% CPU utilization again, without any disk activity.
> 
> The solution in the end was to keep the array from being assembled on reboot, assemble it manually *without* the existing journal, and add an empty journal drive later. This led to some data loss and a full resync.
> 
> I'm currently moving all data off this machine and will repave it, then see if that changes anything.
> 
> My main OS is CentOS 8 and the rescue system was Debian; both showed a similar issue, so this must be connected to the journal drive somehow.
> 
> My journal drive is a ~180 GB partition on an NVMe device.
> 
> Thanks for any pointers I could try next.
> 
> Manu
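
For anyone else who hits this lockup: the recovery Manu describes above should roughly correspond to the mdadm invocations below. This is only a sketch, not a tested procedure; the device names (/dev/md4, /dev/sd[b-g]1, /dev/nvme0n1p2) are examples for illustration, and --force/--run may or may not be sufficient to start a journaled array whose journal device is absent.

  # Keep the array from being assembled automatically at boot, e.g. by
  # commenting out its ARRAY line in /etc/mdadm.conf (and regenerating
  # the initramfs if it embeds that config).

  # Assemble the array manually, listing the member disks but NOT the
  # stuck journal device; force-starting is needed because a component
  # (the journal) is missing.
  mdadm --assemble --force --run /dev/md4 /dev/sd[b-g]1

  # Once the array is running (expect a full resync), attach a fresh,
  # empty journal device.
  mdadm /dev/md4 --add-journal /dev/nvme0n1p2

As Manu notes, expect a resync and possibly some data loss for writes that were still sitting in the old journal.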