On Mar 20, 2021, at 7:16 AM, Song Liu <song@xxxxxxxxxx> wrote:
>
> Sorry for being late on this issue.
>
> Manuel and Vojtech, are we confident that this issue only happens when we use
> another md array as the journal device?
>
> Thanks,
> Song

Hi Song,

thanks for getting back. Unfortunately it's still happening, even when using an NVMe partition directly as the journal device; it just took three weeks to reoccur. So please discard my patch.

Here is how it went down yesterday:

- The md4_raid6 process runs at 100% CPU utilization and all I/O to the array is blocked.
- There is no disk activity on the physical drives.
- A soft reboot doesn't work because md4_raid6 blocks, so a hard reset is needed.
- When booting into rescue mode, it tries to assemble the array and shows the same 100% CPU utilization. Rebooting is again impossible.
- When assembling the array manually *with* the journal drive, it reads a few GB from the journal device and then gets stuck at 100% CPU utilization again, without any disk activity.

The solution in the end was to avoid assembling the array on reboot, then assemble it *without* the existing journal and add an empty journal drive later. This led to some data loss and a full resync.

I'm currently moving all data off this machine and will repave it, then see whether that changes anything.

My main OS is CentOS 8 and the rescue system was Debian; both showed a similar issue, so this must be connected to the journal drive somehow. My journal drive is a ~180 GB partition on an NVMe device.

Thanks for any pointers I could try next.

Manu