I'm also hitting the exact same problem with XFS on RAID6 using a RAID1
write journal on two NVMes. CentOS 8, 4.18.0-240.10.1.el8_3.x86_64.

Symptoms:

- high CPU usage of the md4_raid6 process
- IO wait goes up
- IO on that file system locks up for tens of minutes and the kernel
  reports:

[Wed Feb 10 23:23:05 2021] INFO: task md4_reclaim:1070 blocked for more than 20 seconds.
[Wed Feb 10 23:23:05 2021] Not tainted 4.18.0-240.10.1.el8_3.x86_64 #1
[Wed Feb 10 23:23:05 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Wed Feb 10 23:23:05 2021] md4_reclaim D 0 1070 2 0x80004000

I've already confirmed it's not a timeout mismatch. No drive errors are
reported. SCT Error Recovery Control is set to 7 seconds.

>> It's kindof a complicated setup. When this problem happens, can you
>> check swap pressure?

There is a RAID1 swap partition, but it's almost unused, since the
server has ample RAM.

>>
>> /sys/fs/cgroup/memory.stat
>>
>> pgfault and maybe also pgmajfault - see if they're going up; or also
>> you can look at vmstat and see how heavy swap is being used at the
>> time. The thing is.
>>
>> Because any heavy eviction means writes to dm-0->md0 raid1->sdg+sdh
>> SSDs, which are the same SSDs that you have the md1 raid6 mdadm
>> journal going to. So if you have any kind of swap pressure, it very
>> likely will stop the journal or at least substantially slow it down,
>> and now you get blocked tasks as the pressure builds more and more
>> because now you have a ton of dirty writes in Btrfs that can't make it
>> to disk.

I've disabled swap to test this theory.

>> If there is minimal swap usage, then this hypothesis is false and
>> something else is going on. I also don't have an explanation why your
>> work around works.
>
> Sadly, I am not able to _disable the journal_ if I do - just by removing
> the device from the array - the MD device instantly fails and btrfs
> volume remounts read-only. I can not find any other way how to disable
> the journal, it seems it is not supported. I can see only
> `--add-journal` option and no corresponding `--delete-journal` one.
>
> I welcome any advice how to exchange write-journal with internal bitmap.

I read that the array needs to be in read-only mode. Then you can fail
and replace the write journal. (not tested)

# mdadm --readonly /dev/md0
# mdadm /dev/md0 --fail /dev/<journal>
# mdadm --manage /dev/md0 --add-journal /dev/<new-journal>

> Any other possible changes that comes to my mind are:
> - Enlarge write-journal

My write journal is about 180 GB.

> - Move write-journal to physical sdg/sdh SSDs (out from md0 raid1
> device).

I may try this, but as you say it's risky, especially when using
"write-back" journal mode.

> I find the later a bit risky, as the write-journal is not redundant
> then. That's the reason we choose write journal on RAID device.

I'm also experimenting with write-back/write-through mode and different
stripe_cache_size values, hoping to find something. If nothing helps,
maybe putting a write journal on another RAID device is simply not
possible/supported?
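
For anyone who wants to compare settings, here is a minimal sketch for
reading the two knobs mentioned above. It assumes the array is md4 and
that the kernel exposes the usual md sysfs attributes journal_mode and
stripe_cache_size under /sys/block/<array>/md/. The function name and
the optional second argument (an alternate sysfs root, handy for dry
runs) are made up for illustration.

```shell
#!/bin/sh
# Print the write-journal mode and stripe cache size of one md array.
#   $1: array name (assumption: md4, as in the report above)
#   $2: sysfs root (hypothetical override; defaults to /sys)
show_md_tunables() {
    md_name="${1:-md4}"
    sysfs_root="${2:-/sys}"
    md_dir="$sysfs_root/block/$md_name/md"

    # Bail out if this is not an md array (or the name is wrong).
    if [ ! -d "$md_dir" ]; then
        echo "no md sysfs directory at $md_dir" >&2
        return 1
    fi

    # journal_mode only exists on arrays with a write journal;
    # stripe_cache_size only exists on RAID4/5/6 arrays.
    for attr in journal_mode stripe_cache_size; do
        if [ -r "$md_dir/$attr" ]; then
            printf '%s: %s\n' "$attr" "$(cat "$md_dir/$attr")"
        else
            printf '%s: (not available)\n' "$attr"
        fi
    done
}
```

Calling "show_md_tunables md4" prints both values; the active journal
mode is the one shown in square brackets. Changing them is a plain
write to the same files as root (the 4096 below is only an example
value, not a recommendation):

# echo write-back > /sys/block/md4/md/journal_mode
# echo 4096 > /sys/block/md4/md/stripe_cache_size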