On 17. 09. 20 19:08, Chris Murphy wrote:
> On Wed, Sep 16, 2020 at 3:42 AM Vojtech Myslivec wrote:
>>
>> Description of the devices in iostat, just for recap:
>> - sda-sdf: 6 HDD disks
>> - sdg, sdh: 2 SSD disks
>>
>> - md0: raid1 over sdg1 and sdh1 ("SSD RAID", Physical Volume for LVM)
>> - md1: our "problematic" raid6 over sda-sdf ("HDD RAID", btrfs
>>   formatted)
>>
>> - Logical volumes over md0 Physical Volume (on SSD RAID)
>>   - dm-0: 4G LV for SWAP
>>   - dm-1: 16G LV for root file system (ext4 formatted)
>>   - dm-2: 1G LV for md1 journal
>
> It's kindof a complicated setup. When this problem happens, can you
> check swap pressure?
>
> /sys/fs/cgroup/memory.stat
>
> pgfault and maybe also pgmajfault - see if they're going up; or also
> you can look at vmstat and see how heavy swap is being used at the
> time. The thing is.
>
> Because any heavy eviction means writes to dm-0->md0 raid1->sdg+sdh
> SSDs, which are the same SSDs that you have the md1 raid6 mdadm
> journal going to. So if you have any kind of swap pressure, it very
> likely will stop the journal or at least substantially slow it down,
> and now you get blocked tasks as the pressure builds more and more
> because now you have a ton of dirty writes in Btrfs that can't make it
> to disk.
>
> If there is minimal swap usage, then this hypothesis is false and
> something else is going on. I also don't have an explanation why your
> work around works.

On 17. 09. 20 19:20, Chris Murphy wrote:
> The iostat isn't particularly revealing, I don't see especially high
> %util for any device. SSD write MB/s gets up to 42 which is
> reasonable.

On 17. 09. 20 19:43, Chris Murphy wrote:
> [Mon Aug 31 15:31:55 2020] sysrq: Show Blocked State
> [Mon Aug 31 15:31:55 2020]   task            PC stack   pid father
>
> [Mon Aug 31 15:31:55 2020] md1_reclaim     D    0   806      2 0x80004000
> [Mon Aug 31 15:31:55 2020] Call Trace:
> ...
>
> *shrug*
>
> These SSDs should be able to handle > 500MB/s. And > 130K IOPS. Swap
> would have to be pretty heavy to slow down journal writes.
>
> I'm not sure I have any good advise. My remaining ideas involve
> changing configuration just to see if the problem goes away, rather
> than actually understanding the cause of the problem.

OK, I see. This is a physical server with 32 GB of RAM, dedicated to
backup tasks. Our monitoring shows there is (almost) no swap usage at
any time, so I hope this is not the problem.

However, I will watch the stats you mentioned and, for a start, I will
disable swap for several days (the exact counters and commands I plan
to use are in the PS below). Swap is there only as a "backup" just in
case, and it is not used at all most of the time.

Sadly, I am not able to _disable the journal_: if I do, just by removing
the device from the array, the MD device instantly fails and the btrfs
volume remounts read-only. I cannot find any other way to disable the
journal; it seems it is not supported. I can see only an `--add-journal`
option and no corresponding `--delete-journal` one.

I welcome any advice on how to exchange the write-journal for an
internal bitmap.

Other possible changes that come to my mind are:

- Enlarge the write-journal
- Move the write-journal to the physical sdg/sdh SSDs (out of the md0
  raid1 device)

I find the latter a bit risky, as the write-journal would no longer be
redundant. That is the reason we chose to put the write-journal on a
RAID device in the first place. (A rough sketch of the commands I have
in mind for the move is in the PS below.)

Vojtech
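
PS: For the record, this is roughly how I plan to watch for swap
pressure while the problem is happening. It is just a minimal sketch
using the counters you mentioned; the memory.stat path assumes the
unified cgroup v2 hierarchy:

    # page-fault counters (pgfault/pgmajfault) - check whether they keep growing
    grep -E '^(pgfault|pgmajfault) ' /sys/fs/cgroup/memory.stat

    # swap-in/swap-out rates in the si/so columns, sampled every 5 seconds
    vmstat 5

    # overall swap usage
    grep -i swap /proc/meminfo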
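
Disabling swap for the test should be as simple as this (assuming the
swap LV on dm-0 is the only swap device and is listed in /etc/fstab;
the fstab line below is a hypothetical example, not our real entry):

    # turn off all active swap devices at runtime
    swapoff -a

    # and comment out the swap line in /etc/fstab so it stays off
    # across reboots, e.g. a line like:
    #   /dev/mapper/vg0-swap  none  swap  sw  0  0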
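
And for completeness, this is the kind of sequence I had in mind for
moving the journal to a raw SSD partition (the second idea above). It
is an untested sketch: failing the journal will most likely flip the
array read-only as described, and I do not know whether --add-journal
plus --readwrite brings it back without a stop/assemble cycle. sdg2 is
a hypothetical spare partition here, and /dev/dm-2 stands for the
current journal LV:

    # inspect the current journal device and array state first
    mdadm --detail /dev/md1
    cat /proc/mdstat

    # fail and remove the current journal LV (expect the array to go
    # read-only at this point)
    mdadm /dev/md1 --fail /dev/dm-2 --remove /dev/dm-2

    # attach the new journal on a raw SSD partition, then try to
    # switch the array back to read-write
    mdadm /dev/md1 --add-journal /dev/sdg2
    mdadm --readwrite /dev/md1

If we ever manage to drop the journal entirely, switching to an
internal write-intent bitmap instead would presumably be just:

    mdadm --grow /dev/md1 --bitmap=internal

(as far as I understand, a journal and an internal bitmap cannot
coexist, so the journal would have to go away first).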