If these are spinning disks, there is a hard limit on the total number of I/O operations (IOPS), read or write, that can be completed per second. In RAID5/6 a small read is one small read, but a small write requires several disks to be read and/or written to complete that single write. When a disk has failed, a read whose data would have been on the missing disk must instead read all the other disks and recalculate the missing data from parity. If your array was already busy, the extra reads and recalculations can push it past its I/O limit.

I set these two parameters to limit the total amount of dirty cache, and this appears to make the system more responsive:

[root@bm-server ~]# sysctl -a | grep -i dirty | grep -i bytes
vm.dirty_background_bytes = 3000000
vm.dirty_bytes = 5000000

The issue is that once you hit dirty_bytes, all writes stop until writeback drains the cache back down to dirty_background_bytes, and the greater the difference between the two, the longer the freeze lasts. Note that by default most distributions don't use the _bytes variants; they use dirty_ratio/dirty_background_ratio, which are percentages of memory. If you have a lot of RAM, even a single percent is a large amount of dirty data, so the write freeze can last a significant amount of time.

You might also want to look at the I/O on the sd* member devices under the md device, as that would show what MD is doing under the covers to make the I/O happen (sar and/or iostat).

On Tue, Oct 31, 2023 at 7:39 AM Carlos Carvalho <carlos@xxxxxxxxxxxxxx> wrote:
>
> eyal@xxxxxxxxxxxxxx (eyal@xxxxxxxxxxxxxx) wrote on Tue, Oct 31, 2023 at 06:29:14AM -03:
> > More evidence that the problem relates to the cache not flushed to disk.
>
> Yes.
>
> > It seems that the array is slow to sync files somehow. Mythtv has no problems because it write
> > only a few large files. rsync copies a very large number of small files which somehow triggers
> > the problem.
>
> Mee too. Writing few files works fine, the problem happens when many files need
> flushing.
> That's why expanding the kernel tree blocks the machine. After many
> hours it either crashes or I have to do a hard reboot because all service
> stops.
>
> It also happens with 6.1 but 6.5 is a lot more susceptible. Further, the longer
> the uptime the more prone to deadlock the machine becomes...
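For a rough sense of why the percentage-based defaults can freeze writes for so long, here is a back-of-the-envelope sketch. It is not from the thread; the 64 GiB RAM figure, the 20%/10% defaults, and the 100 MB/s throughput number are all assumptions for illustration:

```python
# Compare the freeze window of typical percentage-based dirty thresholds
# against the byte-based settings from the post above.

mem_total_bytes = 64 * 1024**3          # assumed: a 64 GiB machine

# Common distribution defaults (percent of RAM) -- assumed values:
dirty_ratio = 20                        # vm.dirty_ratio
dirty_background_ratio = 10             # vm.dirty_background_ratio

default_hard_limit = mem_total_bytes * dirty_ratio // 100
default_bg_limit = mem_total_bytes * dirty_background_ratio // 100

# Byte-based settings quoted in the post:
tuned_hard_limit = 5_000_000            # vm.dirty_bytes
tuned_bg_limit = 3_000_000              # vm.dirty_background_bytes

# The "freeze window" is roughly the gap between the hard limit and the
# background limit: once the hard limit is hit, writers stall until
# writeback drains the dirty cache back toward the background threshold.
default_gap = default_hard_limit - default_bg_limit
tuned_gap = tuned_hard_limit - tuned_bg_limit

print(f"default gap: {default_gap / 1024**3:.1f} GiB")
print(f"tuned gap:   {tuned_gap / 1024**2:.1f} MiB")

# At an assumed 100 MB/s of sustained array throughput, draining the
# default gap takes on the order of a minute; the tuned gap, well under
# a second -- and a degraded RAID5/6 array will drain far slower still.
```

The numbers are made up, but the shape of the argument matches the post: with percentage thresholds and lots of RAM, the distance between "start background writeback" and "block all writers" is gigabytes, and that is the window in which the machine appears frozen.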