Peter Grandi (pg@xxxxxxxxxxxxxxxxxxxxxx) wrote on Mon, Dec 25, 2023 at 07:15:16AM -03:
> >> [...] a long standing problem. When lots of writes to many
> >> files are sent in a short time the kernel gets stuck and
> >> stops sending write requests to the disks. [...] A simple way
> >> to reproduce: expand a kernel source tree, like xzcat
> >> linux-6.5.tar.xz | tar x -f -
>
> That is a well known (ideally...) consequence of misconfiguring
> both physical storage and the Linux flusher cache so there is a
> high chance of post-saturation congestion under load.
>
> https://www.sabi.co.uk/blog/anno05-4th.html?051105#051105

No. It's not a configuration problem, it's a kernel bug.

Of course we can reduce the number and size of dirty pages, as I
mentioned myself in the post, but the bug continues to exist. I even
did that to keep a critical server alive. It is a nuisance, though,
because bursts of disk writes take much longer to complete.

Even with dirty pages restrained, after about 7-10 days that critical
machine still gets stuck and needs a reboot. As time goes by, the
machine becomes more susceptible to the bug. Maybe because of memory
fragmentation? This is only a wild guess; I have no idea whether it
makes sense or agrees with Ojaswin's findings.
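
For anyone who wants to compare notes, here is a minimal sketch of how the dirty-page limits and the current amount of dirty data could be read on a Linux box. It is not what I run on the server; the function names are made up for illustration, and it only reads the standard vm.dirty_* sysctls under /proc/sys/vm and the Dirty/Writeback counters in /proc/meminfo. It changes nothing.

```python
from pathlib import Path


def parse_meminfo(text):
    """Extract the Dirty and Writeback counters (in kB) from /proc/meminfo text."""
    wanted = {"Dirty", "Writeback"}
    counters = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            # Lines look like "Dirty:      123 kB"; take the number only.
            counters[key] = int(rest.split()[0])
    return counters


def read_dirty_limits(base="/proc/sys/vm"):
    """Read the writeback thresholds that exist under `base`; skip missing ones."""
    names = (
        "dirty_background_bytes",
        "dirty_bytes",
        "dirty_background_ratio",
        "dirty_ratio",
    )
    limits = {}
    for name in names:
        path = Path(base) / name
        if path.exists():
            limits[name] = int(path.read_text().strip())
    return limits


if __name__ == "__main__":
    print("limits:", read_dirty_limits())
    print("now (kB):", parse_meminfo(Path("/proc/meminfo").read_text()))
```

Running it periodically while the tar extraction is going should show whether Dirty keeps growing while Writeback stays flat, which is roughly what a stall would look like. That reading is an assumption on my part, not something confirmed.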