Re: parity raid and ext4 get stuck in writes

Carlos Carvalho <carlos@xxxxxxxxxxxxxx> · Mon, 25 Dec 2023 10:38:29 -0300

Peter Grandi (pg@xxxxxxxxxxxxxxxxxxxxxx) wrote on Mon, Dec 25, 2023 at 07:15:16AM -03:
> >> [...] a long standing problem. When lots of writes to many
> >> files are sent in a short time the kernel gets stuck and
> >> stops sending write requests to the disks. [...] A simple way
> >> to reproduce: expand a kernel source tree, like xzcat
> >> linux-6.5.tar.xz | tar x -f -
> 
> That is a well known (ideally...) consequence of misconfiguring
> both physical storage and the Linux flusher cache so there is a
> high chance of post-saturation congestion under load.
> 
> https://www.sabi.co.uk/blog/anno05-4th.html?051105#051105

No.

It's not a configuration problem; it's a kernel bug. Of course we can reduce
the number and size of dirty pages, as I mentioned myself in the post, but the
bug continues to exist. I even did that to keep a critical server alive. It is a
nuisance, though, because bursts of disk writes take much longer to complete.
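
For anyone who wants to see what "reducing dirty pages" looks like on their own
machine, here is a minimal sketch that only reads the current settings. It
assumes the standard vm.dirty_* sysctls under /proc/sys/vm and the Dirty and
Writeback fields of /proc/meminfo. The helper names (read_dirty_knobs,
parse_meminfo_dirty) are made up for this example, and it does not change
anything or suggest any particular values.

```python
from pathlib import Path

# Writeback-related sysctls exposed under /proc/sys/vm.
KNOBS = (
    "dirty_background_bytes",
    "dirty_background_ratio",
    "dirty_bytes",
    "dirty_ratio",
    "dirty_expire_centisecs",
    "dirty_writeback_centisecs",
)


def read_dirty_knobs(base="/proc/sys/vm"):
    """Return {knob: int} for every knob that can be read under base."""
    values = {}
    for name in KNOBS:
        try:
            values[name] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            continue  # knob missing or unreadable on this system
    return values


def parse_meminfo_dirty(text):
    """Extract the Dirty and Writeback amounts (in kB) from meminfo text."""
    wanted = {"Dirty", "Writeback"}
    found = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            found[key] = int(rest.split()[0])
    return found


if __name__ == "__main__":
    for name, value in read_dirty_knobs().items():
        print(f"vm.{name} = {value}")
    try:
        print(parse_meminfo_dirty(Path("/proc/meminfo").read_text()))
    except OSError:
        print("/proc/meminfo is not readable on this system")
```

Running it repeatedly during a burst of writes shows whether Dirty and
Writeback keep draining or stay flat, which is what a stall looks like from
user space.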

Even with dirty pages restrained, that critical machine still gets stuck after
about 7-10 days and needs a reboot... As time goes by the machine becomes more
susceptible to the bug. Maybe because of memory fragmentation? This is only a
"wild guess"; I have no idea whether it makes sense and agrees with Ojaswin's
findings.
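
One way to check the fragmentation guess over those days would be to log
/proc/buddyinfo periodically. The sketch below assumes the usual layout of that
file (one line per node and zone, followed by free-block counts for orders 0,
1, 2, ...). The function names and the min_order=4 cutoff are arbitrary choices
for illustration, not an established threshold.

```python
from pathlib import Path


def parse_buddyinfo(text):
    """Map (node, zone) to the list of free-block counts, indexed by order."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: Node 0, zone Normal <count order 0> <count order 1> ...
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = int(parts[1].rstrip(","))
        zones[(node, parts[3])] = [int(n) for n in parts[4:]]
    return zones


def high_order_fraction(counts, min_order=4):
    """Fraction of free pages held in blocks of order >= min_order."""
    pages = [n << order for order, n in enumerate(counts)]
    total = sum(pages)
    return sum(pages[min_order:]) / total if total else 0.0


if __name__ == "__main__":
    try:
        text = Path("/proc/buddyinfo").read_text()
    except OSError:
        text = ""  # not available on this system
    for (node, zone), counts in parse_buddyinfo(text).items():
        print(f"node {node} zone {zone}: {high_order_fraction(counts):.2f}")
```

A fraction that keeps shrinking as uptime grows would be consistent with the
fragmentation guess; by itself it would not prove a link to the stalls.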