On 12/22/23 12:48 PM, Carlos Carvalho wrote:
> This is finally a summary of a long standing problem. When lots of writes to
> many files are sent in a short time the kernel gets stuck and stops sending
> write requests to the disks. Sometimes it recovers and finally sends the
> modified pages to permanent storage, sometimes not and eventually other
> functions degrade and the machine crashes.
>
> A simple way to reproduce: expand a kernel source tree, like
>
> xzcat linux-6.5.tar.xz | tar x -f -
This sounds almost exactly like a problem I was having, right down to
triggering it by writing the files of a kernel tree, though the details
in my case are slightly different. I meant to report it, but I wanted to
get a better handle on it first and never managed to, and I've since
changed my setup so that it no longer happens.
> - it happens only with ext4 on a parity raid array
This is where it differs for me: I experienced it only with btrfs. I had
two btrfs arrays, though, one on SSDs and one on HDDs, and the HDD array
exhibited the problem almost exclusively. The SSDs, I think, exhibited
it once in several months, while the HDDs did so pretty much every time
I tried to compile a new kernel (until I started working around it), and
sometimes from other things as well, which came to a couple of times a
week. I imagine that is because HDDs are much slower and therefore allow
more data to get cached.
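In case anyone wants to check that guess, here is a rough sketch for
watching how much unwritten data is sitting in the page cache while the
extraction runs. It assumes a Linux /proc/meminfo with the usual Dirty
and Writeback lines; dirty_stats is just a helper name I made up, and
this is not something I ran as part of the original diagnosis.

```shell
#!/bin/sh
# Print the Dirty and Writeback counters (in kB) from a meminfo-style file.
# dirty_stats is a hypothetical helper name; with no argument it reads
# /proc/meminfo (assumed to be the standard Linux format).
dirty_stats() {
    awk '$1 == "Dirty:" || $1 == "Writeback:" { printf "%s %s kB\n", $1, $2 }' \
        "${1:-/proc/meminfo}"
}

# To sample once per second in a second terminal during the extraction:
#   while sleep 1; do dirty_stats; done
```

If Dirty keeps climbing while Writeback stays near zero for a long
stretch, that would at least be consistent with writes not being sent
to the disks, though it would not prove where the stall is.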
Now that I've switched the HDD array to ext4, I haven't experienced the
issue even once. But the setup has better performance, so maybe it's
just because it flushes its writes faster.
--
PGP fingerprint: 5BBD5080FEB0EF7F142F8173D572B791F7B4422A