On Fri, 3 Nov 2023 11:16:57 -0300 Carlos Carvalho <carlos@xxxxxxxxxxxxxx> wrote:

> Johannes Truschnigg (johannes@xxxxxxxxxxxxxxx) wrote on Thu, Nov 02, 2023 at 05:34:51AM -03:
> > for the record, I do not think that any of the observations the OP made can be
> > explained by non-pathological phenomena/patterns of behavior. Something is
> > very clearly wrong with how this system behaves (the reported figures do not
> > at all match the expected performance of even a degraded RAID6 array in my
> > experience) and how data written to the filesystem apparently fails to make it
> > into the backing devices in acceptable time.
> >
> > The whole affair reeks either of "subtle kernel bug", or maybe "subtle
> > hardware failure", I think.
>
> Exactly. That's what I've been saying for months...
>
> I found a clear comparison: expanding the kernel tarball in the SAME MACHINE
> with 6.1.61 and 6.5.10. The raid6 array is working normally in both cases. With
> 6.1.61 the expansion works fine, finishes with ~100MB of dirty pages and these
> are quickly sent to permanent storage. With 6.5.* it finishes with ~1.5GB of
> dirty pages that are never sent to disk (I waited ~3h). The disks are idle, as
> shown by sar, and the kworker/flushd runs with 100% cpu usage forever.

If you have a 100% reliable way to reproduce this, the ideal next step would be
a bisect to narrow it down to the commit that introduced the problem.

Of course it might not be feasible to reboot dozens of times on production
machines. Still, as a start you could narrow it down further, for example by
checking kernels around 6.2 and 6.3. Those are no longer offered on kernel.org,
but they should be retrievable from distro repositories or git.

Also check the 6.6 kernel, which was released recently.

--
With respect,
Roman
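P.S. In case it helps, a rough sketch of the bisect workflow between the
known-good and known-bad kernels could look like the following. This assumes a
clone of the mainline tree and uses the v6.1/v6.5 tags as endpoints; adjust to
whichever versions you have actually verified:

    $ git bisect start
    $ git bisect bad v6.5     # first release where dirty pages pile up
    $ git bisect good v6.1    # last release known to flush normally
    # build and boot the commit git checks out, rerun the tarball test,
    # then report the result:
    $ git bisect good         # or: git bisect bad
    # repeat until git prints the first bad commit, then:
    $ git bisect reset

Each round needs a rebuild and a reboot, which is why narrowing the range with
the 6.2/6.3/6.4 stable releases first would cut the number of iterations
considerably.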