On Fri, Nov 3, 2023 at 9:17 AM Carlos Carvalho <carlos@xxxxxxxxxxxxxx> wrote:
>
> Johannes Truschnigg (johannes@xxxxxxxxxxxxxxx) wrote on Thu, Nov 02, 2023 at 05:34:51AM -03:
> > for the record, I do not think that any of the observations the OP made can be
> > explained by non-pathological phenomena/patterns of behavior. Something is
> > very clearly wrong with how this system behaves (the reported figures do not
> > at all match the expected performance of even a degraded RAID6 array in my
> > experience) and how data written to the filesystem apparently fails to make it
> > into the backing devices in acceptable time.
> >
> > The whole affair reeks either of "subtle kernel bug", or maybe "subtle
> > hardware failure", I think.
>
> Exactly. That's what I've been saying for months...
>
> I found a clear comparison: expanding the kernel tarball in the SAME MACHINE
> with 6.1.61 and 6.5.10. The raid6 array is working normally in both cases. With
> 6.1.61 the expansion works fine, finishes with ~100MB of dirty pages and these
> are quickly sent to permanent storage. With 6.5.* it finishes with ~1.5GB of
> dirty pages that are never sent to disk (I waited ~3h). The disks are idle, as
> shown by sar, and the kworker/flushd runs with 100% cpu usage forever.
>
> Limiting the dirty*bytes in /proc/sys/vm the dirty pages stay low BUT tar is
> blocked in D state and the tarball expansion proceeds so slowly that it'd take
> days to complete (checked with quota).
>
> So 6.5 (and 6.4) are unusable in this case. In another machine, which does
> hundreds of rsync downloads every day, the same problem exists and I also get
> frequent random rsync timeouts.
>
> This is all with raid6 and ext4. One of the machines has a journal disk in the
> raid and the filesystem is mounted with nobarriers. Both show the same
> behavior. It'd be interesting to try a different filesystem but these are
> production machines with many disks and I cannot create another big array to
> transfer the contents.

My array is running 6.5 + xfs, and everything on mine seems to work normally
(speed-wise). And in the perf top he ran, all of the busy kworkers were in
ext4* calls, spending a lot of time doing various filesystem work.

I did find/debug a situation where dropping the cache caused ext4 performance
to be a disaster (large directories, lots of files). It was tracked back to
the fact that ext4 relies on the Buffers: space reported in /proc/meminfo for
(at least) directory entry caching. With a lot of directories and/or files per
directory, Buffers: getting dropped and/or pruned for any reason forces the
fragmented directory entries to be reloaded from a spinning disk, which then
has to seek for *MINUTES* to reload them (in this case there were several
million files in a couple of directories, with the directory entries allocated
over time, so very likely heavily fragmented).

I wonder if there was some change in how Buffers is used/sized/pruned in
recent kernels.

The same drop_caches on an XFS filesystem had no effect that I could identify,
and doing an ls -lR on a big XFS filesystem does not make Buffers grow, but
doing the same ls -lR against ext3/4 makes Buffers grow quite a bit (how much
depends on how many files/directories are on the filesystem).

He may want to monitor Buffers (cat /proc/meminfo | grep Buffers:) and see
whether the poor performance correlates with Buffers suddenly getting smaller
for some reason.
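
To catch that, something like the rough sketch below would log Buffers:,
Dirty: and Writeback: once a second, so the numbers can be lined up with the
tar/rsync slowdowns afterwards (the log path is just an example, untested):

  # log Buffers/Dirty/Writeback from /proc/meminfo once per second
  # (/tmp/meminfo.log is only an example path)
  while sleep 1; do
      printf '%s ' "$(date '+%F %T')"
      awk '/^(Buffers|Dirty|Writeback):/ { printf "%s %s kB  ", $1, $2 }
           END { print "" }' /proc/meminfo
  done >> /tmp/meminfo.log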
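
For what it's worth, the ls -lR / drop_caches behaviour I described above can
be checked with something roughly like this (needs root for drop_caches;
/mnt/ext4 here just stands in for whatever large ext4 tree is handy):

  grep Buffers: /proc/meminfo          # baseline
  ls -lR /mnt/ext4 > /dev/null         # walk a big directory tree
  grep Buffers: /proc/meminfo          # Buffers: grows a lot on ext3/4
  echo 3 > /proc/sys/vm/drop_caches    # drop page cache + dentries/inodes
  grep Buffers: /proc/meminfo          # Buffers: shrinks again
  time ls -lR /mnt/ext4 > /dev/null    # re-walk: the disk seeks to reload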