On Fri, Nov 3, 2023 at 9:17 AM Carlos Carvalho <carlos@xxxxxxxxxxxxxx> wrote:
>
> Johannes Truschnigg (johannes@xxxxxxxxxxxxxxx) wrote on Thu, Nov 02, 2023 at 05:34:51AM -03:
> > for the record, I do not think that any of the observations the OP made can be
> > explained by non-pathological phenomena/patterns of behavior. Something is
> > very clearly wrong with how this system behaves (the reported figures do not
> > at all match the expected performance of even a degraded RAID6 array in my
> > experience) and how data written to the filesystem apparently fails to make it
> > into the backing devices in acceptable time.
> >
> > The whole affair reeks either of "subtle kernel bug", or maybe "subtle
> > hardware failure", I think.
>
> Exactly. That's what I've been saying for months...
>
> I found a clear comparison: expanding the kernel tarball in the SAME MACHINE
> with 6.1.61 and 6.5.10. The raid6 array is working normally in both cases. With
> 6.1.61 the expansion works fine, finishes with ~100MB of dirty pages and these
> are quickly sent to permanent storage. With 6.5.* it finishes with ~1.5GB of
> dirty pages that are never sent to disk (I waited ~3h). The disks are idle, as
> shown by sar, and the kworker/flushd runs with 100% cpu usage forever.
>
> Limiting the dirty*bytes in /proc/sys/vm the dirty pages stay low BUT tar is
> blocked in D state and the tarball expansion proceeds so slowly that it'd take
> days to complete (checked with quota).
>
> So 6.5 (and 6.4) are unusable in this case. In another machine, which does
> hundreds of rsync downloads every day, the same problem exists and I also get
> frequent random rsync timeouts.
>
> This is all with raid6 and ext4. One of the machines has a journal disk in the
> raid and the filesystem is mounted with nobarriers. Both show the same
> behavior. It'd be interesting to try a different filesystem but these are
> production machines with many disks and I cannot create another big array to
> transfer the contents.

My array is running 6.5 + xfs, and everything on mine seems to work normally
(speed-wise). And in the perf top he ran, all of the busy kworkers were in
ext4* calls, spending a lot of time doing various filesystem work.

I did find/debug a situation where dropping the cache caused ext4 performance
to be a disaster (large directories, lots of files). It was tracked back to
the fact that ext4 relies on the Buffers: space reported in /proc/meminfo for
(at least) directory entry caching. With a lot of directories and/or files per
directory, Buffers: getting dropped and/or pruned for any reason forces the
fragmented directory entries to be reloaded from a spinning disk, which then
has to seek for *MINUTES* to reload them (in this case there were several
million files in a couple of directories, with the directory entries allocated
over time, so very likely heavily fragmented).

I wonder if there was some change in how Buffers is used/sized/pruned in
recent kernels.

The same drop_caches on an XFS filesystem had no effect that I could identify,
and doing an ls -lR on a big XFS filesystem does not make Buffers grow, but
doing the same ls -lR against ext3/4 makes Buffers grow quite a bit (how much
depends on how many files/directories are on the filesystem).

He may want to monitor Buffers (cat /proc/meminfo | grep Buffers:) and see
whether the poor performance correlates with Buffers suddenly getting smaller
for some reason.
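
To catch that, something like the rough sketch below would log Buffers:,
Dirty: and Writeback: once a second, so the numbers can be lined up with the
tar/rsync slowdowns afterwards (the log path is just an example, untested):

  # log Buffers/Dirty/Writeback from /proc/meminfo once per second
  # (/tmp/meminfo.log is only an example path)
  while sleep 1; do
      printf '%s ' "$(date '+%F %T')"
      awk '/^(Buffers|Dirty|Writeback):/ { printf "%s %s kB  ", $1, $2 }
           END { print "" }' /proc/meminfo
  done >> /tmp/meminfo.log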
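
For what it's worth, the ls -lR / drop_caches behaviour I described above can
be checked with something roughly like this (needs root for drop_caches;
/mnt/ext4 here just stands in for whatever large ext4 tree is handy):

  grep Buffers: /proc/meminfo          # baseline
  ls -lR /mnt/ext4 > /dev/null         # walk a big directory tree
  grep Buffers: /proc/meminfo          # Buffers: grows a lot on ext3/4
  echo 3 > /proc/sys/vm/drop_caches    # drop page cache + dentries/inodes
  grep Buffers: /proc/meminfo          # Buffers: shrinks again
  time ls -lR /mnt/ext4 > /dev/null    # re-walk: the disk seeks to reload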