Dr. David Alan Gilbert (dave@xxxxxxxxxxx) wrote on Mon, Aug 14, 2023 at 06:02:53PM -03:
> I'm seeing a few hangs on a fs after upgrading to fedora 39's bleeding
> edge; which is running kernel 6.5.0-0.rc5.20230808git14f9643dc90a.37.fc39.x86_64
> It was always solid prior to that. It seems to trigger on heavy IO
> on this fs.

Good news! No, I didn't forget the smiley... Maybe now the problem has
become sufficiently bad to be visible/solvable...

6.4.* also doesn't run on one of our machines, which has heavy I/O load.
The first symptom is that rsync downloads hang and abort with a timeout.
1 or 2 days later the amount of modified pages waiting to go to disk
reaches several GB, as reported by /proc/meminfo, but the disks remain
idle. Finally, reading from the arrays collapses.

This is just the worst case. Since early 5.*, I/O performance has dropped
absurdly. On all our disk servers this is easy to see: just generate lots
of writes quickly (for example, by expanding a kernel tarball). Using top
I see that kworker starts using 100% cpu but the disks stay idle (as seen
by dstat or sar). If you do a sync or umount, it takes looooong to reach
~0 modified pages so that the sync or umount can return.

On the server I mentioned above, where 6.4.* doesn't stand the load and
which is one of the largest free software mirrors in the world, even 6.1
sometimes collapses: I/O becomes so slow that service (apache) stops.

The problem gets progressively worse with time after booting. It's hardly
noticeable in the first hour after boot, and easily seen after ~3-4 days
of uptime. The higher the (write) I/O load, the faster it appears.

All this is with ext4 and raid6 with >~ 14 disks in the arrays.

I don't have debug info because these are production machines and I only
compile into the kernel the bare minimum essential for operation. It's
always pure kernel.org releases; gcc versions vary: for 6.4.* it's gcc-13,
for 6.1.* gcc-12 is used, on Debian unstable updated more than 4
times/week.
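
For anyone who wants to watch the same symptom, the "modified pages waiting to go to disk" figure can be read from the Dirty and Writeback fields of /proc/meminfo. What follows is only a minimal sketch in Python, not something from the original report: the function names (parse_meminfo, dirty_and_writeback_kb, watch) and the 5-second default interval are made up for illustration. It only logs the counters over time; it does not diagnose why the disks stay idle.

```python
import time


def parse_meminfo(text):
    """Map each /proc/meminfo field name to its integer value.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are
    bare counts. The unit suffix, if any, is ignored here.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def dirty_and_writeback_kb(text):
    """Return (Dirty, Writeback) in kB, using 0 for a missing field."""
    fields = parse_meminfo(text)
    return fields.get("Dirty", 0), fields.get("Writeback", 0)


def watch(count, interval=5.0, path="/proc/meminfo"):
    """Print `count` timestamped samples, `interval` seconds apart."""
    for i in range(count):
        with open(path) as f:
            dirty, writeback = dirty_and_writeback_kb(f.read())
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} Dirty={dirty / 1024:.1f} MiB "
              f"Writeback={writeback / 1024:.1f} MiB", flush=True)
        if i + 1 < count:
            time.sleep(interval)
```

Calling, say, watch(720, 5.0) logs one hour of samples. Run next to dstat or sar, a Dirty value that keeps growing while the disks show no write activity would match the behaviour described above.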