> On 2/16/23 12:15, Sebastian Andrzej Siewior wrote:
> > On 2023-01-18 13:52:21 [-0500], Joseph Salisbury wrote:
> >> I'll add more details to this thread as I continue.
> > Any update on this?
> > Does the system really hang? The dmesg says:
> > |[ 8235.110075] INFO: task stress-ng:9466 blocked for more than 122 seconds.
> >
> > which means stress-ng is blocked for quite some time due to I/O
> > according to the backtrace. This appears once for each stress-ng
> > process, 10 times in total. It does not repeat and the system runs at
> > least until
> >
> > | [50733.471625] hid-generic 0003:03F0:7029.0004: input,hidraw1: USB HID v1.10 Mouse [iLO Virtual Keyboard] on usb-0000:01:00.4-1/input1
> >
> > ~11h after that report.
> > Based on that it looks like the system complained about slow I/O but
> > did not hang as it completed its task.
> >
> > Sebastian

On Fri, Mar 10, 2023 at 1:09 PM Joseph Salisbury wrote:
> A bisect has not provided additional detail. This issue does not appear
> to be a regression and appears to have always existed.
>
> I was able to get additional debug info by enabling
> CONFIG_DEBUG_PREEMPT, CONFIG_PROVE_LOCKING and CONFIG_JBD2_DEBUG.
> Enabling these configs shows a circular locking issue[0] and a call
> trace[1].
>
> I don't think the circular locking report is related. I think you're
> correct that the system is not actually hanging; the interactive
> response just makes it seem hung. For example, once the issue starts
> to happen, no other interactive commands can be issued without taking
> at least days (I never waited more than 3 days :-) ). I'm also not
> able to log in or log out while the system "appears" hung. I was able
> to get a sysrq-W while the system was in this state[2].
>
> I think I may have started investigating too deep at first (by
> bisecting, enabling tracing, etc.). I stepped back and looked at the
> high-level stats. The stress-ng test is started with one process for
> each core, and there are 96 of them. I looked at top[3] during a hang,
> and many of the stress-ng processes are running 'R'. However, a
> sysrq-q also shows many stress-ng processes in 'D', uninterruptible
> sleep. What also sticks out to me is that all the stress-ng processes
> are running as root with a priority of 20. Looking back at one of the
> call traces[1], I see jbd2 stuck in an uninterruptible state:
> ...
> [ 4461.908213] task:journal-offline state:D stack:    0 pid:17541 ppid:     1 flags:0x00000226
> ...
>
> The jbd2 kernel thread also runs with a priority of 20[4]. When the
> hang happens, jbd2 is also stuck in an uninterruptible state (as is
> systemd-journal):
>  1521 root      20   0       0      0      0 D   0.0  0.0   4:10.48 jbd2/sda2-8
>  1593 root      19  -1   64692  15832  14512 D   0.0  0.1   0:01.54 systemd-journal
>
> I don't yet know why running the test the same way on a generic kernel
> does not cause this behavior when it does on a preempt-rt kernel.
> Maybe it's a case of priority 'sameness' and not priority inversion :-) ?
>
> I tried to pin all of the stress-ng threads to cores 1-95 and the
> kernel threads to a housekeeping cpu, 0. I recall, though, that there
> are certain kernel threads that need to run on every core, and kworker
> is one of them.
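>
> (A sketch of the kind of explicit placement I mean -- the stress-ng
> options below are only placeholders for whatever the test actually
> runs, and the workqueue knob affects unbound kworkers only, not the
> per-CPU ones:)
>
>   # run the stress-ng workers only on the isolated cores 1-95
>   taskset -c 1-95 stress-ng --hdd 95 --timeout 600s
>
>   # steer unbound workqueues (unbound kworkers) onto cpu 0
>   echo 1 > /sys/devices/virtual/workqueue/cpumask
>
>   # movable kernel threads such as jbd2 can be pinned to cpu 0 as well
>   taskset -pc 0 $(pgrep jbd2/sda2-8)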
> Output from cmdline:
> "BOOT_IMAGE=/boot/vmlinuz-5.15.0-1033-realtime
> root=UUID=3583d8c4-d539-439f-9d50-4341675268cc ro console=tty0
> console=ttyS0,115200 skew_tick=1 isolcpus=managed_irq,domain,1-95
> intel_pstate=disable nosoftlockup tsc=nowatchdog
> crashkernel=0M-2G:128M,2G-6G:256M,6G-8G:512M,8G-:768M"
>
> However, even with this pinning, stress-ng ends up running on cpu 0,
> per the ps output[4]. This may be why it is interfering with jbd2.
>
> I'll see if I can modify the test to run as a non-root user or with a
> lower priority. I could also try bumping the priority of jbd2. Maybe
> one of these would allow the journal to complete its work and the test
> to finish?
>
> Could it be that the system is not hung, it is just waiting to
> complete I/O, which will never happen since the jbd2 threads are
> stuck? In that case, this is not a bug, but a test that is not
> configured correctly for a real-time system. Does that sound
> plausible? If you think that is the case, I'll talk with the bug
> reporter and assist them with running the test properly on a
> real-time system.

Have you tried checking for low memory during the test? Maybe the
system is unable to write because of slow page-cache allocations (see
https://www.socallinuxexpo.org/sites/default/files/presentations/Exploring%20%20Linux%20Memory%20Usage%20and%20%20Disk%20IO%20performance%20version%203.pdf)
or perhaps there is massive inter-NUMA-node rebalancing going on in
such a large system? Turning on CONFIG_PSI is a relatively easy way to
monitor memory problems.

Also, have you tried connecting to systemd-journald with GDB during the
test, to see what it is doing? Or tried calculating whether the
bandwidth to your storage devices is simply maxed out?

--
Alison Chaiken
Aurora Innovation
achaiken@xxxxxxxxxxx
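A minimal sketch of those last checks, assuming CONFIG_PSI=y in the
running kernel and that the procps, sysstat and gdb packages are
installed:

  # memory and I/O pressure stall information, refreshed every second
  watch -n1 'grep -H . /proc/pressure/memory /proc/pressure/io'

  # extended per-device statistics, to see whether sda is saturated
  iostat -xz 1

  # backtraces of all systemd-journald threads (debug symbols help here)
  gdb --batch -ex 'thread apply all bt' -p "$(pidof systemd-journald)"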