> On 2/16/23 12:15, Sebastian Andrzej Siewior wrote:
> > On 2023-01-18 13:52:21 [-0500], Joseph Salisbury wrote:
> >> I'll add more details to this thread as I continue.
> > Any update on this?
> > Does the system really hang? The dmesg says:
> > |[ 8235.110075] INFO: task stress-ng:9466 blocked for more than 122 seconds.
> >
> > which means stress-ng is blocked for quite some time due to I/O
> > according to the backtrace. This appears once for each stress-ng
> > process, 10 times in total. It does not repeat and the system runs at
> > least until
> >
> > | [50733.471625] hid-generic 0003:03F0:7029.0004: input,hidraw1: USB HID v1.10 Mouse [iLO Virtual Keyboard] on usb-0000:01:00.4-1/input1
> >
> > ~11h after that report.
> > Based on that it looks like the system complained about slow I/O but
> > did not hang as it completed its task.
> >
> > Sebastian

On Fri, Mar 10, 2023 at 1:09 PM Joseph Salisbury wrote:
> A bisect has not provided additional detail. This issue does not appear
> to be a regression and appears to have always existed.
>
> I was able to get additional debug info by enabling
> CONFIG_DEBUG_PREEMPT, CONFIG_PROVE_LOCKING and CONFIG_JBD2_DEBUG.
> Enabling these configs shows a circular locking issue[0] and a call
> trace[1].
>
> I don't think the circular locking report is related. I think you're
> correct that the system is not actually hanging; the interactive
> response just makes it seem hung. For example, once the issue starts
> to happen, no other interactive commands can be issued without taking
> at least days (I never waited more than 3 days :-) ). I'm also not
> able to log in or log out while the system "appears" hung. I was able
> to get a sysrq-W while the system was in this state[2].
>
> I think I may have started investigating too deep at first (by
> bisecting, enabling tracing, etc.). I stepped back and looked at the
> high-level stats. The stress-ng test is started with one process for
> each core, and there are 96 of them. I looked at top[3] during a hang,
> and many of the stress-ng processes are running 'R'. However, a
> sysrq-q also shows many stress-ng processes in 'D', uninterruptible
> sleep. What also sticks out to me is that all the stress-ng processes
> are running as root with a priority of 20. Looking back at one of the
> call traces[1], I see jbd2 stuck in an uninterruptible state:
> ...
> [ 4461.908213] task:journal-offline state:D stack:    0 pid:17541 ppid:     1 flags:0x00000226
> ...
>
> The jbd2 kernel thread also runs with a priority of 20[4]. When the
> hang happens, jbd2 is also stuck in an uninterruptible state (as is
> systemd-journal):
>  1521 root      20   0       0      0      0 D   0.0  0.0   4:10.48 jbd2/sda2-8
>  1593 root      19  -1   64692  15832  14512 D   0.0  0.1   0:01.54 systemd-journal
>
> I don't yet know why running the test the same way on a generic kernel
> does not cause this behavior when it does on a preempt-rt kernel.
> Maybe it's a case of priority 'sameness' and not priority inversion :-) ?
>
> I tried to pin all of the stress-ng threads to cores 1-95 and the
> kernel threads to a housekeeping cpu, 0. I recall, though, that there
> are certain kernel threads that need to run on every core, and kworker
> is one of them.
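>
> (A sketch of the kind of explicit placement I mean -- the stress-ng
> options below are only placeholders for whatever the test actually
> runs, and the workqueue knob affects unbound kworkers only, not the
> per-CPU ones:)
>
>   # run the stress-ng workers only on the isolated cores 1-95
>   taskset -c 1-95 stress-ng --hdd 95 --timeout 600s
>
>   # steer unbound workqueues (unbound kworkers) onto cpu 0
>   echo 1 > /sys/devices/virtual/workqueue/cpumask
>
>   # movable kernel threads such as jbd2 can be pinned to cpu 0 as well
>   taskset -pc 0 $(pgrep jbd2/sda2-8)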
> Output from cmdline:
> "BOOT_IMAGE=/boot/vmlinuz-5.15.0-1033-realtime
> root=UUID=3583d8c4-d539-439f-9d50-4341675268cc ro console=tty0
> console=ttyS0,115200 skew_tick=1 isolcpus=managed_irq,domain,1-95
> intel_pstate=disable nosoftlockup tsc=nowatchdog
> crashkernel=0M-2G:128M,2G-6G:256M,6G-8G:512M,8G-:768M"
>
> However, even with this pinning, stress-ng ends up running on cpu 0,
> per the ps output[4]. This may be why it is interfering with jbd2.
>
> I'll see if I can modify the test to run as a non-root user or with a
> lower priority. I could also try bumping the priority of jbd2. Maybe
> one of these would allow the journal to complete its work and the test
> to finish?
>
> Could it be that the system is not hung, it is just waiting to
> complete I/O, which will never happen since the jbd2 threads are
> stuck? In that case, this is not a bug, but a test that is not
> configured correctly for a real-time system. Does that sound
> plausible? If you think that is the case, I'll talk with the bug
> reporter and assist them with running the test properly on a
> real-time system.

Have you tried checking for low memory during the test? Maybe the
system is unable to write because of slow page-cache allocations (see
https://www.socallinuxexpo.org/sites/default/files/presentations/Exploring%20%20Linux%20Memory%20Usage%20and%20%20Disk%20IO%20performance%20version%203.pdf)
or perhaps there is massive inter-NUMA-node rebalancing going on in
such a large system? Turning on CONFIG_PSI is a relatively easy way to
monitor memory problems.

Also, have you tried connecting to systemd-journald with GDB during the
test, to see what it is doing? Or tried calculating whether the
bandwidth to your storage devices is simply maxed out?

--
Alison Chaiken
Aurora Innovation
achaiken@xxxxxxxxxxx
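A minimal sketch of those last checks, assuming CONFIG_PSI=y in the
running kernel and that the procps, sysstat and gdb packages are
installed:

  # memory and I/O pressure stall information, refreshed every second
  watch -n1 'grep -H . /proc/pressure/memory /proc/pressure/io'

  # extended per-device statistics, to see whether sda is saturated
  iostat -xz 1

  # backtraces of all systemd-journald threads (debug symbols help here)
  gdb --batch -ex 'thread apply all bt' -p "$(pidof systemd-journald)"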