Re: System Hang With 5.15.79-rt54 Patch Set

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On 1/18/23 13:52, Joseph Salisbury wrote:


On 1/4/23 15:47, Joseph Salisbury wrote:


On 12/15/22 18:39, Joseph Salisbury wrote:


On 12/15/22 13:39, Joseph Salisbury wrote:
Hello,

A bug has been reported[0] against a linux-5.15.y based Ubuntu kernel that has the 5.15.79-rt54 real-time patch set applied.

This bug is causing a system hang when running disk tests with the stress-ng[1] test suite.  A 'hung task' call trace is being generated[2] just prior to the system hanging.  It appears the hang is happening during a call to schedule.

To see if this was a regression, I also tested prior real-time patch sets, going as far back to rt38 and all versions exhibit this bug. I will test back further to see if this is in fact a regression and a kernel bisect can be done.  I also confirmed the same kernel without the rt54 patch set applied does not exhibit this bug.

I will continue to debug this issue, but I wanted to send this message upstream in case anyone else runs into it.  Any specific tests, tracing or debugging steps would be greatly appreciated. Otherwise, I'll report back to this thread with any findings.

Thanks,

Joe


[0] https://bugs.launchpad.net/ubuntu-realtime/+bug/1998536
[1] https://github.com/ColinIanKing/stress-ng
[2] https://launchpadlibrarian.net/637058129/dmesg_dl110_rt_kernel.txt

I should have also mentioned, I will also test the latest available patch set (6.1-rc7-rt5) to see if it also exhibits the bug.


This bug/system hang does still happen with the v6.1-rc7 kernel and the rt5 real-time patch.  I also confirmed the bug/system hang does not happen with just v6.1-rc7 without the real-time patch applied.

I will bisect through the rt patches to see if I can find the specific patch(s) that causes the bug.  I will use the v5.15 kernel and it's rt patches, since v5.15 has the patches broken out individually, and they can be applied one at a time with 'git am'.

I'll provide further updates during the bisect.

Thanks,

Joe



I've been bisecting this bug down. It appears the bug has existed back to v5.15-rc3-rt5. I'll test older versions, but it may be that this bug has always existed with PREEMPT_RT enabled.

I will test some of the real-time patch releases prior to v5.15 to confirm this issue has always existed or not.

This is the current message I see on the console when the system locks up:

[ 4363.605933] INFO: task journal-offline:23780 blocked for more than 622 seconds.
[ 4363.605936]       Not tainted 5.15.85-051585-realtime #202301172024
[ 4363.605938] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 4363.606051] INFO: task stress-ng:23298 blocked for more than 622 seconds.
[ 4363.606052]       Not tainted 5.15.85-051585-realtime #202301172024
[ 4363.606053] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 4363.606121] INFO: task stress-ng:23299 blocked for more than 622 seconds.
[ 4363.606122]       Not tainted 5.15.85-051585-realtime #202301172024
[ 4363.606122] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 4363.606183] INFO: task stress-ng:23300 blocked for more than 622 seconds.
[ 4363.606184]       Not tainted 5.15.85-051585-realtime #202301172024
[ 4363.606184] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 4363.606235] INFO: task stress-ng:23302 blocked for more than 622 seconds.
[ 4363.606236]       Not tainted 5.15.85-051585-realtime #202301172024
[ 4363.606237] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 4363.606290] INFO: task stress-ng:23305 blocked for more than 622 seconds.
[ 4363.606291]       Not tainted 5.15.85-051585-realtime #202301172024
[ 4363.606291] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 4363.606346] INFO: task stress-ng:23309 blocked for more than 622 seconds.
[ 4363.606347]       Not tainted 5.15.85-051585-realtime #202301172024
[ 4363.606347] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 4363.606393] INFO: task stress-ng:23314 blocked for more than 622 seconds.
[ 4363.606394]       Not tainted 5.15.85-051585-realtime #202301172024
[ 4363.606394] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 4363.606440] INFO: task stress-ng:23319 blocked for more than 622 seconds.
[ 4363.606440]       Not tainted 5.15.85-051585-realtime #202301172024
[ 4363.606441] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 4363.606484] INFO: task stress-ng:23323 blocked for more than 622 seconds.
[ 4363.606485]       Not tainted 5.15.85-051585-realtime #202301172024
[ 4363.606486] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4708.553134] systemd[1]: Failed to start Journal Service.
[ 4770.052389] systemd[1]: systemd-timesyncd.service: Watchdog timeout (limit 3min)!


I'll add more details to this thread as I continue.





A bisect has not provided additional detail. This issue does not appear to be a regression and appears to have always existed.

I was able to get additional debug info, by enabling CONFIG_DEBUG_PREEMPT, CONFIG_PROVE_LOCKING and CONFIG_JBD2_DEBUG. Enabling these configs shows a circular locking issue[0] and a call trace[1].

I will analyze this data for the next steps.  Any other suggestions would be greatly appreciated.

Thanks,

Joe

[0] https://launchpadlibrarian.net/653810449/locking_issue.txt
[1] https://launchpadlibrarian.net/653810490/call_trace.txt





[Index of Archives]     [RT Stable]     [Kernel Newbies]     [IDE]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux ATA RAID]     [Samba]     [Video 4 Linux]     [Device Mapper]

  Powered by Linux