On 1/18/23 13:52, Joseph Salisbury wrote:
On 1/4/23 15:47, Joseph Salisbury wrote:
On 12/15/22 18:39, Joseph Salisbury wrote:
On 12/15/22 13:39, Joseph Salisbury wrote:
Hello,
A bug has been reported[0] against a linux-5.15.y based Ubuntu
kernel that has the 5.15.79-rt54 real-time patch set applied.
This bug is causing a system hang when running disk tests with the
stress-ng[1] test suite. A 'hung task' call trace is being
generated[2] just prior to the system hanging. It appears the hang
is happening during a call to schedule.
To see if this was a regression, I also tested prior real-time
patch sets, going as far back to rt38 and all versions exhibit this
bug. I will test back further to see if this is in fact a
regression and a kernel bisect can be done. I also confirmed the
same kernel without the rt54 patch set applied does not exhibit
this bug.
I will continue to debug this issue, but I wanted to send this
message upstream in case anyone else runs into it. Any specific
tests, tracing or debugging steps would be greatly appreciated.
Otherwise, I'll report back to this thread with any findings.
Thanks,
Joe
[0] https://bugs.launchpad.net/ubuntu-realtime/+bug/1998536
[1] https://github.com/ColinIanKing/stress-ng
[2] https://launchpadlibrarian.net/637058129/dmesg_dl110_rt_kernel.txt
I should have also mentioned, I will also test the latest available
patch set (6.1-rc7-rt5) to see if it also exhibits the bug.
This bug/system hang does still happen with the v6.1-rc7 kernel and
the rt5 real-time patch. I also confirmed the bug/system hang does
not happen with just v6.1-rc7 without the real-time patch applied.
I will bisect through the rt patches to see if I can find the
specific patch(s) that causes the bug. I will use the v5.15 kernel
and it's rt patches, since v5.15 has the patches broken out
individually, and they can be applied one at a time with 'git am'.
I'll provide further updates during the bisect.
Thanks,
Joe
I've been bisecting this bug down. It appears the bug has existed back
to v5.15-rc3-rt5. I'll test older versions, but it may be that this
bug has always existed with PREEMPT_RT enabled.
I will test some of the real-time patch releases prior to v5.15 to
confirm this issue has always existed or not.
This is the current message I see on the console when the system locks
up:
[ 4363.605933] INFO: task journal-offline:23780 blocked for more than
622 seconds.
[ 4363.605936] Not tainted 5.15.85-051585-realtime #202301172024
[ 4363.605938] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 4363.606051] INFO: task stress-ng:23298 blocked for more than 622
seconds.
[ 4363.606052] Not tainted 5.15.85-051585-realtime #202301172024
[ 4363.606053] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 4363.606121] INFO: task stress-ng:23299 blocked for more than 622
seconds.
[ 4363.606122] Not tainted 5.15.85-051585-realtime #202301172024
[ 4363.606122] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 4363.606183] INFO: task stress-ng:23300 blocked for more than 622
seconds.
[ 4363.606184] Not tainted 5.15.85-051585-realtime #202301172024
[ 4363.606184] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 4363.606235] INFO: task stress-ng:23302 blocked for more than 622
seconds.
[ 4363.606236] Not tainted 5.15.85-051585-realtime #202301172024
[ 4363.606237] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 4363.606290] INFO: task stress-ng:23305 blocked for more than 622
seconds.
[ 4363.606291] Not tainted 5.15.85-051585-realtime #202301172024
[ 4363.606291] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 4363.606346] INFO: task stress-ng:23309 blocked for more than 622
seconds.
[ 4363.606347] Not tainted 5.15.85-051585-realtime #202301172024
[ 4363.606347] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 4363.606393] INFO: task stress-ng:23314 blocked for more than 622
seconds.
[ 4363.606394] Not tainted 5.15.85-051585-realtime #202301172024
[ 4363.606394] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 4363.606440] INFO: task stress-ng:23319 blocked for more than 622
seconds.
[ 4363.606440] Not tainted 5.15.85-051585-realtime #202301172024
[ 4363.606441] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 4363.606484] INFO: task stress-ng:23323 blocked for more than 622
seconds.
[ 4363.606485] Not tainted 5.15.85-051585-realtime #202301172024
[ 4363.606486] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 4708.553134] systemd[1]: Failed to start Journal Service.
[ 4770.052389] systemd[1]: systemd-timesyncd.service: Watchdog timeout
(limit 3min)!
I'll add more details to this thread as I continue.
A bisect has not provided additional detail. This issue does not appear
to be a regression and appears to have always existed.
I was able to get additional debug info, by enabling
CONFIG_DEBUG_PREEMPT, CONFIG_PROVE_LOCKING and CONFIG_JBD2_DEBUG.
Enabling these configs shows a circular locking issue[0] and a call
trace[1].
I will analyze this data for the next steps. Any other suggestions
would be greatly appreciated.
Thanks,
Joe
[0] https://launchpadlibrarian.net/653810449/locking_issue.txt
[1] https://launchpadlibrarian.net/653810490/call_trace.txt