Search Linux Wireless

Re: Kernel deadlock in 6.7.5 + hacks, maybe debugfs related.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, 2024-02-23 at 09:08 -0800, Ben Greear wrote:

> [905261.271981] INFO: task systemd-resolve:948 blocked for more than 180 seconds.
> [905261.277834]       Tainted: G           O       6.7.3+ #34
> [905261.281939] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [905261.288474] task:systemd-resolve state:D stack:0     pid:948   tgid:948   ppid:1      flags:0x00000002

I don't know what that's doing, but

> [905261.288477] Call Trace:
> [905261.288478]  <TASK>
> [905261.288481]  __schedule+0x377/0xa10
> [905261.288486]  ? _copy_to_user+0x20/0x30
> [905261.288507]  schedule+0x29/0xc0
> [905261.288509]  schedule_preempt_disabled+0xc/0x20
> [905261.288511]  __mutex_lock.constprop.0+0x34d/0x760
> [905261.288514]  do_ip_setsockopt+0x4a/0x1150

I'm going to assume that this is RTNL, the socket lock isn't technically
a mutex, and doesn't seem reasonable in this context.

> [905261.337114] task:xdg-desktop-por state:D stack:0     pid:2485  tgid:2485  ppid:1742   flags:0x00000002
> [905261.337116] Call Trace:
> [905261.337117]  <TASK>
> [905261.337118]  __schedule+0x377/0xa10
> [905261.337120]  ? poll_freewait+0x40/0xa0
> [905261.337122]  schedule+0x29/0xc0
> [905261.337138]  schedule_preempt_disabled+0xc/0x20
> [905261.337139]  __mutex_lock.constprop.0+0x34d/0x760
> [905261.337142]  __netlink_dump_start+0x6c/0x280
> [905261.337143]  ? rtnl_mdb_dump+0x160/0x160
> [905261.337146]  rtnetlink_rcv_msg+0x267/0x3b0

A lot of other things are waiting for the RTNL too.

> [905261.353300] INFO: task btserver:3593 blocked for more than 180 seconds.
> [905261.358620]       Tainted: G           O       6.7.3+ #34
> [905261.362729] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [905261.369263] task:btserver        state:D stack:0     pid:3593  tgid:3593  ppid:2866   flags:0x00000002
> [905261.369265] Call Trace:
> [905261.369266]  <TASK>
> [905261.369267]  __schedule+0x377/0xa10
> [905261.369268]  ? sched_clock+0xc/0x20
> [905261.369270]  schedule+0x29/0xc0
> [905261.369271]  schedule_timeout+0x136/0x150
> [905261.369274]  wait_for_completion+0x7d/0x130
> [905261.369292]  wiphy_locked_debugfs_read+0x114/0x180 [cfg80211]
> [905261.369363]  ? wiphy_locked_debugfs_write_work+0x30/0x30 [cfg80211]
> [905261.369400]  ? rts_threshold_read+0x80/0x80 [cfg80211]
> [905261.369438]  ? ieee80211_link_sta_debugfs_drv_remove+0x40/0x40 [mac80211]
> [905261.369516]  ieee80211_if_read_sdata+0x55/0x80 [mac80211]

This you saw - it's waiting for the completion of the read of the ...
not sure which file? Any chance you could find out which fd this process
has open when it happens?

> [905261.369553]  ? ieee80211_link_sta_debugfs_drv_remove+0x40/0x40 [mac80211]
> [905261.369587]  ? ieee80211_if_fmt_dot11MeshMaxRetries+0x20/0x20 [mac80211]

I don't think we can trust any of the lines marked "?", they're probably
old prior stack contents ... these don't show up in the call stack of
ieee80211_if_read_sdata() directly, after all.

> [905261.402921] INFO: task kworker/u16:17:2582370 blocked for more than 180 seconds.
> [905261.409024]       Tainted: G           O       6.7.3+ #34
> [905261.413157] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [905261.419735] task:kworker/u16:17  state:D stack:0     pid:2582370 tgid:2582370 ppid:2      flags:0x00004000
> [905261.419738] Workqueue: events_unbound cfg80211_wiphy_work [cfg80211]
> [905261.419788] Call Trace:
> [905261.419788]  <TASK>
> [905261.419789]  __schedule+0x377/0xa10
> [905261.419792]  schedule+0x29/0xc0
> [905261.419793]  schedule_preempt_disabled+0xc/0x20
> [905261.419795]  __mutex_lock.constprop.0+0x34d/0x760
> [905261.419797]  cfg80211_wiphy_work+0x15/0xb0 [cfg80211]
> [905261.419827]  process_one_work+0x15d/0x330

This is a bit interesting - why is the wiphy work stuck trying to
acquire the wiphy mutex? But nothing related to the RTNL.

This does explain why the debugfs is stuck though. So maybe doesn't
matter knowing which file got stuck ...

> [905261.436449] Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings

I think we're missing the process that's actually stuck though, maybe
here or later?

Seems like there must be a process somewhere that holds RTNL and wiphy
mutex, but didn't get listed here. Maybe it's live-locked? Maybe it's
actually a netdev removal (which would hold both those locks), but then
why is it stuck and not shown?

johannes





[Index of Archives]     [Linux Host AP]     [ATH6KL]     [Linux Wireless Personal Area Network]     [Linux Bluetooth]     [Wireless Regulations]     [Linux Netdev]     [Kernel Newbies]     [Linux Kernel]     [IDE]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite Hiking]     [MIPS Linux]     [ARM Linux]     [Linux RAID]

  Powered by Linux