On Fri, 2024-02-23 at 09:08 -0800, Ben Greear wrote: > [905261.271981] INFO: task systemd-resolve:948 blocked for more than 180 seconds. > [905261.277834] Tainted: G O 6.7.3+ #34 > [905261.281939] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [905261.288474] task:systemd-resolve state:D stack:0 pid:948 tgid:948 ppid:1 flags:0x00000002 I don't know what that's doing, but > [905261.288477] Call Trace: > [905261.288478] <TASK> > [905261.288481] __schedule+0x377/0xa10 > [905261.288486] ? _copy_to_user+0x20/0x30 > [905261.288507] schedule+0x29/0xc0 > [905261.288509] schedule_preempt_disabled+0xc/0x20 > [905261.288511] __mutex_lock.constprop.0+0x34d/0x760 > [905261.288514] do_ip_setsockopt+0x4a/0x1150 I'm going to assume that this is RTNL, the socket lock isn't technically a mutex, and doesn't seem reasonable in this context. > [905261.337114] task:xdg-desktop-por state:D stack:0 pid:2485 tgid:2485 ppid:1742 flags:0x00000002 > [905261.337116] Call Trace: > [905261.337117] <TASK> > [905261.337118] __schedule+0x377/0xa10 > [905261.337120] ? poll_freewait+0x40/0xa0 > [905261.337122] schedule+0x29/0xc0 > [905261.337138] schedule_preempt_disabled+0xc/0x20 > [905261.337139] __mutex_lock.constprop.0+0x34d/0x760 > [905261.337142] __netlink_dump_start+0x6c/0x280 > [905261.337143] ? rtnl_mdb_dump+0x160/0x160 > [905261.337146] rtnetlink_rcv_msg+0x267/0x3b0 A lot of other things are waiting for the RTNL too. > [905261.353300] INFO: task btserver:3593 blocked for more than 180 seconds. > [905261.358620] Tainted: G O 6.7.3+ #34 > [905261.362729] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [905261.369263] task:btserver state:D stack:0 pid:3593 tgid:3593 ppid:2866 flags:0x00000002 > [905261.369265] Call Trace: > [905261.369266] <TASK> > [905261.369267] __schedule+0x377/0xa10 > [905261.369268] ? sched_clock+0xc/0x20 > [905261.369270] schedule+0x29/0xc0 > [905261.369271] schedule_timeout+0x136/0x150 > [905261.369274] wait_for_completion+0x7d/0x130 > [905261.369292] wiphy_locked_debugfs_read+0x114/0x180 [cfg80211] > [905261.369363] ? wiphy_locked_debugfs_write_work+0x30/0x30 [cfg80211] > [905261.369400] ? rts_threshold_read+0x80/0x80 [cfg80211] > [905261.369438] ? ieee80211_link_sta_debugfs_drv_remove+0x40/0x40 [mac80211] > [905261.369516] ieee80211_if_read_sdata+0x55/0x80 [mac80211] This you saw - it's waiting for the completion of the read of the ... not sure which file? Any chance you could find out which fd this process has open when it happens? > [905261.369553] ? ieee80211_link_sta_debugfs_drv_remove+0x40/0x40 [mac80211] > [905261.369587] ? ieee80211_if_fmt_dot11MeshMaxRetries+0x20/0x20 [mac80211] I don't think we can trust any of the lines marked "?", they're probably old prior stack contents ... these don't show up in the call stack of ieee80211_if_read_sdata() directly, after all. > [905261.402921] INFO: task kworker/u16:17:2582370 blocked for more than 180 seconds. > [905261.409024] Tainted: G O 6.7.3+ #34 > [905261.413157] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [905261.419735] task:kworker/u16:17 state:D stack:0 pid:2582370 tgid:2582370 ppid:2 flags:0x00004000 > [905261.419738] Workqueue: events_unbound cfg80211_wiphy_work [cfg80211] > [905261.419788] Call Trace: > [905261.419788] <TASK> > [905261.419789] __schedule+0x377/0xa10 > [905261.419792] schedule+0x29/0xc0 > [905261.419793] schedule_preempt_disabled+0xc/0x20 > [905261.419795] __mutex_lock.constprop.0+0x34d/0x760 > [905261.419797] cfg80211_wiphy_work+0x15/0xb0 [cfg80211] > [905261.419827] process_one_work+0x15d/0x330 This is a bit interesting - why is the wiphy work stuck trying to acquire the wiphy mutex? But nothing related to the RTNL. This does explain why the debugfs is stuck though. So maybe doesn't matter knowing which file got stuck ... > [905261.436449] Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings I think we're missing the process that's actually stuck though, maybe here or later? Seems like there must be a process somewhere that holds RTNL and wiphy mutex, but didn't get listed here. Maybe it's live-locked? Maybe it's actually a netdev removal (which would hold both those locks), but then why is it stuck and not shown? johannes