Search Linux Wireless

Re: Kernel deadlock in 6.7.5 + hacks, maybe debugfs related.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, 2024-02-26 at 06:58 -0800, Ben Greear wrote:

> I managed to catch it in a lockdep enabled kernel.  There were no lockdep
> warnings, but more lock info was printed at least:
> 
> Maybe ordering of this one's lock printout means it is 'ip' that is actually holding the rtnl?
> 
> [Seen in context further below]
> Feb 26 06:01:45 ct523c-0b0b kernel: 3 locks held by ip/28125:
> Feb 26 06:01:45 ct523c-0b0b kernel:  #0: ffffffff82a6d588 (rtnl_mutex){+.+.}-{4:4}, at: rtnetlink_rcv_msg+0x13e/0x620
> Feb 26 06:01:45 ct523c-0b0b kernel:  #1: ffff888122e60768 (&rdev->wiphy.mtx){+.+.}-{4:4}, at: ieee80211_stop+0x2d/0x180 [mac80211]
> Feb 26 06:01:45 ct523c-0b0b kernel:  #2: ffff88812661d768 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: simple_recursive_removal+0x189/0x2b0

Yes, probably.

> Feb 26 06:01:45 ct523c-0b0b kernel: INFO: task kworker/6:3:542 blocked for more than 180 seconds.
> Feb 26 06:01:45 ct523c-0b0b kernel:       Not tainted 6.7.5+ #1
> Feb 26 06:01:45 ct523c-0b0b kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Feb 26 06:01:45 ct523c-0b0b kernel: task:kworker/6:3     state:D stack:0     pid:542   tgid:542   ppid:2      flags:0x00004000
> Feb 26 06:01:45 ct523c-0b0b kernel: Workqueue: ipv6_addrconf addrconf_verify_work
> Feb 26 06:01:45 ct523c-0b0b kernel: Call Trace:
> Feb 26 06:01:45 ct523c-0b0b kernel:  <TASK>
> Feb 26 06:01:45 ct523c-0b0b kernel:  __schedule+0x42c/0xde0
> Feb 26 06:01:45 ct523c-0b0b kernel:  schedule+0x3c/0x120
> Feb 26 06:01:45 ct523c-0b0b kernel:  schedule_preempt_disabled+0xc/0x20
> Feb 26 06:01:45 ct523c-0b0b kernel:  __mutex_lock+0x5c4/0xc00
> Feb 26 06:01:45 ct523c-0b0b kernel:  ? lock_acquire+0xc6/0x2b0
> Feb 26 06:01:45 ct523c-0b0b kernel:  ? addrconf_verify_work+0x9/0x20
> Feb 26 06:01:45 ct523c-0b0b kernel:  ? process_one_work+0x18f/0x4e0
> Feb 26 06:01:45 ct523c-0b0b kernel:  ? addrconf_verify_work+0x9/0x20
> Feb 26 06:01:45 ct523c-0b0b kernel:  ? process_one_work+0x18f/0x4e0
> Feb 26 06:01:45 ct523c-0b0b kernel:  addrconf_verify_work+0x9/0x20

Lots of things like this are just blocked on RTNL.

> Feb 26 06:01:45 ct523c-0b0b kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Feb 26 06:01:45 ct523c-0b0b kernel: task:kworker/u16:1   state:D stack:0     pid:8298  tgid:8298  ppid:2      flags:0x00004000
> Feb 26 06:01:45 ct523c-0b0b kernel: Workqueue: events_unbound cfg80211_wiphy_work [cfg80211]
> Feb 26 06:01:45 ct523c-0b0b kernel: Call Trace:
> Feb 26 06:01:45 ct523c-0b0b kernel:  <TASK>
> Feb 26 06:01:45 ct523c-0b0b kernel:  __schedule+0x42c/0xde0
> Feb 26 06:01:45 ct523c-0b0b kernel:  schedule+0x3c/0x120
> Feb 26 06:01:45 ct523c-0b0b kernel:  schedule_preempt_disabled+0xc/0x20
> Feb 26 06:01:45 ct523c-0b0b kernel:  __mutex_lock+0x5c4/0xc00
> Feb 26 06:01:45 ct523c-0b0b kernel:  ? cfg80211_wiphy_work+0x1a/0xc0 [cfg80211]
> Feb 26 06:01:45 ct523c-0b0b kernel:  ? process_one_work+0x18f/0x4e0
> Feb 26 06:01:45 ct523c-0b0b kernel:  ? cfg80211_wiphy_work+0x1a/0xc0 [cfg80211]
> Feb 26 06:01:45 ct523c-0b0b kernel:  cfg80211_wiphy_work+0x1a/0xc0 [cfg80211]

this is blocked on wiphy mutex.

> Feb 26 06:01:45 ct523c-0b0b kernel: INFO: task ip:28125 blocked for more than 180 seconds.
> Feb 26 06:01:45 ct523c-0b0b kernel:       Not tainted 6.7.5+ #1
> Feb 26 06:01:45 ct523c-0b0b kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Feb 26 06:01:45 ct523c-0b0b kernel: task:ip              state:D stack:0     pid:28125 tgid:28125 ppid:3604   flags:0x00004002
> Feb 26 06:01:45 ct523c-0b0b kernel: Call Trace:
> Feb 26 06:01:45 ct523c-0b0b kernel:  <TASK>
> Feb 26 06:01:45 ct523c-0b0b kernel:  __schedule+0x42c/0xde0
> Feb 26 06:01:45 ct523c-0b0b kernel:  schedule+0x3c/0x120
> Feb 26 06:01:45 ct523c-0b0b kernel:  schedule_timeout+0x19c/0x1b0
> Feb 26 06:01:45 ct523c-0b0b kernel:  ? mark_held_locks+0x49/0x70
> Feb 26 06:01:45 ct523c-0b0b kernel:  __wait_for_common+0xba/0x1d0
> Feb 26 06:01:45 ct523c-0b0b kernel:  ? usleep_range_state+0xb0/0xb0
> Feb 26 06:01:45 ct523c-0b0b kernel:  remove_one+0x6b/0x100
> Feb 26 06:01:45 ct523c-0b0b kernel:  simple_recursive_removal+0x20c/0x2b0
> Feb 26 06:01:45 ct523c-0b0b kernel:  ? start_creating.part.0+0x170/0x170
> Feb 26 06:01:45 ct523c-0b0b kernel:  debugfs_remove+0x3b/0x60
> Feb 26 06:01:45 ct523c-0b0b kernel:  ieee80211_debugfs_recreate_netdev+0x30/0xac0 [mac80211]

This seems to be blocked on the wait_for_completion()?! Which means it
should have actually triggered the c->cancel() first, but ... if
wiphy_locked_debugfs_read_cancel() had actually been called, I'm not
sure why we'd be stuck here?

Then again, I'm looking at the code now and not sure why I put this

+       /* if we're already removing wake it up to cancel */
+       if (d_unlinked(dentry))
+               complete(&fsd->active_users_drained);

into debugfs_enter_cancellation()...




> Feb 26 06:01:45 ct523c-0b0b kernel:
>                                      Showing all locks held in the system:

This isn't always clear on holding vs. acquiring, FWIW.

> Feb 26 06:01:45 ct523c-0b0b kernel: 3 locks held by kworker/6:3/542:
> Feb 26 06:01:45 ct523c-0b0b kernel:  #0: ffff888109b33b38 ((wq_completion)ipv6_addrconf){+.+.}-{0:0}, at: process_one_work+0x18f/0x4e0
> Feb 26 06:01:45 ct523c-0b0b kernel:  #1: ffffc90000f77e58 ((work_completion)(&(&net->ipv6.addr_chk_work)->work)){+.+.}-{0:0}, at: process_one_work+0x18f/0x4e0
> Feb 26 06:01:45 ct523c-0b0b kernel:  #2: ffffffff82a6d588 (rtnl_mutex){+.+.}-{4:4}, at: addrconf_verify_work+0x9/0x20

See this, it's actually stuck on acquiring the RTNL.

> Feb 26 06:01:45 ct523c-0b0b kernel: 3 locks held by kworker/u16:1/8298:
> Feb 26 06:01:45 ct523c-0b0b kernel:  #0: ffff888108058d38 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x18f/0x4e0
> Feb 26 06:01:45 ct523c-0b0b kernel:  #1: ffffc9000abcfe58 ((work_completion)(&rdev->wiphy_work)){+.+.}-{0:0}, at: process_one_work+0x18f/0x4e0
> Feb 26 06:01:45 ct523c-0b0b kernel:  #2: ffff888122e60768 (&rdev->wiphy.mtx){+.+.}-{4:4}, at: cfg80211_wiphy_work+0x1a/0xc0 [cfg80211]

And this is stuck trying to acquire the wiphy mutex.

> Feb 26 06:01:45 ct523c-0b0b kernel: 3 locks held by ip/28125:
> Feb 26 06:01:45 ct523c-0b0b kernel:  #0: ffffffff82a6d588 (rtnl_mutex){+.+.}-{4:4}, at: rtnetlink_rcv_msg+0x13e/0x620
> Feb 26 06:01:45 ct523c-0b0b kernel:  #1: ffff888122e60768 (&rdev->wiphy.mtx){+.+.}-{4:4}, at: ieee80211_stop+0x2d/0x180 [mac80211]
> Feb 26 06:01:45 ct523c-0b0b kernel:  #2: ffff88812661d768 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: simple_recursive_removal+0x189/0x2b0

This has them all.

johannes





[Index of Archives]     [Linux Host AP]     [ATH6KL]     [Linux Wireless Personal Area Network]     [Linux Bluetooth]     [Wireless Regulations]     [Linux Netdev]     [Kernel Newbies]     [Linux Kernel]     [IDE]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite Hiking]     [MIPS Linux]     [ARM Linux]     [Linux RAID]

  Powered by Linux