Re: WARNING at net/mac80211/sta_info.c:1057 (__sta_info_destroy_part2())

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Wed, 11 Sep 2019 12:58:29 +0100

On Wed, Sep 11, 2019 at 11:26 AM Johannes Berg
<johannes@xxxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> > So I'm at LCA
>
> When did LCA move to Portugal? ;-))

Heh. I may be jetlagged and not thinking straight.  LCA/LPC ;)

> > Previous resume looks normal:
> > [snip]
> >    wlp2s0: Limiting TX power to 23 (23 - 0) dBm as advertised by
> > 54:ec:2f:05:70:2c
>
> Is that the message you meant?
>
> > I say _almost_, because I don't see the "No TX power limit" for the
> > country lookup (yes, Portugal) this time?
>
> because here you had it too, just a bit earlier. It usually comes when a
> beacon is received the first time, which depends on the AP timing.

Duh. I'm blind and didn't notice, because I was expecting it in the same order.

And I didn't think about it or double-check, because the errors that
then followed later _looked_ like that TX power failing that I thought
hadn't happened.

> I don't _think_ any of the above would be a reason to disconnect, but it
> clearly looks like the device got stuck at this point, since everything
> just fails afterwards.

Yeah, maybe the power stuff was just another effect of things getting
stuck, rather than the reason for it getting stuck.

So I probably mis-attributed the cause.

> >    WARNING: CPU: 4 PID: 1246 at net/mac80211/sta_info.c:1057
> > __sta_info_destroy_part2+0x147/0x150 [mac80211]
>
> Not really a surprise. Perhaps we shouldn't even WARN_ON() this, if the
> driver is stuck completely and returning errors to everything that
> doesn't help so much.
>
> Then again, the stack trace was helpful this time:
>
> >     ieee80211_set_disassoc+0xc2/0x590 [mac80211]
> >     ieee80211_mgd_deauth.cold+0x4a/0x1b8 [mac80211]
> >     cfg80211_mlme_deauth+0xb3/0x1d0 [cfg80211]
> >     cfg80211_mlme_down+0x66/0x90 [cfg80211]
> >     cfg80211_disconnect+0x129/0x1e0 [cfg80211]
> >     cfg80211_leave+0x27/0x40 [cfg80211]
> >     cfg80211_netdev_notifier_call+0x1a7/0x4e0 [cfg80211]
> >     notifier_call_chain+0x4c/0x70
> >     __dev_close_many+0x57/0x100
> >     dev_close_many+0x8d/0x140
> >     dev_close.part.0+0x44/0x70
> >     cfg80211_shutdown_all_interfaces+0x71/0xd0 [cfg80211]
> >     cfg80211_rfkill_set_block+0x22/0x30 [cfg80211]
> >     rfkill_set_block+0x92/0x140 [rfkill]
> >     rfkill_fop_write+0x11f/0x1c0 [rfkill]
> >     vfs_write+0xb6/0x1a0
>
>
> Since we see that something actually did an rfkill operation. Did you
> push a button there?

No, I tried to turn off and turn on Wifi manually (no button, just the
settings panel).

I didn't notice the WARN_ON(), I just noticed that there was no
networking, and "turn it off and on again" is obviously the first
thing to try ;)

> You don't happen to have timing information on these logs, perhaps
> recorded in the logfile/journal?

Sure. I cleaned up the logs to not spam people with lots of illegible
data, but it's all in the journal log.

Rough timeline:

Sep 11 03:40:00 xps13 kernel: PM: suspend entry (s2idle)
Sep 11 03:40:00 xps13 kernel: Filesystems sync: 0.028 seconds
...
Sep 11 10:13:14 xps13 kernel: Restarting tasks ... done.
Sep 11 10:13:14 xps13 kernel: PM: suspend exit
Sep 11 10:13:14 xps13 kernel: ath10k_pci 0000:02:00.0: UART prints enabled
Sep 11 10:13:14 xps13 kernel: ath10k_pci 0000:02:00.0: unsupported HTC
service id: 1536
Sep 11 10:13:23 xps13 kernel: wlp2s0: authenticate with 54:ec:2f:05:70:2c
Sep 11 10:13:23 xps13 kernel: wlp2s0: send auth to 54:ec:2f:05:70:2c (try 1/3)
Sep 11 10:13:23 xps13 kernel: wlp2s0: authenticated
Sep 11 10:13:23 xps13 kernel: wlp2s0: Limiting TX power to 23 (23 - 0)
dBm as advertised by 54:ec:2f:05:70:2c
...
Sep 11 10:13:23 xps13 kernel: ath: regdomain 0x826c dynamically
updated by country element
Sep 11 10:13:24 xps13 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): wlp2s0:
link becomes ready
Sep 11 10:27:07 xps13 kernel: ath10k_pci 0000:02:00.0: wmi command
16387 timeout, restarting hardware
...
Sep 11 10:27:07 xps13 kernel: ath10k_pci 0000:02:00.0: failed to read
hi_board_data address: -16
Sep 11 10:27:10 xps13 kernel: ath10k_pci 0000:02:00.0: failed to
receive initialized event from target: 00000000
Sep 11 10:27:13 xps13 kernel: ath10k_pci 0000:02:00.0: failed to
receive initialized event from target: 00000000
...
Sep 11 10:27:13 xps13 kernel: WARNING: CPU: 4 PID: 1246 at
net/mac80211/sta_info.c:1057 __sta_info_destroy_part2+0x147/0x150
[mac80211]

but if you want full logs I can send them in private to you.

I do suspect it's atheros and suspend/resume or something. The
wireless clearly worked for a while after the resume, but then at some
point it stopped.

> It seems odd to me, since the RTNL is acquired by
> cfg80211_rfkill_set_block() and that doesn't even have an error path, it
> just does
>         rtnl_lock();
>         cfg80211_shutdown_all_interfaces(&rdev->wiphy);
>         rtnl_unlock();
>
> The only explanation I therefore have is that something is just taking
> *forever* in that code path, hence my question about timing information
> on the logs.

Yeah, maybe it would time out everything eventually. But not for a
long time. It hadn't cleared up by

  Sep 11 10:36:21 xps13 gnome-session-f[6837]: gnome-session-failed:
Fatal IO error 0 (Success) on X server :0.

which is when I shut down the machine (and had to then force a hard
power-off because the shutdown wanted things that needed the rtnl_lock
to go away)

                 Linus