Hi Johannes, I'm running OpenWRT on an AP device and noticed following splat when a fair amount of clients was connected to it (roughly 20) Do you think this one is affiliated with this patch? Thanks! [ 45.032106] ------------[ cut here ]------------ [ 45.032441] WARNING: CPU: 2 PID: 1654 at backports-6.1.24/net/mac80211/driver-ops.h:611 __ieee80211_flush_queues+0x168/0x16c [mac80211] [ 45.036137] wlan1.sta4: Failed check-sdata-in-driver check, flags: 0x1 [ 45.048585] Modules linked in: nft_fib_inet nf_flow_table_ipv6 nf_flow_table_ipv4 nf_flow_table_inet iptable_nat ath10k_pci ath10k_core ath xt_state xt_nat xt_conntrack xt_REDIRECT xt_MASQUERADE xt_FLOWOFFLOAD wireguard nft_reject_ipv6 nft_reject_ipv4 nft_reject_inet nft_reject nft_redir nft_quota nft_objref nft_numgen nft_nat nft_masq nft_log nft_limit nft_hash nft_flow_offload nft_fib_ipv6 nft_fib_ipv4 nft_fib nft_ct nft_counter nft_compat nft_chain_nat nf_tables nf_nat nf_flow_table nf_conntrack mac80211 libchacha20poly1305 iptable_mangle iptable_filter ipt_REJECT ip_tables curve25519_neon cfg80211 xt_time xt_tcpudp xt_multiport xt_mark xt_mac xt_limit xt_comment xt_TCPMSS xt_LOG x_tables poly1305_arm nfnetlink nf_reject_ipv6 nf_reject_ipv4 nf_log_syslog nf_defrag_ipv6 nf_defrag_ipv4 mbt libcurve25519_generic libcrc32c hwmon compat chacha_neon ip_gre gre dummy ip6_udp_tunnel udp_tunnel ip_tunnel tun dns_resolver sha512_arm ghash_arm_ce cmac leds_gpio xhci_plat_hcd xhci_pci xhci_hcd [ 45.055080] dwc3 dwc3_qcom gpio_button_hotplug crc32c_generic [ 45.141905] CPU: 2 PID: 1654 Comm: hostapd Not tainted 5.15.127 #0 [ 45.147557] Hardware name: Generic DT based system [ 45.153753] [<c030d480>] (unwind_backtrace) from [<c0309848>] (show_stack+0x10/0x14) [ 45.158523] [<c0309848>] (show_stack) from [<c05dd924>] (dump_stack_lvl+0x40/0x4c) [ 45.166414] [<c05dd924>] (dump_stack_lvl) from [<c0322540>] (__warn+0x8c/0x100) [ 45.173792] [<c0322540>] (__warn) from [<c032261c>] (warn_slowpath_fmt+0x68/0x78) [ 45.180996] [<c032261c>] (warn_slowpath_fmt) from [<bf25b63c>] (__ieee80211_flush_queues+0x168/0x16c [mac80211]) [ 45.188780] [<bf25b63c>] (__ieee80211_flush_queues [mac80211]) from [<bf228ec8>] (sta_set_sinfo+0xd44/0xe64 [mac80211]) [ 45.198951] [<bf228ec8>] (sta_set_sinfo [mac80211]) from [<bf229230>] (sta_info_destroy_addr_bss+0x44/0x5c [mac80211]) [ 45.209456] [<bf229230>] (sta_info_destroy_addr_bss [mac80211]) from [<bf1a5000>] (nl80211_del_station+0xe0/0x2b4 [cfg80211]) [ 45.220271] [<bf1a5000>] (nl80211_del_station [cfg80211]) from [<c080a818>] (genl_rcv_msg+0x154/0x340) [ 45.231538] [<c080a818>] (genl_rcv_msg) from [<c08098e8>] (netlink_rcv_skb+0xb8/0x11c) [ 45.240718] [<c08098e8>] (netlink_rcv_skb) from [<c0809f24>] (genl_rcv+0x28/0x34) [ 45.248619] [<c0809f24>] (genl_rcv) from [<c0808fb0>] (netlink_unicast+0x174/0x26c) [ 45.256170] [<c0808fb0>] (netlink_unicast) from [<c0809284>] (netlink_sendmsg+0x1dc/0x440) [ 45.263641] [<c0809284>] (netlink_sendmsg) from [<c0782878>] (____sys_sendmsg+0x1d0/0x224) [ 45.271974] [<c0782878>] (____sys_sendmsg) from [<c0784308>] (___sys_sendmsg+0xa4/0xdc) [ 45.280220] [<c0784308>] (___sys_sendmsg) from [<c0784470>] (sys_sendmsg+0x44/0x74) [ 45.288110] [<c0784470>] (sys_sendmsg) from [<c0300040>] (ret_fast_syscall+0x0/0x48) [ 45.295740] Exception stack(0xc28dffa8 to 0xc28dfff0) [ 45.303794] ffa0: 00000000 00000000 00000018 beae5c88 00000000 00000000 [ 45.308783] ffc0: 00000000 00000000 0133de50 00000128 00000001 00000004 beae5cd0 b6bbb040 [ 45.316894] ffe0: beae5c30 beae5c20 b6fbeb9c b6fbe00c [ 45.325425] ---[ end trace 328f4e6808844143 ]--- On Tue, Aug 29, 2023 at 8:17 PM Johannes Berg <johannes@xxxxxxxxxxxxxxxx> wrote: > > From: Johannes Berg <johannes.berg@xxxxxxxxx> > > As previously reported by Alexander, whose commit 69403bad97aa > ("wifi: mac80211: sdata can be NULL during AMPDU start") I'm > reverting as part of this commit, there's a race between station > destruction and aggregation setup, where the aggregation setup > can happen while the station is being removed and queue the work > after ieee80211_sta_tear_down_BA_sessions() has already run in > __sta_info_destroy_part1(), and thus the worker will run with a > now freed station. In his case, this manifested in a NULL sdata > pointer, but really there's no guarantee whatsoever. > > The real issue seems to be that it's possible at all to have a > situation where this occurs - we want to stop the BA sessions > when doing _part1, but we cannot be sure, and WLAN_STA_BLOCK_BA > isn't necessarily effective since we don't know that the setup > isn't concurrently running and already got past the check. > > Simply call ieee80211_sta_tear_down_BA_sessions() again in the > second part of station destruction, since at that point really > nothing else can hold a reference to the station any more. > > Also revert the sdata checks since those are just misleading at > this point. > > Signed-off-by: Johannes Berg <johannes.berg@xxxxxxxxx> > --- > net/mac80211/agg-tx.c | 6 +----- > net/mac80211/driver-ops.c | 3 --- > net/mac80211/sta_info.c | 14 ++++++++++++++ > 3 files changed, 15 insertions(+), 8 deletions(-) > > diff --git a/net/mac80211/agg-tx.c b/net/mac80211/agg-tx.c > index 0627abb09f0e..b8a278355e18 100644 > --- a/net/mac80211/agg-tx.c > +++ b/net/mac80211/agg-tx.c > @@ -497,7 +497,7 @@ void ieee80211_tx_ba_session_handle_start(struct sta_info *sta, int tid) > { > struct tid_ampdu_tx *tid_tx; > struct ieee80211_local *local = sta->local; > - struct ieee80211_sub_if_data *sdata; > + struct ieee80211_sub_if_data *sdata = sta->sdata; > struct ieee80211_ampdu_params params = { > .sta = &sta->sta, > .action = IEEE80211_AMPDU_TX_START, > @@ -525,7 +525,6 @@ void ieee80211_tx_ba_session_handle_start(struct sta_info *sta, int tid) > */ > synchronize_net(); > > - sdata = sta->sdata; > params.ssn = sta->tid_seq[tid] >> 4; > ret = drv_ampdu_action(local, sdata, ¶ms); > tid_tx->ssn = params.ssn; > @@ -539,9 +538,6 @@ void ieee80211_tx_ba_session_handle_start(struct sta_info *sta, int tid) > */ > set_bit(HT_AGG_STATE_DRV_READY, &tid_tx->state); > } else if (ret) { > - if (!sdata) > - return; > - > ht_dbg(sdata, > "BA request denied - HW unavailable for %pM tid %d\n", > sta->sta.addr, tid); > diff --git a/net/mac80211/driver-ops.c b/net/mac80211/driver-ops.c > index 919300750527..169dbbca54b6 100644 > --- a/net/mac80211/driver-ops.c > +++ b/net/mac80211/driver-ops.c > @@ -409,9 +409,6 @@ int drv_ampdu_action(struct ieee80211_local *local, > might_sleep(); > lockdep_assert_wiphy(local->hw.wiphy); > > - if (!sdata) > - return -EIO; > - > sdata = get_bss_sdata(sdata); > if (!check_sdata_in_driver(sdata)) > return -EIO; > diff --git a/net/mac80211/sta_info.c b/net/mac80211/sta_info.c > index abcc280acd38..2a61269a4b54 100644 > --- a/net/mac80211/sta_info.c > +++ b/net/mac80211/sta_info.c > @@ -1385,6 +1385,20 @@ static void __sta_info_destroy_part2(struct sta_info *sta, bool recalc) > * after _part1 and before _part2! > */ > > + /* > + * There's a potential race in _part1 where we set WLAN_STA_BLOCK_BA > + * but someone might have just gotten past a check, and not yet into > + * queuing the work/creating the data/etc. > + * > + * Do another round of destruction so that the worker is certainly > + * canceled before we later free the station. > + * > + * Since this is after synchronize_rcu()/synchronize_net() we're now > + * certain that nobody can actually hold a reference to the STA and > + * be calling e.g. ieee80211_start_tx_ba_session(). > + */ > + ieee80211_sta_tear_down_BA_sessions(sta, AGG_STOP_DESTROY_STA); > + > might_sleep(); > lockdep_assert_wiphy(local->hw.wiphy); > > -- > 2.41.0 >