On Thu, 28 Feb 2019 at 16:59, <greearb@xxxxxxxxxxxxxxx> wrote: > > From: Ben Greear <greearb@xxxxxxxxxxxxxxx> > > I often saw the ath10k-ct wave-1 firmware spit DMA errors and > hang the entire system, requiring a hard power-cycle to revoer. > > It appears the issue is that there is no beacon-tx callback in > stock firmware, so the driver can delete the beacon DMA buffer > while firmware is still trying to access it. > > So, wave-1 ath10k-ct firmware now sends a beacon-tx-complete > wmi message and that allows the driver to safely know when it > can clean up the buffer. > > Signed-off-by: Ben Greear <greearb@xxxxxxxxxxxxxxx> > --- > > NOTE: This will not apply or work in upstream kernels since the > rest of the CT fw support will not be accepted. But, I'd appreciate > any technical feedback on this in case I missed any corner cases > on locking or similar. For the record this patch seems to be based on code with "ath10k: Free beacon buf later in vdev teardown" included. [...] > diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c > index 154dcdabc48a..02a8efa2e783 100644 > --- a/drivers/net/wireless/ath/ath10k/mac.c > +++ b/drivers/net/wireless/ath/ath10k/mac.c [...] > @@ -6147,6 +6151,16 @@ static void ath10k_remove_interface(struct ieee80211_hw *hw, > ath10k_warn(ar, "failed to stop spectral for vdev %i: %d\n", > arvif->vdev_id, ret); > > + if (test_bit(ATH10K_FW_FEATURE_BEACON_TX_CB_CT, > + ar->running_fw->fw_file.fw_features)) { > + int time_left; > + > + time_left = wait_for_completion_timeout(&arvif->beacon_tx_done, (3 * HZ)); > + if (!time_left) > + ath10k_warn(ar, "WARNING: failed to wait for beacon tx callback for vdev %i: %d\n", > + arvif->vdev_id, ret); > + } I think this can race against wmi tx credits replenishment and maybe other wmi commands if they get issued (not many of them I suppose though). The ordering would need to be something like this: cpu0 cpu1 ath10k_wmi_op_ep_tx_credits ieee80211_iterate_active_interfaces_atomic ath10k_wmi_beacon_send_ref_nowait # preempted ieee80211_do_stop clear sdata running drv_remove_interface wait_for_completion_timeout ath10k_mac_vif_beacon_cleanup free/unmap gen_beacon_dma(bcn->xxx) reinit_completion() ath10k_wmi_cmd_send_nowait() Interestingly, this also shows it has been a possible use-after-free without your patch. This is probably a unlikely scenario because there's a disproportionate amount of code between these flows for this to happen. There's also a memory leak because ath10k_mac_vif_beacon_free() exits without freeing arvif->beacon if beacon_state is SENDING, which it will be in in the above race case. With the per-skb beaconing case (which is dead code by the way it seems..) iommu wouldn't be happy either because we'd hand over an unmapped paddr to ath10k_wmi_cmd_send_nowait(). That's assuming kernel didn't crash due to use-after-free on arvif->beacon (stored in `bcn` on stack) before that. This feels like a candidate for rcu if you wanted to have it fixed neatly. It would fix both use-after-free and the locking conundrum - ath10k_remove_interface() could NULL the pointer, call_rcu() and wait_for_completion() in that call handler, and when its called back unmap/free the beacon. This would require a refactor to how beacon stuff is stored/maintained - a helper structure that would need carry the beacon stuff + rcu_head, and this structure would be allocated and assigned with rcu_assign_pointer() to a pointer in ath10k_vif. But I feel like it's asking for unbound allocations or other bugs, especially since I'm... Not sure if it makes any sense. I would probably just rip out the arvif->beacon code completely instead and keep arvif->beacon_buf only (memcpy on swba and immediatelly free the skbuff). No point in fixing dead code, is there? Since you have beacon tx completion you can consider preventing beacon corruption, similar to how disabled vsync causes frame tearing, by avoiding memcpy() if completion for previous one didn't come. Not sure how relevant that is though because if fw/hw are having hard time transmitting a beacon then the RF condition must be really harsh and unusably bad anyway. Michał