Hi Alexander,

alex.aring@xxxxxxxxx wrote on Sun, 13 Mar 2022 16:43:52 -0400:

> Hi,
>
> On Fri, Mar 4, 2022 at 5:54 AM Miquel Raynal <miquel.raynal@xxxxxxxxxxx> wrote:
> >
> > I had a second look at it and it appears to me that the issue was
> > already there and is structural. We just did not really care about
> > it because we did not bother with synchronization issues.
> >
>
> I am not sure I understand correctly. We stop the queue at some
> specific moment and we need to make sure that xmit_do() is not
> called, or cannot be called, anymore.
>
> I was thinking about:
>
> void ieee802154_disable_queue(struct ieee802154_hw *hw)
> {
> 	struct ieee802154_local *local = hw_to_local(hw);
> 	struct ieee802154_sub_if_data *sdata;
>
> 	rcu_read_lock();
> 	list_for_each_entry_rcu(sdata, &local->interfaces, list) {
> 		if (!sdata->dev)
> 			continue;
>
> 		netif_tx_disable(sdata->dev);
> 	}
> 	rcu_read_unlock();
> }
> EXPORT_SYMBOL(ieee802154_disable_queue);
>
> From my quick look, netif_tx_disable() ensures, by holding the right
> locks and calling netif_tx_stop_queue(), that no xmit_do() can be
> running while it is called or afterwards. There may still be
> transmissions in flight on the transceiver, but then your atomic
> counter and wait_event() come into play.

I went for a deeper investigation to understand how the net core calls
our callbacks. They are reached through dev_hard_start_xmit(), called
from __dev_queue_xmit(). This means the ieee802154 callback can only
run once at a time, because it is protected by the network device
transmit lock (netif_tx_lock()), which makes the logic safe and not
racy as I initially thought. This was the missing piece in my mental
model, I believe.

> We need to be sure there will be nothing queued anymore for
> transmission, from any context, which (in my opinion) tx_disable()
> does.
>
> We might need to review some netif callbacks...
> I have in mind for example stop(); maybe netif_tx_stop_queue() is
> enough (because the context is like netif_tx_disable(), holding
> similar locks, etc.), but we might want to be sure that nothing is
> going on anymore by using your wait_event() with counter.

I don't see a real reason anymore to use the tx_disable() call. Is
there any reason it could be needed that I don't have in mind? Right
now the only downside I see is that it could slightly delay the moment
where we actually stop the queue, because we would be waiting for the
lock to be released after the skb has been offloaded to hardware.
Perhaps we would let another frame be transmitted before we actually
get the lock.

> Is there any problem which I don't see?

One question however: as I understand it, if userspace tries to send
more packets, the "if (!stopped)" condition will be false and the xmit
call will simply be skipped, ending with a -ENETDOWN error [1]. Is
that what we want? I initially thought we could actually queue the
packets and wait for the queue to be re-enabled, but it does not look
easy.

[1] https://elixir.bootlin.com/linux/latest/source/net/core/dev.c#L4249

Thanks,
Miquèl