From: Kloetzke Jan <Jan.Kloetzke@xxxxxxx>
Date: Thu, 16 May 2019 07:10:30 +0000

> On Monday, 06.05.2019 at 10:17 +0200, Oliver Neukum wrote:
>> On Sun, 2019-05-05 at 00:45 -0700, David Miller wrote:
>> > From: Kloetzke Jan <Jan.Kloetzke@xxxxxxx>
>> > Date: Tue, 30 Apr 2019 14:15:07 +0000
>> >
>> > > @@ -1431,6 +1432,11 @@ netdev_tx_t usbnet_start_xmit (struct sk_buff *skb,
>> > >  		spin_unlock_irqrestore(&dev->txq.lock, flags);
>> > >  		goto drop;
>> > >  	}
>> > > +	if (WARN_ON(netif_queue_stopped(net))) {
>> > > +		usb_autopm_put_interface_async(dev->intf);
>> > > +		spin_unlock_irqrestore(&dev->txq.lock, flags);
>> > > +		goto drop;
>> > > +	}
>> >
>> > If this is known to happen and is expected, then we should not warn.
>> >
>>
>> yes this is the point. Can ndo_start_xmit() and ndo_stop() race?
>> If not, why does the patch fix the observed issue and what
>> prevents the race? Something is not clear here.
>
> Dave, could you shed some light on Oliver's question? If the race can
> happen then we can stick to v1 because the WARN_ON is indeed pointless.
> Otherwise it's not clear why it made the problem go away for us and v2
> may be the better option...

Yes, I think they can race. ->ndo_stop() executes and stops the queue,
then we get an RCU grace period so that all parallel executions of
->ndo_start_xmit() complete.

But I wonder, this can probably cause problems because some drivers
have "stop queue and re-check" logic, e.g. in drivers/net/tg3.c we
have:

	if (unlikely(tg3_tx_avail(tnapi) <= (MAX_SKB_FRAGS + 1))) {
		netif_tx_stop_queue(txq);

		/* netif_tx_stop_queue() must be done before
		 * checking tx index in tg3_tx_avail() below, because in
		 * tg3_tx(), we update tx index before checking for
		 * netif_tx_queue_stopped().
		 */
		smp_mb();
		if (tg3_tx_avail(tnapi) > TG3_TX_WAKEUP_THRESH(tnapi))
			netif_tx_wake_queue(txq);
	}

which in the racy scenario would undo ->ndo_stop()'s work, which is
completely unexpected.

Hmmm...
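
To make the interleaving concrete, here is a rough user-space sketch of the ordering described above. It is only a single-threaded replay of one possible schedule, not kernel code: every `model_*` name, the thresholds, and the descriptor counts are made up for illustration, and the "guarded" variant only approximates the idea of bailing out when the queue is already stopped, as the quoted hunk does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model. None of these names are kernel APIs. */
struct model_queue {
	bool stopped;	/* stands in for the netdev TX queue "stopped" state */
	int tx_avail;	/* free TX descriptors in an imaginary ring */
};

/* Illustrative thresholds, chosen arbitrarily for the sketch. */
enum { MODEL_STOP_THRESH = 4, MODEL_WAKE_THRESH = 8 };

/* Models ->ndo_stop(): the stack stops the queue. */
static void model_ndo_stop(struct model_queue *q)
{
	q->stopped = true;
}

/* Models a TX completion that frees descriptors while xmit is in flight. */
static void model_tx_complete(struct model_queue *q, int freed)
{
	q->tx_avail += freed;
}

/*
 * Models the tail of a "stop queue and re-check" transmit path.
 * If 'guarded' is set, an already stopped queue is left alone,
 * which is roughly what an early netif_queue_stopped() check buys.
 * Returns true if this call woke the queue.
 */
static bool model_xmit_tail(struct model_queue *q, bool guarded, int freed)
{
	if (guarded && q->stopped)
		return false;

	if (q->tx_avail <= MODEL_STOP_THRESH) {
		q->stopped = true;
		/* A completion slips in between the stop and the re-check. */
		model_tx_complete(q, freed);
		if (q->tx_avail > MODEL_WAKE_THRESH) {
			q->stopped = false;	/* wake: undoes model_ndo_stop() */
			return true;
		}
	}
	return false;
}

/*
 * Replays: ndo_stop runs first, then a transmit that was already in
 * flight reaches its stop-and-re-check tail.
 * Returns the final "stopped" state of the queue.
 */
static bool model_replay(bool guarded)
{
	struct model_queue q = { .stopped = false, .tx_avail = 3 };

	model_ndo_stop(&q);
	assert(q.stopped);
	model_xmit_tail(&q, guarded, 10);
	return q.stopped;
}
```

In this model, model_replay(false) ends with the queue running again even though the stop already happened, while model_replay(true) leaves it stopped. Whether a real driver can actually hit this window depends on locking and RCU details that the sketch does not attempt to capture.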