On Tue, 11 May 2021 20:13:56 +0800 Yunsheng Lin wrote:
> On 2021/5/11 17:04, Yunsheng Lin wrote:
> > On 2021/5/11 12:22, Jakub Kicinski wrote:
> >> The queues are woken asynchronously without holding any locks via
> >> netif_tx_wake_queue(). Theoretically we can have a situation where:
> >>
> >>    CPU 0                          CPU 1
> >>      .                              .
> >>    dequeue_skb()                    .
> >>    netif_xmit_frozen..() # true     .
> >>      .                            [IRQ]
> >>      .                            netif_tx_wake_queue()
> >>      .                            <end of IRQ>
> >>      .                            net_tx_action()
> >>      .                            set MISSED
> >>    clear MISSED                     .
> >>    return NULL                      .
> >>    ret from qdisc_restart()
> >>    ret from __qdisc_run()
> >>    qdisc_run_end()
> [...]
> >
> > Yes, the above does seem to have that data race.
> >
> > As I understand it, there are two ways to fix the above data race:
> > 1. Do not clear STATE_MISSED for the netif_xmit_frozen_or_stopped()
> >    case; just check netif_xmit_frozen_or_stopped() before calling
> >    __netif_schedule() at the end of qdisc_run_end(). This seems to
> >    only work for qdiscs with the TCQ_F_ONETXQUEUE flag, because it
> >    seems we can only check netif_xmit_frozen_or_stopped() against
> >    q->dev_queue, and I am not sure which netdev queue q->dev_queue
> >    points to when the qdisc is not set with the TCQ_F_ONETXQUEUE
> >    flag.

Isn't the case where we have a NOLOCK qdisc without TCQ_F_ONETXQUEUE
rather unexpected? It'd have to be a single pfifo on a multi-queue
netdev, right? Sounds not worth optimizing for.

How about:

static inline void qdisc_run_end(struct Qdisc *qdisc)
{
	write_seqcount_end(&qdisc->running);

	if (qdisc->flags & TCQ_F_NOLOCK) {
		spin_unlock(&qdisc->seqlock);

		if (unlikely(test_bit(__QDISC_STATE_MISSED,
				      &qdisc->state))) {
			clear_bit(__QDISC_STATE_MISSED, &qdisc->state);

			if (!(qdisc->flags & TCQ_F_ONETXQUEUE) ||
			    !netif_xmit_frozen_or_stopped(qdisc->dev_queue))
				__netif_schedule(qdisc);
		}
	}
}

For the strange non-ONETXQUEUE case we'd have an occasional unnecessary
net_tx_action, but no infinite loop possible.

> > 2.
> >    clearing the STATE_MISSED for the netif_xmit_frozen_or_stopped()
> >    case, as this patch does, and protect the __netif_schedule() with
> >    q->seqlock for netif_tx_wake_queue(), which might bring
> >    unnecessary overhead for the non-stopped queue case.
> >
> > Any better idea?
>
> 3. Or check netif_xmit_frozen_or_stopped() again after clearing
> STATE_MISSED, like below:
>
> 	if (netif_xmit_frozen_or_stopped(txq)) {
> 		clear_bit(__QDISC_STATE_MISSED, &q->state);
>
> 		/* Make sure the netif_xmit_frozen_or_stopped()
> 		 * check below happens after clearing STATE_MISSED.
> 		 */
> 		smp_mb__after_atomic();
>
> 		/* Check netif_xmit_frozen_or_stopped() again to make
> 		 * sure __QDISC_STATE_MISSED is re-set if a MISSED set
> 		 * by netif_tx_wake_queue()'s rescheduling of
> 		 * net_tx_action() was cleared by the clear_bit() above.
> 		 */
> 		if (!netif_xmit_frozen_or_stopped(txq))
> 			set_bit(__QDISC_STATE_MISSED, &q->state);
> 	}
>
> It is kind of ugly, but it does seem to fix the above data race too.
> And it seems like a common pattern for dealing with the concurrency
> between xmit and NAPI polling, as below:
>
> https://elixir.bootlin.com/linux/v5.12-rc2/source/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c#L1409

This is indeed the idiomatic way of dealing with the Tx queue stopping
race, but it's a bit of code to sprinkle around. My vote would be
option 1.
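FWIW, both the lost wake-up and the clear-then-recheck fix can be
replayed deterministically in a userspace sketch. This is purely
illustrative, not kernel code: queue_stopped models
netif_xmit_frozen_or_stopped(), state_missed models
__QDISC_STATE_MISSED, and atomic_thread_fence() stands in for
smp_mb__after_atomic().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model of the race; names are illustrative. */
static atomic_bool queue_stopped;   /* netif_xmit_frozen_or_stopped() */
static atomic_bool state_missed;    /* __QDISC_STATE_MISSED */

/* CPU 1, in IRQ context: wake the queue, then reschedule via MISSED. */
static void wake_queue(void)
{
	atomic_store(&queue_stopped, false);  /* netif_tx_wake_queue() */
	atomic_store(&state_missed, true);    /* net_tx_action() sets MISSED */
}

/* The buggy path from the diagram: CPU 0 saw the queue frozen and
 * unconditionally clears MISSED, losing a concurrent wake-up. */
static void clear_missed_naive(void)
{
	atomic_store(&state_missed, false);
}

/* Option 3: clear, full barrier, then recheck the queue state. */
static void clear_missed_with_recheck(void)
{
	atomic_store(&state_missed, false);
	/* Stands in for smp_mb__after_atomic(): order the clear
	 * before the recheck of the queue state. */
	atomic_thread_fence(memory_order_seq_cst);
	if (!atomic_load(&queue_stopped))
		atomic_store(&state_missed, true); /* wake-up raced in */
}

/* Replay the interleaving from the diagram, sequentially. */
static bool missed_after_race(void (*clear_missed)(void))
{
	atomic_store(&queue_stopped, true);  /* CPU 0: queue looks frozen */
	wake_queue();                        /* CPU 1: IRQ fires here     */
	clear_missed();                      /* CPU 0 resumes             */
	return atomic_load(&state_missed);
}
```

With the naive clear, MISSED ends up false and nobody reschedules the
qdisc; with the recheck, the wake-up survives.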