On 2021/4/19 10:04, Yunsheng Lin wrote:
> On 2021/4/19 6:59, Michal Kubecek wrote:
>> On Thu, Mar 25, 2021 at 11:13:11AM +0800, Yunsheng Lin wrote:
>>> Lockless qdisc has the below concurrent problem:
>>>
>>>         cpu0                     cpu1
>>>          .                         .
>>>     q->enqueue                     .
>>>          .                         .
>>>     qdisc_run_begin()              .
>>>          .                         .
>>>     dequeue_skb()                  .
>>>          .                         .
>>>     sch_direct_xmit()              .
>>>          .                         .
>>>          .                    q->enqueue
>>>          .                 qdisc_run_begin()
>>>          .                return and do nothing
>>>          .                         .
>>>     qdisc_run_end()                .
>>>
>>> cpu1 enqueues an skb without calling __qdisc_run() because cpu0
>>> has not released the lock yet and spin_trylock() returns false
>>> for cpu1 in qdisc_run_begin(), and cpu0 does not see the skb
>>> enqueued by cpu1 when calling dequeue_skb() because cpu1 may
>>> enqueue the skb after cpu0 calls dequeue_skb() and before
>>> cpu0 calls qdisc_run_end().
>>>
>>> Lockless qdisc has another concurrent problem when tx_action
>>> is involved:
>>>
>>> cpu0(serving tx_action)        cpu1                cpu2
>>>           .                      .                   .
>>>           .                 q->enqueue               .
>>>           .              qdisc_run_begin()           .
>>>           .                dequeue_skb()             .
>>>           .                      .               q->enqueue
>>>           .                      .                   .
>>>           .              sch_direct_xmit()           .
>>>           .                      .           qdisc_run_begin()
>>>           .                      .         return and do nothing
>>>           .                      .                   .
>>>  clear __QDISC_STATE_SCHED       .                   .
>>>  qdisc_run_begin()               .                   .
>>>  return and do nothing           .                   .
>>>           .                      .                   .
>>>           .               qdisc_run_end()            .
>>>
>>> This patch fixes the above data race by:
>>> 1. Get the flag before doing spin_trylock().
>>> 2. If the first spin_trylock() returns false and the flag was not
>>>    set before the first spin_trylock(), set the flag and retry
>>>    another spin_trylock() in case the other CPU may not see the
>>>    new flag after it releases the lock.
>>> 3. Reschedule if the flag is set after the lock is released at
>>>    the end of qdisc_run_end().
>>>
>>> For the tx_action case, the flag is also set when cpu1 is at the
>>> end of qdisc_run_end(), so tx_action will be rescheduled again
>>> to dequeue the skb enqueued by cpu2.
>>>
>>> Only clear the flag before retrying a dequeuing when dequeuing
>>> returns NULL in order to reduce the overhead of the above double
>>> spin_trylock() and __netif_schedule() calling.
>>>
>>> The performance impact of this patch, tested using pktgen and a
>>> dummy netdev with pfifo_fast qdisc attached:
>>>
>>>  threads  without+this_patch   with+this_patch     delta
>>>     1         2.61Mpps            2.60Mpps         -0.3%
>>>     2         3.97Mpps            3.82Mpps         -3.7%
>>>     4         5.62Mpps            5.59Mpps         -0.5%
>>>     8         2.78Mpps            2.77Mpps         -0.3%
>>>    16         2.22Mpps            2.22Mpps         -0.0%
>>>
>>> Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
>>> Signed-off-by: Yunsheng Lin <linyunsheng@xxxxxxxxxx>
>>> ---
>>> V3: fix a compile error and a few comment typos, remove the
>>>     __QDISC_STATE_DEACTIVATED checking, and update the
>>>     performance data.
>>> V2: Avoid the overhead of fixing the data race as much as
>>>     possible.
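For clarity, the begin()/end() pair described by steps 1~3 above looks
roughly like the sketch below. This is only a simplified illustration of
the idea using the v3 flag name, not the exact patch hunks:

static inline bool qdisc_run_begin(struct Qdisc *qdisc)
{
	if (qdisc->flags & TCQ_F_NOLOCK) {
		/* 1. read the flag before the first trylock */
		bool missed = test_bit(__QDISC_STATE_NEED_RESCHEDULE,
				       &qdisc->state);

		if (spin_trylock(&qdisc->seqlock))
			goto nolock_empty;

		/* 2. the lock is held by another cpu: if the flag was
		 * already set before the first trylock, give up and let
		 * the lock owner do the rescheduling; otherwise set the
		 * flag and retry once in case the lock owner did not
		 * see the new flag before releasing the lock.
		 */
		if (missed)
			return false;

		set_bit(__QDISC_STATE_NEED_RESCHEDULE, &qdisc->state);

		if (!spin_trylock(&qdisc->seqlock))
			return false;

nolock_empty:
		WRITE_ONCE(qdisc->empty, false);
	} else if (qdisc_is_running(qdisc)) {
		return false;
	}

	/* rest of qdisc_run_begin() unchanged */
	return true;
}

static inline void qdisc_run_end(struct Qdisc *qdisc)
{
	/* existing qdisc_run_end() logic stays as-is, plus: */
	if (qdisc->flags & TCQ_F_NOLOCK) {
		spin_unlock(&qdisc->seqlock);

		/* 3. another cpu failed to take the lock while we held
		 * it, reschedule the qdisc so the skb it enqueued is
		 * not left behind.
		 */
		if (unlikely(test_bit(__QDISC_STATE_NEED_RESCHEDULE,
				      &qdisc->state)))
			__netif_schedule(qdisc);
	}
}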
>>
>> I tried this patch on top of 5.12-rc7 with real devices. I used two
>> machines with 10Gb/s Intel ixgbe NICs, sender has 16 CPUs (2 8-core CPUs
>> with HT disabled) and 16 Rx/Tx queues, receiver has 48 CPUs (2 12-core
>> CPUs with HT enabled) and 48 Rx/Tx queues. With multiple TCP streams on
>> a saturated ethernet, the CPU consumption grows quite a lot:
>>
>>   threads    unpatched 5.12-rc7    5.12-rc7 + v3
>>         1            25.6%              30.6%
>>         8            73.1%             241.4%
>>       128           132.2%            1012.0%

I cannot read too much into the above numbers, but I understand that v3
has a cpu usage impact when applied on top of 5.12-rc7, so I did a test
too, on an arm64 system with a hns3 100G netdev, which is in node 0, and
node 0 has 32 cpus:

root@(none)$ cat /sys/class/net/eth4/device/numa_node
0
root@(none)$ numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 0 size: 128646 MB
node 0 free: 127876 MB
node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 1 size: 129019 MB
node 1 free: 128796 MB
node 2 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 2 size: 129019 MB
node 2 free: 128755 MB
node 3 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 3 size: 127989 MB
node 3 free: 127772 MB
node distances:
node   0   1   2   3
  0:  10  16  32  33
  1:  16  10  25  32
  2:  32  25  10  16
  3:  33  32  16  10

And I use the below cmds to use only 16 tx/rx queues with a tx queue depth
of 72 in order to trigger the tx queue stopping handling:

ethtool -L eth4 combined 16
ethtool -G eth4 tx 72

threads   unpatched         patched_v4        patched_v4+queue_stopped_opt
  16      11% (si: 3.8%)    20% (si: 13.5%)   11% (si: 3.8%)
  32      12% (si: 4.4%)    30% (si: 22%)     13% (si: 4.4%)
  64      13% (si: 4.9%)    35% (si: 26%)     13% (si: 4.7%)

"11% (si: 3.8%)": 11% is the total cpu usage of node 0, and 3.8% is the
softirq cpu usage. The thread number is the *thread* parameter in the
below iperf cmd:

taskset -c 0-31 iperf -c 192.168.100.2 -t 1000000 -i 1 -P *thread*

It seems that after applying the queue_stopped_opt patch, the cpu usage is
close to the unpatched one, at least with my testcase. Maybe you can try
your testcase with the queue_stopped_opt patch to see if it makes any
difference?

patched_v4:
https://patchwork.kernel.org/project/netdevbpf/patch/1618535809-11952-2-git-send-email-linyunsheng@xxxxxxxxxx/

And queue_stopped_opt:

diff --git a/net/core/dev.c b/net/core/dev.c
index 5d6f2e2..62008d5 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3762,7 +3762,9 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 
 	if (q->flags & TCQ_F_NOLOCK) {
 		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
-		qdisc_run(q);
+		if (likely(rc == NET_XMIT_SUCCESS &&
+			   !netif_xmit_frozen_or_stopped(txq)))
+			qdisc_run(q);
 
 		if (unlikely(to_free))
 			kfree_skb_list(to_free);
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 6d7f954..079a825 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -74,6 +74,7 @@ static inline struct sk_buff *__skb_dequeue_bad_txq(struct Qdisc *q)
 			}
 		} else {
 			skb = SKB_XOFF_MAGIC;
+			clear_bit(__QDISC_STATE_MISSED, &q->state);
 		}
 	}
 
@@ -242,6 +243,7 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
 			}
 		} else {
 			skb = NULL;
+			clear_bit(__QDISC_STATE_MISSED, &q->state);
 		}
 		if (lock)
 			spin_unlock(lock);
@@ -251,8 +253,10 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
 	*validate = true;
 
 	if ((q->flags & TCQ_F_ONETXQUEUE) &&
-	    netif_xmit_frozen_or_stopped(txq))
+	    netif_xmit_frozen_or_stopped(txq)) {
+		clear_bit(__QDISC_STATE_MISSED, &q->state);
 		return skb;
+	}
 
 	skb = qdisc_dequeue_skb_bad_txq(q);
 	if (unlikely(skb)) {
@@ -311,6 +315,8 @@ bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
 		HARD_TX_LOCK(dev, txq, smp_processor_id());
 		if (!netif_xmit_frozen_or_stopped(txq))
 			skb = dev_hard_start_xmit(skb, dev, txq, &ret);
+		else
+			clear_bit(__QDISC_STATE_MISSED, &q->state);
 
 		HARD_TX_UNLOCK(dev, txq);
 	} else {
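For completeness: skipping qdisc_run() when the txq is stopped or frozen
relies on the driver restarting the queue later; restarting the queue
reschedules the qdisc, so any skb left in the qdisc is still dequeued from
net_tx_action(). The wake-up path is roughly the below (a simplified
sketch, see netif_tx_wake_queue() in net/core/dev.c for the real code):

void netif_tx_wake_queue(struct netdev_queue *dev_queue)
{
	/* only reschedule when the queue was actually stopped */
	if (test_and_clear_bit(__QUEUE_STATE_DRV_XOFF, &dev_queue->state)) {
		struct Qdisc *q;

		rcu_read_lock();
		q = rcu_dereference(dev_queue->qdisc);
		__netif_schedule(q);	/* qdisc_run() happens later in
					 * net_tx_action()
					 */
		rcu_read_unlock();
	}
}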
>>
>> (The values are normalized to one core, i.e. 100% corresponds to one
>> fully used logical CPU.) I didn't perform a full statistical evaluation
>> but the growth is way beyond any statistical fluctuation, with one
>> exception: the 8-thread test of the patched kernel showed values from
>> 155.5% to 311.4%. A closer look shows that most of the CPU time was
>> spent in softirq, and running top in parallel with the test confirms
>> that there are multiple ksoftirqd threads running at 100% CPU. I had a
>> similar problem with earlier versions of my patch (work in progress, I
>> still need to check some corner cases and write a commit message
>> explaining the logic)
>
> Great, if there is a better idea, maybe share the core idea first so
> that we both can work on it?
>
>> and tracing confirmed that a similar problem (non-empty queue, no other
>> thread going to clean it up but device queue stopped) was happening
>> repeatedly most of the time.
>
> Makes sense, maybe that is why the dummy netdevice cannot catch this
> kind of performance degradation: it always consumes the skb without
> stopping the tx queue, while a real netdevice with limited queue depth
> may stop the queue when there are multiple skbs queuing concurrently.
>
> I think for the above to happen, there may be a lot of tx doorbell
> batching happening; it would be better to see how many packets are
> enqueued at a time when the trace_qdisc_dequeue() tracepoint is enabled?
>
>>
>> The biggest problem IMHO is that the loop in __qdisc_run() may finish
>> without rescheduling not only when the qdisc queue is empty but also
>> when the corresponding device Tx queue is stopped, which devices tend
>> to do whenever they cannot send any more packets out. Thus whenever
>> __QDISC_STATE_NEED_RESCHEDULE is set with the device queue stopped or
>> frozen, we keep rescheduling the queue cleanup without any chance to
>> make progress or clear __QDISC_STATE_NEED_RESCHEDULE. For this to
>> happen, all we need is another thread failing the first spin_trylock()
>> while the device queue is stopped and the qdisc queue is not empty.
>
> Yes, we could just return false before doing the second spin_trylock()
> if the netdev queue corresponding to the qdisc is stopped, and when the
> netdev queue is restarted, __netif_schedule() is called again, see
> netif_tx_wake_queue().
>
> Maybe add a sch_qdisc_stopped() function and do the testing in
> qdisc_run_begin():
>
> 	if (dont_retry || sch_qdisc_stopped(q))
> 		return false;
>
> bool sch_qdisc_stopped(struct Qdisc *q)
> {
> 	const struct netdev_queue *txq = q->dev_queue;
>
> 	if (netif_xmit_frozen_or_stopped(txq))
> 		return true;
>
> 	return false;
> }
>
> At least for a qdisc with the TCQ_F_ONETXQUEUE flag set this is doable?
>
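To show where the above check could sit, relative to the
qdisc_run_begin() sketch earlier in this mail (only an illustration; the
exact placement, e.g. before or after the set_bit(), is still to be
decided):

	/* TCQ_F_NOLOCK path, after the first spin_trylock() failed:
	 * give up when the flag was already set or when the device
	 * queue is stopped/frozen - a stopped queue is rescheduled
	 * from netif_tx_wake_queue() when the driver restarts it, so
	 * retrying the trylock here cannot make progress anyway.
	 */
	if (missed || sch_qdisc_stopped(qdisc))
		return false;

	set_bit(__QDISC_STATE_NEED_RESCHEDULE, &qdisc->state);

	if (!spin_trylock(&qdisc->seqlock))
		return false;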
>>
>> Another part of the problem may be that, to avoid the race, the logic
>> is too pessimistic: consider e.g. (dotted lines show "barriers" where
>> ordering is important):
>>
>>     CPU A                             CPU B
>>     spin_trylock() succeeds
>>                                       pfifo_fast_enqueue()
>>     ..............................................................
>>     skb_array empty, exit loop
>>                                       first spin_trylock() fails
>>                                       set __QDISC_STATE_NEED_RESCHEDULE
>>                                       second spin_trylock() fails
>>     ..............................................................
>>     spin_unlock()
>>     call __netif_schedule()
>>
>> When we switch the order of spin_unlock() on CPU A and the second
>> spin_trylock() on CPU B (but keep setting __QDISC_STATE_NEED_RESCHEDULE
>> before CPU A tests it), we end up scheduling a queue cleanup even if
>> there is already one running. And either of these is quite realistic.
>
> But I am not sure whether it is a good thing or a bad thing when the
> above happens, because it may be able to enable tx batching?
>
>>
>>> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
>>> index f7a6e14..e3f46eb 100644
>>> --- a/include/net/sch_generic.h
>>> +++ b/include/net/sch_generic.h
>>> @@ -36,6 +36,7 @@ struct qdisc_rate_table {
>>>  enum qdisc_state_t {
>>>  	__QDISC_STATE_SCHED,
>>>  	__QDISC_STATE_DEACTIVATED,
>>> +	__QDISC_STATE_NEED_RESCHEDULE,
>>>  };
>>
>> I'm not sure if putting the flag here is a good idea. If you look at
>> the history of struct Qdisc reshufflings, this part (cacheline) should
>> be used for members which don't change too often. However, this new
>> flag is going to be touched about as often and in similar places as
>> q->seqlock or q->empty, so it should probably be in the last cacheline
>> with them.
>
> I am not sure about that either, but we can always adjust its location
> if it proves to be the case :)
>
>>
>> [...]
>>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>>> index 44991ea..4953430 100644
>>> --- a/net/sched/sch_generic.c
>>> +++ b/net/sched/sch_generic.c
>>> @@ -640,8 +640,10 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>>>  {
>>>  	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
>>>  	struct sk_buff *skb = NULL;
>>> +	bool need_retry = true;
>>>  	int band;
>>>  
>>> +retry:
>>>  	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
>>>  		struct skb_array *q = band2list(priv, band);
>>>  
>>> @@ -652,6 +654,16 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>>>  	}
>>>  	if (likely(skb)) {
>>>  		qdisc_update_stats_at_dequeue(qdisc, skb);
>>> +	} else if (need_retry &&
>>> +		   test_and_clear_bit(__QDISC_STATE_NEED_RESCHEDULE,
>>> +				      &qdisc->state)) {
>>> +		/* do another dequeuing after clearing the flag to
>>> +		 * avoid calling __netif_schedule().
>>> +		 */
>>> +		smp_mb__after_atomic();
>>> +		need_retry = false;
>>> +
>>> +		goto retry;
>>>  	} else {
>>>  		WRITE_ONCE(qdisc->empty, true);
>>>  	}
>>
>> Does the retry really provide significant improvement? IIUC it only
>> helps if all of enqueue, first spin_trylock() failure and setting the
>> __QDISC_STATE_NEED_RESCHEDULE flag happen between us finding the
>> skb_array empty and checking the flag. If the enqueue happens before we
>> check the array (and the flag is set before the check), the retry is
>> useless. If the flag is set after we check it, we don't catch it (no
>> matter if the enqueue happened before or after we found the skb_array
>> empty).
>
> In order to avoid doing the "set_bit(__QDISC_STATE_MISSED, &qdisc->state)"

__QDISC_STATE_MISSED is the V4 name, so the above should read:
set_bit(__QDISC_STATE_NEED_RESCHEDULE, &qdisc->state)

The above data race is not really the main concern here; the main concern
is when to clear __QDISC_STATE_NEED_RESCHEDULE in order to avoid multiple
cpus doing the set_bit().

> as much as possible, the __QDISC_STATE_NEED_RESCHEDULE needs to be kept
> set as much as possible, so only clear __QDISC_STATE_NEED_RESCHEDULE
> when the queue is empty.
> And it has about 5% performance improvement.
>
>>
>> Michal
>>
> _______________________________________________
> Linuxarm mailing list -- linuxarm@xxxxxxxxxxxxx
> To unsubscribe send an email to linuxarm-leave@xxxxxxxxxxxxx