On 2021/4/19 22:57, Michal Kubecek wrote:
> On Mon, Apr 19, 2021 at 10:04:27AM +0800, Yunsheng Lin wrote:
>>>
>>> I tried this patch on top of 5.12-rc7 with real devices. I used two
>>> machines with 10Gb/s Intel ixgbe NICs, sender has 16 CPUs (2 8-core
>>> CPUs with HT disabled) and 16 Rx/Tx queues, receiver has 48 CPUs
>>> (2 12-core CPUs with HT enabled) and 48 Rx/Tx queues. With multiple
>>> TCP streams on a saturated ethernet, the CPU consumption grows quite
>>> a lot:
>>>
>>>         threads    unpatched 5.12-rc7    5.12-rc7 + v3
>>>               1                 25.6%            30.6%
>>>               8                 73.1%           241.4%
>>>             128                132.2%          1012.0%
>>>
>>> (The values are normalized to one core, i.e. 100% corresponds to one
>>> fully used logical CPU.) I didn't perform a full statistical
>>> evaluation but the growth is way beyond any statistical fluctuation,
>>> with one exception: the 8-thread test of the patched kernel showed
>>> values from 155.5% to 311.4%. A closer look shows that most of the
>>> CPU time was spent in softirq, and running top in parallel with the
>>> test confirms that there are multiple ksoftirqd threads running at
>>> 100% CPU. I had a similar problem with earlier versions of my patch
>>> (work in progress, I still need to check some corner cases and write
>>> a commit message explaining the logic).
>>
>> Great, if there is a better idea, maybe share the core idea first so
>> that we both can work on it?
>
> I'm not sure if it's really better, but to minimize the false
> positives and unnecessary calls to __netif_schedule(), I replaced
> q->seqlock with an atomic combination of a "running" flag (which
> corresponds to the current seqlock being locked) and a "drainers"
> count (the number of other threads going to clean up the qdisc
> queue). This way we can keep track of them and get reliable
> information on whether another thread is going to run a cleanup after
> we leave the qdisc_run() critical section (so that there is no need
> to schedule).

It seems you are trying to match the skb enqueuing with the calling of
__qdisc_run() here, which is not reliable when considering the dequeue
batching, see try_bulk_dequeue_skb() or try_bulk_dequeue_skb_slow() in
dequeue_skb().

>
>>> The biggest problem IMHO is that the loop in __qdisc_run() may
>>> finish without rescheduling not only when the qdisc queue is empty
>>> but also when the corresponding device Tx queue is stopped, which
>>> devices tend to do whenever they cannot send any more packets out.
>>> Thus whenever __QDISC_STATE_NEED_RESCHEDULE is set with the device
>>> queue stopped or frozen, we keep rescheduling the queue cleanup
>>> without any chance to make progress or to clear
>>> __QDISC_STATE_NEED_RESCHEDULE. For this to happen, all we need is
>>> another thread to fail the first spin_trylock() while the device
>>> queue is stopped and the qdisc queue is not empty.
>>
>> Yes, we could just return false before doing the second
>> spin_trylock() if the netdev queue corresponding to the qdisc is
>> stopped, and when the netdev queue is restarted, __netif_schedule()
>> is called again, see netif_tx_wake_queue().
>>
>> Maybe add a sch_qdisc_stopped() function and do the testing in
>> qdisc_run_begin():
>>
>>	if (dont_retry || sch_qdisc_stopped(q))
>>		return false;
>>
>> bool sch_qdisc_stopped(struct Qdisc *q)
>> {
>>	const struct netdev_queue *txq = q->dev_queue;
>>
>>	if (netif_xmit_frozen_or_stopped(txq))
>>		return true;
>>
>>	return false;
>> }
>>
>> At least for a qdisc with the TCQ_F_ONETXQUEUE flag set this is
>> doable?
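
To be more specific, below is an untested sketch of what I had in
mind. The TCQ_F_ONETXQUEUE check is my assumption about where this is
safe, since q->dev_queue only identifies "the" tx queue for such a
qdisc:

static bool sch_qdisc_stopped(struct Qdisc *q)
{
	const struct netdev_queue *txq = q->dev_queue;

	/* Only meaningful when the qdisc feeds a single tx queue,
	 * i.e. TCQ_F_ONETXQUEUE; be conservative otherwise.
	 */
	if (!(q->flags & TCQ_F_ONETXQUEUE))
		return false;

	return netif_xmit_frozen_or_stopped(txq);
}

Returning false from qdisc_run_begin() in that case should not lose a
rescheduling, because netif_tx_wake_queue() calls __netif_schedule()
when the stopped/frozen queue is restarted.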
>
> Either this or you can do the check in qdisc_run_end() - when the
> device queue is stopped or frozen, there is no need to schedule as we
> know it's going to be done when the flag is cleared again (and we
> cannot do anything until then anyway).
>
>>> Another part of the problem may be that, to avoid the race, the
>>> logic is too pessimistic: consider e.g. (dotted lines show
>>> "barriers" where ordering is important):
>>>
>>>     CPU A                           CPU B
>>>      spin_trylock() succeeds
>>>                                      pfifo_fast_enqueue()
>>> ..................................................................
>>>      skb_array empty, exit loop
>>>                                      first spin_trylock() fails
>>>                                      set __QDISC_STATE_NEED_RESCHEDULE
>>>                                      second spin_trylock() fails
>>> ..................................................................
>>>      spin_unlock()
>>>      call __netif_schedule()
>>>
>>> When we switch the order of spin_unlock() on CPU A and the second
>>> spin_trylock() on CPU B (but keep setting
>>> __QDISC_STATE_NEED_RESCHEDULE before CPU A tests it), we end up
>>> scheduling a queue cleanup even if there is already one running.
>>> And either of these is quite realistic.
>>
>> But I am not sure whether it is a good thing or a bad thing when the
>> above happens, because it may enable tx batching?
>
> In either of the two scenarios, we call __netif_schedule() to
> schedule a cleanup which does not do anything useful. In the first,
> the qdisc queue is empty so that either the scheduled cleanup finds
> it empty or there will be some new packets which would have their own
> cleanup. In the second, scheduling a cleanup will not prevent the
> other thread from doing the cleanup it has already started.

The main idea is that the thread holding q->seqlock does as much
dequeuing as possible while other threads are enqueuing skbs, which
seems to make it possible to avoid the above case?

>
>>>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>>>> index 44991ea..4953430 100644
>>>> --- a/net/sched/sch_generic.c
>>>> +++ b/net/sched/sch_generic.c
>>>> @@ -640,8 +640,10 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>>>>  {
>>>>  	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
>>>>  	struct sk_buff *skb = NULL;
>>>> +	bool need_retry = true;
>>>>  	int band;
>>>>  
>>>> +retry:
>>>>  	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
>>>>  		struct skb_array *q = band2list(priv, band);
>>>>  
>>>> @@ -652,6 +654,16 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>>>>  	}
>>>>  	if (likely(skb)) {
>>>>  		qdisc_update_stats_at_dequeue(qdisc, skb);
>>>> +	} else if (need_retry &&
>>>> +		   test_and_clear_bit(__QDISC_STATE_NEED_RESCHEDULE,
>>>> +				      &qdisc->state)) {
>>>> +		/* do another dequeue attempt after clearing the flag
>>>> +		 * to avoid calling __netif_schedule().
>>>> +		 */
>>>> +		smp_mb__after_atomic();
>>>> +		need_retry = false;
>>>> +
>>>> +		goto retry;
>>>>  	} else {
>>>>  		WRITE_ONCE(qdisc->empty, true);
>>>>  	}
>>>
>>> Does the retry really provide a significant improvement? IIUC it
>>> only helps if all of the enqueue, the first spin_trylock() failure
>>> and the setting of the __QDISC_STATE_NEED_RESCHEDULE flag happen
>>> between us finding the skb_array empty and checking the flag. If
>>> the enqueue happens before we check the array (and the flag is set
>>> before the check), the retry is useless. If the flag is set after
>>> we check it, we don't catch it (no matter if the enqueue happened
>>> before or after we found the skb_array empty).
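
Going back to the check in qdisc_run_end() you mentioned above: I
suppose the nolock branch would end up looking roughly like the
untested fragment below (I have not checked exactly how
qdisc_run_end() looks on top of v3, so treat it as pseudo code):

	if (qdisc->flags & TCQ_F_NOLOCK) {
		spin_unlock(&qdisc->seqlock);

		/* No point in scheduling another run while the tx
		 * queue is stopped/frozen: netif_tx_wake_queue() will
		 * call __netif_schedule() when it is restarted.
		 */
		if (unlikely(test_bit(__QDISC_STATE_NEED_RESCHEDULE,
				      &qdisc->state)) &&
		    !netif_xmit_frozen_or_stopped(qdisc->dev_queue))
			__netif_schedule(qdisc);
	}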
>> In order to avoid doing the "set_bit(__QDISC_STATE_MISSED,
>> &qdisc->state)" as much as possible, __QDISC_STATE_NEED_RESCHEDULE
>> needs to stay set as much as possible, so it is only cleared when
>> the queue is empty.
>
> This is what I'm worried about. We are trying to address a race
> condition which is extremely rare, so we should avoid adding too much
> overhead to the normal use.
>
>> And it has about a 5% performance improvement.
>
> OK then. I thought that it would do an unnecessary dequeue retry much
> more often than prevent an unnecessary __netif_schedule(), but I did
> not run any special benchmark to check.

When the netdev tx queue is not stopped:
1. if __QDISC_STATE_NEED_RESCHEDULE is set and there are a lot of skbs
   to be dequeued, it is likely that __netif_schedule() has already
   been called in __qdisc_run(), so the __netif_schedule() called in
   qdisc_run_end() is a no-op.
2. if __QDISC_STATE_NEED_RESCHEDULE is set during the data race this
   patch is trying to fix, then we do need to call __netif_schedule().
3. otherwise __QDISC_STATE_NEED_RESCHEDULE is cleared by the
   test_and_clear_bit() added in pfifo_fast_dequeue().

The main point here is to delay clearing the
__QDISC_STATE_NEED_RESCHEDULE bit as much as possible so that the
set_bit() and the second spin_trylock() are avoided.

>
> Michal
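
By the way, regarding point 1 above: the current __qdisc_run() already
reschedules itself when the tx quota runs out, which is why the extra
__netif_schedule() in qdisc_run_end() is likely a no-op in the busy
case (quoting mainline from memory, so it may differ slightly):

void __qdisc_run(struct Qdisc *q)
{
	int quota = dev_tx_weight;
	int packets;

	while (qdisc_restart(q, &packets)) {
		quota -= packets;
		if (quota <= 0) {
			/* more work to do but quota is used up:
			 * defer the rest to the tx softirq.
			 */
			__netif_schedule(q);
			break;
		}
	}
}

And __netif_schedule() itself only queues the qdisc once (it does
test_and_set_bit() on __QDISC_STATE_SCHED), so calling it again from
qdisc_run_end() while the qdisc is already scheduled is harmless.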