On Fri, Sep 22, 2023 at 5:39 AM Christian Marangi <ansuelsmth@xxxxxxxxx> wrote:
>
> On Fri, Sep 22, 2023 at 02:28:06PM +0200, Andrew Lunn wrote:
> > On Fri, Sep 22, 2023 at 01:12:47PM +0200, Christian Marangi wrote:
> > > Commit 8fce33317023 ("net: stmmac: Rework coalesce timer and fix
> > > multi-queue races") decreased the TX coalesce timer from 40ms to 1ms.
> > >
> > > This caused a performance regression on some targets (the regression
> > > was reported at least on ipq806x) on the order of 600mbps, dropping
> > > from gigabit handling to only 200mbps.
> > >
> > > The problem was identified as the TX timer getting armed too many
> > > times. While this was fixed and improved in another commit,
> > > performance can be improved even further by increasing the timer
> > > delay a bit, moving from 1ms to 5ms.

I am always looking for ways to improve interrupt service time, rather
than paper over the problem by increasing batchiness.

http://www.taht.net/~d/broadcom_aug9_2018.pdf

But I am also looking for hard data, particularly as to observed power
savings. How much power does upping this number save?

I have tried to question other assumptions more modern kernels are
making. In particular, I wish more folks would experiment with
decreasing the overlarge (IMHO) NAPI default of 64 packets to, say, 8
in the multi-queue case, benefiting the many arm cores still equipped
with limited cache, as well as looking at the impact of TLB flushes.
Other deferred multi-core processing that looks good on a modern xeon,
but might not be so good on a more limited arm, worries me.

Over here, an enormous test series was recently run against a bunch of
older arm64s, and it appears to indicate that memory bandwidth is a
source of problems:

https://docs.google.com/document/d/1HxIU_TEBI6xG9jRHlr8rzyyxFEN43zMcJXUFlRuhiUI/edit

We are looking to add more devices to that testbed.
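For anyone wanting to experiment along these lines, a rough sketch of one
relevant knob (the values are illustrative, and note that many drivers
hardcode their own per-NAPI poll weight, so this sysctl mainly covers the
backlog/softnet path rather than every driver's NAPI instance):

```shell
# Show the current per-poll packet budget (mainline default: 64).
sysctl net.core.dev_weight

# Illustrative experiment: shrink the batch to 8 packets per poll,
# then re-measure throughput, latency, and cache behavior under load.
sudo sysctl -w net.core.dev_weight=8
```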
> > > The value is a good balance between battery saving, by preventing
> > > too many interrupts from being generated, and permitting good
> > > performance for internet-oriented devices.
> >
> > ethtool has a setting you can use for this:
> >
> > ethtool -C|--coalesce devname [adaptive-rx on|off] [adaptive-tx on|off]
> >     [rx-usecs N] [rx-frames N] [rx-usecs-irq N] [rx-frames-irq N]
> >     [tx-usecs N] [tx-frames N] [tx-usecs-irq N] [tx-frames-irq N]
> >     [stats-block-usecs N] [pkt-rate-low N] [rx-usecs-low N]
> >     [rx-frames-low N] [tx-usecs-low N] [tx-frames-low N]
> >     [pkt-rate-high N] [rx-usecs-high N] [rx-frames-high N]
> >     [tx-usecs-high N] [tx-frames-high N] [sample-interval N]
> >     [cqe-mode-rx on|off] [cqe-mode-tx on|off] [tx-aggr-max-bytes N]
> >     [tx-aggr-max-frames N] [tx-aggr-time-usecs N]
> >
> > If this is not implemented, I suggest you add support for it.
> >
> > Changing the default might cause regressions. Say there is a VoIP
> > application which wants this low latency? It would be safer to allow
> > user space to configure it as wanted.
>
> Yep, stmmac already supports it. The idea here was to not fall back to
> ethtool and instead find a good default value.
>
> Just for reference: before that one commit, the value was set to 40ms
> and nobody ever pointed out a regression with VoIP applications. With
> some testing I found 5ms to be a small increase that restores the
> original performance and should not cause any regression.

Does this driver have BQL?

> (for reference, keeping this at 1ms causes a loss of about 100-200mbps)
> (also, the tx timer implementation was created before any napi poll
> logic and before dma interrupt handling was a thing; with the latter
> change I expect this timer to be very little used in VoIP scenarios or
> similar ones with continuous traffic, as napi will take care of
> handling packets)

I would be pretty interested in a kernel flame graph of the before vs
the after.
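As a concrete illustration of the user-space route Andrew describes, and
of how one might check the BQL question, a sketch (the interface name
eth0 is a placeholder, and the driver must implement the ethtool
coalesce ops for -C to work):

```shell
IFACE=eth0  # placeholder; substitute your stmmac interface

# Read back the current coalescing parameters.
ethtool -c "$IFACE"

# Set the TX coalesce timer to 5ms (5000us) from user space, instead
# of changing the in-kernel default for everyone.
sudo ethtool -C "$IFACE" tx-usecs 5000

# Rough BQL check: if the driver calls the netdev_tx_* BQL hooks, the
# per-queue limit adapts under load; inflight is bytes currently queued.
cat /sys/class/net/"$IFACE"/queues/tx-0/byte_queue_limits/limit
cat /sys/class/net/"$IFACE"/queues/tx-0/byte_queue_limits/inflight
```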
> Aside from these reasons, I totally get the concern and am totally ok
> with this not getting applied; it was just an idea to push for a
> common value.

I try to get people to run much longer and more complicated tests, such
as the flent rrul test, to see what kind of damage bigger buffers do to
latency, as well as how other problems might show up. Really notable in
the above test series was how badly various devices behaved over time
on that workload.

Extremely notable in that test series was how badly the jetson
performed:

https://github.com/randomizedcoder/cake/blob/2023_09_02/pfifo_fast/jetson.png

And the nanopi was weird:

https://github.com/randomizedcoder/cake/blob/2023_09_02/pfifo_fast/nanopi-neo3.png

> Just preferred to handle this here instead of script+userspace :(
> (the important part is the previous patch)
>
> --
> Ansuel

--
Oct 30: https://netdevconf.info/0x17/news/the-maestro-and-the-music-bof.html
Dave Täht CSO, LibreQos
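For reference, a typical rrul invocation looks something like this
(flent and a reachable netperf server are assumed; the hostname is a
placeholder):

```shell
# 60-second RRUL run: bidirectional bulk TCP flows plus latency probes,
# saving a data file and an all_scaled summary plot for comparison of
# before/after coalesce settings.
flent rrul -l 60 -H netperf.example.com \
    -t "stmmac tx-usecs 5000" -p all_scaled -o rrul.png
```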