On Fri, Sep 22, 2023 at 5:39 AM Christian Marangi <ansuelsmth@xxxxxxxxx> wrote:
>
> On Fri, Sep 22, 2023 at 02:28:06PM +0200, Andrew Lunn wrote:
> > On Fri, Sep 22, 2023 at 01:12:47PM +0200, Christian Marangi wrote:
> > > Commit 8fce33317023 ("net: stmmac: Rework coalesce timer and fix
> > > multi-queue races") decreased the TX coalesce timer from 40ms to 1ms.
> > >
> > > This caused a performance regression on some targets (the regression
> > > was reported at least on ipq806x) on the order of 600mbps, dropping
> > > from gigabit handling to only 200mbps.
> > >
> > > The problem was identified as the TX timer getting armed too many
> > > times. While this was fixed and improved in another commit,
> > > performance can be improved even further by increasing the timer
> > > delay a bit, moving from 1ms to 5ms.

I am always looking for ways to improve interrupt service time, rather
than paper over the problem by increasing batchiness.

http://www.taht.net/~d/broadcom_aug9_2018.pdf

But I am also looking for hard data, particularly as to observed power
savings. How much power does upping this number save?

I have tried to question other assumptions more modern kernels are
making. In particular, I wish more folks would experiment with
decreasing the overlarge (IMHO) NAPI default of 64 packets to, say, 8
in the multi-queue case, benefiting the many arm cores still equipped
with limited cache, as well as looking at the impact of TLB flushes.
Other deferred multi-core processing that looks good on a modern xeon,
but might not be so good on a more limited arm, worries me.

Over here, an enormous test series was recently run against a bunch of
older arm64s, and it appears to indicate that memory bandwidth is a
source of problems:

https://docs.google.com/document/d/1HxIU_TEBI6xG9jRHlr8rzyyxFEN43zMcJXUFlRuhiUI/edit

We are looking to add more devices to that testbed.
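For anyone wanting to experiment along these lines, a rough sketch of one
relevant knob (the values are illustrative, and note that many drivers
hardcode their own per-NAPI poll weight, so this sysctl mainly covers the
backlog/softnet path rather than every driver's NAPI instance):

```shell
# Show the current per-poll packet budget (mainline default: 64).
sysctl net.core.dev_weight

# Illustrative experiment: shrink the batch to 8 packets per poll,
# then re-measure throughput, latency, and cache behavior under load.
sudo sysctl -w net.core.dev_weight=8
```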
> > > The value is a good balance between battery saving, by preventing
> > > too many interrupts from being generated, and permitting good
> > > performance for internet-oriented devices.
> >
> > ethtool has a setting you can use for this:
> >
> > ethtool -C|--coalesce devname [adaptive-rx on|off] [adaptive-tx on|off]
> >     [rx-usecs N] [rx-frames N] [rx-usecs-irq N] [rx-frames-irq N]
> >     [tx-usecs N] [tx-frames N] [tx-usecs-irq N] [tx-frames-irq N]
> >     [stats-block-usecs N] [pkt-rate-low N] [rx-usecs-low N]
> >     [rx-frames-low N] [tx-usecs-low N] [tx-frames-low N]
> >     [pkt-rate-high N] [rx-usecs-high N] [rx-frames-high N]
> >     [tx-usecs-high N] [tx-frames-high N] [sample-interval N]
> >     [cqe-mode-rx on|off] [cqe-mode-tx on|off] [tx-aggr-max-bytes N]
> >     [tx-aggr-max-frames N] [tx-aggr-time-usecs N]
> >
> > If this is not implemented, I suggest you add support for it.
> >
> > Changing the default might cause regressions. Say there is a VoIP
> > application which wants this low latency? It would be safer to allow
> > user space to configure it as wanted.
>
> Yep, stmmac already supports it. The idea here was to not fall back to
> ethtool and instead find a good default value.
>
> Just for reference: before that one commit, the value was set to 40ms
> and nobody ever pointed out a regression with VoIP applications. With
> some testing I found 5ms to be a small increase that restores the
> original performance and should not cause any regression.

Does this driver have BQL?

> (for reference, keeping this at 1ms causes a loss of about 100-200mbps)
> (also, the tx timer implementation was created before any napi poll
> logic and before dma interrupt handling was a thing; with the latter
> change I expect this timer to be very little used in VoIP scenarios or
> similar ones with continuous traffic, as napi will take care of
> handling packets)

I would be pretty interested in a kernel flame graph of the before vs
the after.
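As a concrete illustration of the user-space route Andrew describes, and
of how one might check the BQL question, a sketch (the interface name
eth0 is a placeholder, and the driver must implement the ethtool
coalesce ops for -C to work):

```shell
IFACE=eth0  # placeholder; substitute your stmmac interface

# Read back the current coalescing parameters.
ethtool -c "$IFACE"

# Set the TX coalesce timer to 5ms (5000us) from user space, instead
# of changing the in-kernel default for everyone.
sudo ethtool -C "$IFACE" tx-usecs 5000

# Rough BQL check: if the driver calls the netdev_tx_* BQL hooks, the
# per-queue limit adapts under load; inflight is bytes currently queued.
cat /sys/class/net/"$IFACE"/queues/tx-0/byte_queue_limits/limit
cat /sys/class/net/"$IFACE"/queues/tx-0/byte_queue_limits/inflight
```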
> Aside from these reasons, I totally get the concern and am totally ok
> with this not getting applied; it was just an idea to push for a
> common value.

I try to get people to run much longer and more complicated tests, such
as the flent rrul test, to see what kind of damage bigger buffers do to
latency, as well as how other problems might show up. Really notable in
the above test series was how badly various devices behaved over time
on that workload.

Extremely notable in that test series was how badly the jetson
performed:

https://github.com/randomizedcoder/cake/blob/2023_09_02/pfifo_fast/jetson.png

And the nanopi was weird:

https://github.com/randomizedcoder/cake/blob/2023_09_02/pfifo_fast/nanopi-neo3.png

> Just preferred to handle this here instead of script+userspace :(
> (the important part is the previous patch)
>
> --
> Ansuel

--
Oct 30: https://netdevconf.info/0x17/news/the-maestro-and-the-music-bof.html
Dave Täht CSO, LibreQos
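For reference, a typical rrul invocation looks something like this
(flent and a reachable netperf server are assumed; the hostname is a
placeholder):

```shell
# 60-second RRUL run: bidirectional bulk TCP flows plus latency probes,
# saving a data file and an all_scaled summary plot for comparison of
# before/after coalesce settings.
flent rrul -l 60 -H netperf.example.com \
    -t "stmmac tx-usecs 5000" -p all_scaled -o rrul.png
```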