On Fri, 27 Nov 2020 23:23:42 +0200 Vladimir Oltean wrote:
> On Fri, Nov 27, 2020 at 01:13:46PM -0800, Jakub Kicinski wrote:
> > On Fri, 27 Nov 2020 21:47:14 +0100 Andrew Lunn wrote:
> > > > Is the periodic refresh really that awful? We're mostly talking error
> > > > counters here so every second or every few seconds should be perfectly
> > > > fine.
> > >
> > > Humm, i would prefer error counts to be more correct than anything
> > > else. When debugging issues, you generally don't care how many packets
> > > worked. It is how many failed you are interesting, and how that number
> > > of failures increases.
> >
> > Right, but not sure I'd use the word "correct". Perhaps "immediately up
> > to date"?
> >
> > High speed NICs usually go through a layer of firmware before they
> > access the stats, IOW even if we always synchronously ask for the stats
> > in the kernel - in practice a lot of NICs (most?) will return some form
> > of cached information.
> >
> > > So long as these counters are still in ethtool -S, i guess it does not
> > > matter. That i do trust to be accurate, and probably consistent across
> > > the counters it returns.
> >
> > Not in the NIC designs I'm familiar with.
> >
> > But anyway - this only matters in some strict testing harness, right?
> > Normal users will look at a stats after they noticed issues (so minutes
> > / hours later) or at the very best they'll look at a graph, which will
> > hardly require <1sec accuracy to when error occurred.
>
> Either way, can we conclude that ndo_get_stats64 is not a replacement
> for ethtool -S, since the latter is blocking and, if implemented correctly,
> can return the counters at the time of the call (therefore making sure
> that anything that happened before the syscall has been accounted into
> the retrieved values), and the former isn't?

ethtool -S stats are not 100% up to date. Not on Netronome, Intel,
Broadcom or Mellanox NICs AFAIK.

> The whole discussion started because you said we shouldn't expose some
> statistics counters in ethtool as long as they have a standardized
> equivalent. Well, I think we still should.

Users must have access to stats via standard Linux interfaces with
well-defined semantics. We cannot continue to live in a world where
the user has to guess the driver-specific name for ethtool -S to find
out the number of CRC errors...

I know it may not matter to a driver developer, and it didn't matter
much to me when I was one, because in my drivers they always had the
same name. But trying to monitor a fleet of hardware from multiple
vendors is very painful with the status quo; we must do better. We
can't have users scrape through what is basically a debug interface to
get to vital information.

I'd really love to find a way out of the procfs issue, but I'm not
sure if there is one.
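
To make the fleet-monitoring pain concrete, here is a rough sketch (not
taken from any real tool) of the two ways a monitoring script can get
at CRC errors. The names in CANDIDATE_NAMES are hypothetical, made up
for illustration; real drivers each pick their own spelling, which is
the point. The sysfs path assumes the driver fills in rx_crc_errors in
its standard netdev stats.

```python
from pathlib import Path

# Hypothetical driver-specific spellings of "CRC errors" (illustration only).
# A real tool would need one entry per driver it has ever encountered.
CANDIDATE_NAMES = ("rx_crc_errors", "rx_fcs_errors", "crc_err")


def parse_ethtool_stats(text):
    """Parse `ethtool -S`-style output ("name: value" lines) into a dict."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        value = value.strip()
        # The "NIC statistics:" header has no value and is skipped here.
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def guess_crc_errors(stats):
    """Guess the CRC error count by trying each candidate name in turn."""
    for name in CANDIDATE_NAMES:
        if name in stats:
            return stats[name]
    return None  # unknown naming scheme: the monitoring tool is stuck


def read_standard_crc_errors(ifname, root="/sys/class/net"):
    """Read the standardized counter; the path is the same for every driver."""
    path = Path(root) / ifname / "statistics" / "rx_crc_errors"
    return int(path.read_text())
```

With the first approach every new vendor means another guess added to
the list; with the second the name and its meaning are fixed by the
kernel, whatever the hardware underneath.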