On Wed, May 08, 2024 at 05:56:38PM -0700, Jakub Kicinski wrote:
> On Wed, 8 May 2024 16:24:08 -0700 Joe Damato wrote:
> > > A possible reason for this difference is the queues included in the sum.
> > > Our stats are persistent across configuration changes, so they don't reset
> > > when the number of channels changes, for example.
> > >
> > > We keep stats entries for all ring indices that ever existed. Our driver
> > > loops and sums up the stats for all of them, while the stack loops only up
> > > to the current netdev->real_num_rx_queues.
> > >
> > > Can this explain the diff here?
> >
> > Yes, that was it. Sorry I didn't realize this case. My lab machine runs a
> > script to adjust the queue count shortly after booting.
> >
> > I disabled that and re-ran:
> >
> >   NETIF=eth0 tools/testing/selftests/drivers/net/stats.py
> >
> > and all tests pass.
>
> Stating the obvious, perhaps, but in this case we should add the stats
> from inactive queues to the base (which when the NIC is down means all
> queues).

If I'm following that right and understanding mlx5 (two things I am
unlikely to do simultaneously), that sounds to me like:

  - mlx5e_get_queue_stats_rx and mlx5e_get_queue_stats_tx check if
    i < priv->channels.params.num_channels (instead of priv->stats_nch),
    and when summing mlx5e_sq_stats in the latter function, the loop runs
    up to priv->channels.params.mqprio.num_tc instead of
    priv->max_opened_tc.

  - mlx5e_get_base_stats accumulates and outputs stats for everything from
    priv->channels.params.num_channels to priv->stats_nch, and from
    priv->channels.params.mqprio.num_tc to priv->max_opened_tc... which
    should cover the inactive queues, I think.

Just writing that all out to avoid hacking up the wrong thing for the v2 and
to reduce overall noise on the list :)
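
For the RX side, the split described above could be modeled roughly as
follows. This is only a userspace sketch, not driver code: struct
model_priv, struct ring_stats, MAX_CH and the model_* helpers are all
hypothetical names invented for illustration, with active_nch standing in
for priv->channels.params.num_channels and stats_nch for priv->stats_nch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CH 8 /* hypothetical upper bound on ring indices */

/* Hypothetical stand-in for the driver's persistent per-ring counters. */
struct ring_stats {
	uint64_t packets;
	uint64_t bytes;
};

struct model_priv {
	struct ring_stats rx[MAX_CH]; /* entries survive channel-count changes */
	size_t stats_nch;             /* number of ring indices that ever existed */
	size_t active_nch;            /* currently configured channels */
};

/* Per-queue callback: only report queues that are active right now.
 * Returns 0 on success, -1 if index i is not an active queue. */
static int model_get_queue_stats_rx(const struct model_priv *p, size_t i,
				    struct ring_stats *out)
{
	if (i >= p->active_nch)
		return -1;
	*out = p->rx[i];
	return 0;
}

/* Base callback: sum every ring that existed once but is inactive now.
 * With active_nch == 0 (NIC down) this covers all queues. */
static void model_get_base_stats_rx(const struct model_priv *p,
				    struct ring_stats *out)
{
	assert(p->active_nch <= p->stats_nch && p->stats_nch <= MAX_CH);

	out->packets = 0;
	out->bytes = 0;
	for (size_t i = p->active_nch; i < p->stats_nch; i++) {
		out->packets += p->rx[i].packets;
		out->bytes += p->rx[i].bytes;
	}
}

/* What a consumer would compute: base plus all active per-queue stats. */
static struct ring_stats model_total_rx(const struct model_priv *p)
{
	struct ring_stats total, q;

	model_get_base_stats_rx(p, &total);
	for (size_t i = 0; i < p->active_nch; i++) {
		if (model_get_queue_stats_rx(p, i, &q) == 0) {
			total.packets += q.packets;
			total.bytes += q.bytes;
		}
	}
	return total;
}
```

With this split, base plus the active per-queue stats always adds up to the
sum over every ring that ever existed, no matter how the channel count
changed in between. The TX side would additionally have to cover the TCs
between num_tc and max_opened_tc, which this sketch leaves out.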