On 15 Nov 13:50, Yafang Shao wrote:
On Fri, Nov 15, 2024 at 12:32 PM Jakub Kicinski <kuba@xxxxxxxxxx> wrote:
On Fri, 15 Nov 2024 11:56:38 +0800 Yafang Shao wrote:
> > On Thu, 14 Nov 2024 10:17:11 +0800 Yafang Shao wrote:
> > > - * Not recommended for use in drivers for high speed interfaces.
> >
> > I thought I suggested we provide clear guidance on this counter being
> > related to processing pipeline being to slow, vs host backpressure.
> > Just deleting the line that says "don't use" is not going to cut it :|
>
> Hello Jakub,
>
> After investigating other network drivers, I found that they all
> report this metric to rx_missed_errors:
>
> - i40e
> The corresponding ethtool metric is port.rx_discards, which was
> mapped to rx_missed_errors in commit 5337d2949733 ("i40e: Add
> rx_missed_errors for buffer exhaustion").
>
> - broadcom
> The equivalent metric is rx_total_discard_pkts, reported as
> rx_missed_errors in commit c0c050c58d84 ("bnxt_en: New Broadcom
> ethernet driver")
>
> Given this, it seems we should align with the standard practice and
> report this metric to rx_missed_errors.
>
> Tariq, what are your thoughts?
mlx5 already reports rx_missed_errors and AFAIU rx_discards_phy are very
different kind of drops than the drops reported as 'missed'.
The distinction is useful in production in my experience working with
mlx5 devices.
Yes rx_missed_errors is lack of software descriptors, please don't mix it
with hardware pipeline FIFO discards.
FYI: mlx5 reports more discards related to pipeline see below,
especially for per PF/VF buffers. When these are advancing, usually they
indicate congestion control issues, for example pause frames is off.
rx_prio[p]_buf_discard
The number of packets discarded by device due to lack of per host receive buffers.
rx_prio[p]_cong_discard
The number of packets discarded by device due to per host congestion.
rx_prio[p]_discard (rx_discard_phy is the sum of all rx_prio[p]_discard)
The number of packets discarded by device due to lack of receive buffers.
That being said, these are not errors, so reporting them via rx_xyz_error
is very misleading, rx_missed_errors is a special case though, and let's
keep it that way.
From the manual [0], it says :
The number of received packets dropped due to lack of buffers on a
physical port. If this counter is increasing, it implies that the
adapter is congested and cannot absorb the traffic coming from the
network.
Would it be possible to add this description to if_link.h?
Frankly, it doesn’t make much difference to end users like me whether
this is reported to rx_missed_errors or rx_fifo_errors; the main goal
is simply to monitor this metric to flag any issues...
not rx_missed_errors please, it is exclusive for software lack of buffers.
Please have a look at thtool_eth_XXX_stats IEEE ethnl_stats, if you need to
extend, this is the place.
RFC2863[1] defines this type of discards as ifInDiscards. So let's add
it to ehttool std stats. mlx5 reports most of them already to driver custom
ethtool -S
[1] https://datatracker.ietf.org/doc/html/rfc2863
- Saeed
[0]. https://enterprise-support.nvidia.com/s/article/understanding-mlx5-ethtool-counters
--
Regards
Yafang