On Mon, Mar 20, 2023 at 9:30 PM Jesper Dangaard Brouer <jbrouer@xxxxxxxxxx> wrote:
>
>
> On 17/03/2023 05.11, Jason Xing wrote:
> > On Fri, Mar 17, 2023 at 11:26 AM Jakub Kicinski <kuba@xxxxxxxxxx> wrote:
> >>
> >> On Fri, 17 Mar 2023 10:27:11 +0800 Jason Xing wrote:
> >>>> That is the common case, and can be understood from the napi trace
> >>>
> >>> Thanks for your reply. It is commonly happening every day on many servers.
> >>
> >> Right but the common issue is the time squeeze, not budget squeeze,
> >
> > Most of them are about time, so yes.
> >
> >> and either way the budget squeeze doesn't really matter because
> >> the softirq loop will call us again soon, if softirq itself is
> >> not scheduled out.
> >>
> [...]
> I agree, the budget squeeze count doesn't provide much value, as it
> doesn't indicate something critical (the softirq loop will call us again
> soon). The time squeeze event is more critical and something that is
> worth monitoring.
>
> I see value in this patch, because it makes it possible to monitor the
> time squeeze events. Currently the counter is "polluted" by the budget
> squeeze, making it impossible to get a proper time squeeze signal.
> Thus, I see this patch as a fix to an old problem.
>
> Acked-by: Jesper Dangaard Brouer <brouer@xxxxxxxxxx>

Thanks for your acknowledgement. As you said, I didn't add any functional
or performance-related change; I only made a small change to the previous
output so that it is more accurate. Even though I would like to get it
merged, I have to drop this patch and resend another one. If the
maintainers really think it matters, I hope it will be picked up
someday :)

> That said (see below), besides monitoring the time squeeze counter, I
> recommend adding some BPF monitoring to capture latency issues...
>
> >> So if you want to monitor a meaningful event in your fleet, I think
> >> a better event to monitor is the number of times ksoftirqd was woken
> >> up and the latency of it getting onto the CPU.
> >
> > It's a good point. Thanks for your advice.
>
> I'm willing to help you out writing a BPF-based tool that can help you
> identify the issue Jakub describes above: high latency from when a
> softIRQ is raised until softIRQ processing runs on the CPU.
>
> I have this bpftrace script [1] available that does just that:
>
> [1]
> https://github.com/xdp-project/xdp-project/blob/master/areas/latency/softirq_net_latency.bt
>

A few days ago, I did the same thing with bcc tools to handle those
complicated issues. Your bt script looks much more concise. Thanks anyway.

> Perhaps you can take the latency histograms and then plot a heatmap [2]
> in your monitoring platform.
>
> [2] https://www.brendangregg.com/heatmaps.html
>
> --Jesper
>