On Fri, Mar 17, 2023 at 8:20 AM Jakub Kicinski <kuba@xxxxxxxxxx> wrote:
>
> On Wed, 15 Mar 2023 17:20:41 +0800 Jason Xing wrote:
> > In our production environment, there are hundreds of machines that
> > often hit the old time_squeeze limit, and from that counter alone we
> > cannot tell what exactly causes the issue. Hits on the limit ranged
> > from 400 to 2000 times per second; especially when users are running
> > on a guest OS with the veth policy configured, it is relatively easy
> > to hit the limit. After several tries without this patch, I found
> > that it is only the real time_squeeze, not the budget_squeeze, that
> > hinders the receive process.
> [...]
> That is the common case, and can be understood from the napi trace

Thanks for your reply.

It happens every day on many servers.

> point and probing the kernel with bpftrace. We should only add

We can probably only deduce (or guess) which of the two causes the
latency, because trace_napi_poll() only reports the budget consumed
per poll.

Besides, tracing napi poll is fine on a testbed but not on heavily
loaded servers, where bpftrace-based tools pulling data off the hot
path can have a noticeable impact, especially on machines equipped
with high-speed NICs, say, 100G cards.

Resorting to the legacy softnet_stat file is relatively feasible,
based on my limited knowledge. Paolo also added the backlog queue
length to this file in 2020 (see commit 7d58e6555870d). I believe
that after this patch there is little or no new data that will need
to be printed there for the next few years.

> uAPI for statistics which must be maintained contiguously. For

In this patch, I didn't touch the old data, as suggested in the
previous emails, and only split the old way of counting
@time_squeeze into two parts (time_squeeze and budget_squeeze).
Having budget_squeeze available helps us profile the server and tune
it more effectively.

> investigations tracing will always be orders of magnitude more
> powerful :(
>
> On the time squeeze BTW, have you found out what the problem was?
> In workloads I've seen the time problems are often because of noise
> in how jiffies are accounted (cgroup code disables interrupts
> for long periods of time, for example, making jiffies increment
> by 2, 3 or 4 rather than by 1).

Yes! The jiffies-increment issue troubles those servers more often
than not. For a small group of servers, the budget limit is also a
problem. Sometimes we might treat the guest OS differently.

Thanks,
Jason

> > So when we encounter a related performance issue and then get lost
> > on how to tune the budget limit and the time limit in
> > net_rx_action(), we can count both of them separately to avoid the
> > confusion.
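
P.S. For anyone skimming the thread, a minimal sketch of the counter
split described above might look like the following. The helper name
and its placement are made up purely for illustration; the real
change belongs in the exit path of net_rx_action() in net/core/dev.c,
and the submitted diff may differ in detail.

	/* Illustrative only: split the single "squeeze" bump so that
	 * softnet_stat can tell "ran out of netdev_budget" apart from
	 * "ran out of netdev_budget_usecs".  The helper name is
	 * hypothetical; sd->time_squeeze exists today, while
	 * sd->budget_squeeze is the new field this series proposes.
	 */
	static void account_squeeze(struct softnet_data *sd, int budget,
				    unsigned long time_limit)
	{
		if (budget <= 0)
			sd->budget_squeeze++;	/* budget limit fired */
		if (time_after_eq(jiffies, time_limit))
			sd->time_squeeze++;	/* time limit fired */
	}

The idea is that the break condition which today does a single
sd->time_squeeze++ would instead record which limit actually fired,
so the two tuning knobs can be examined separately from softnet_stat.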