On Mon, Mar 20, 2023 at 9:30 PM Jesper Dangaard Brouer <jbrouer@xxxxxxxxxx> wrote:
>
>
> On 17/03/2023 05.11, Jason Xing wrote:
> > On Fri, Mar 17, 2023 at 11:26 AM Jakub Kicinski <kuba@xxxxxxxxxx> wrote:
> >>
> >> On Fri, 17 Mar 2023 10:27:11 +0800 Jason Xing wrote:
> >>>> That is the common case, and can be understood from the napi trace
> >>>
> >>> Thanks for your reply. It is commonly happening every day on many servers.
> >>
> >> Right but the common issue is the time squeeze, not budget squeeze,
> >
> > Most of them are about time, so yes.
> >
> >> and either way the budget squeeze doesn't really matter because
> >> the softirq loop will call us again soon, if softirq itself is
> >> not scheduled out.
> >>
> [...]
> I agree, the budget squeeze count doesn't provide much value, as it
> doesn't indicate something critical (the softirq loop will call us again
> soon). The time squeeze event is more critical and something that is
> worth monitoring.
>
> I see value in this patch, because it makes it possible to monitor the
> time squeeze events. Currently the counter is "polluted" by the budget
> squeeze, making it impossible to get a proper time squeeze signal.
> Thus, I see this patch as a fix to an old problem.
>
> Acked-by: Jesper Dangaard Brouer <brouer@xxxxxxxxxx>

Thanks for your acknowledgement. As you said, I didn't add any functional
or performance-related change; I only made a small change to the previous
output so that it is more accurate. Even though I would like to get it
merged, I have to drop this patch and resend another one. If the
maintainers really think it matters, I hope it will be picked up
someday :)

> That said (see below), besides monitoring the time squeeze counter, I
> recommend adding some BPF monitoring to capture latency issues...
>
> >> So if you want to monitor a meaningful event in your fleet, I think
> >> a better event to monitor is the number of times ksoftirqd was woken
> >> up and the latency of it getting onto the CPU.
> >
> > It's a good point. Thanks for your advice.
>
> I'm willing to help you out writing a BPF-based tool that can help you
> identify the issue Jakub describes above: high latency from when a
> softIRQ is raised until softIRQ processing runs on the CPU.
>
> I have this bpftrace script [1] available that does just that:
>
> [1]
> https://github.com/xdp-project/xdp-project/blob/master/areas/latency/softirq_net_latency.bt
>

A few days ago, I did the same thing with bcc tools to handle those
complicated issues. Your bt script looks much more concise. Thanks anyway.

> Perhaps you can take the latency histograms and then plot a heatmap [2]
> in your monitoring platform.
>
> [2] https://www.brendangregg.com/heatmaps.html
>
> --Jesper
>