On Wed, Feb 07, 2024 at 10:48:10AM -0500, Joel Fernandes wrote:
> On 2/6/2024 12:51 PM, Andrea Righi wrote:
> > - stress-ng --matrix seems quite unpredictable to be used as a
> >   benchmark in this scenario (the bogo-ops/s are very susceptible to
> >   any kind of interference, so even if in the long runs NO_HZ_FULL
> >   still seems to provide some benefits looking at the average, we
> >   also need to consider that there might be a significant error in
> >   the measurements; the standard deviation was pretty high)
>
> Ack on the bogo-ops disclaimers as also mentioned in the stress-ng
> docs. Agreed, a better metric for perf is helpful.
>
> I am assuming you also have RCU_NOCB enabled for this test?

Yep, booting with rcu_nocbs=all.

> > - fio doing short writes (in page cache) seems to perform like 2x
> >   better in terms of iops with nohz_full, with respect to the other
> >   cases, and it performs 2x slower with large IO writes (not sure
> >   why... need to investigate more)
>
> This is interesting, it could be worth counting how many kernel
> entries/exits occur for large IO vs small IO. I'd imagine for large IO
> we have fewer syscalls and hence lower entry/exit overhead. But if
> there are more interrupts for whatever reason with large IO, then that
> also implies more kernel entries/exits. As Frederic was saying,
> NOHZ_FULL has higher overhead on kernel entry/exit.

Good point. I was expecting to see reduced performance due to the
kernel entry/exit overhead, assuming we had more syscalls in the
short-writes case; that's why I was confused. But, as you correctly
pointed out, we may actually have more entry/exit events due to
interrupts. I'll repeat the test counting the kernel entries/exits to
figure out what's happening exactly, thanks for the hint!

> > - with lazy RCU enabled hrtimer_interrupt() takes like 2x more to
> >   return, with respect to the other cases (is this expected?)
>
> It depends on which hrtimer_interrupt() instance?
> There must be several in the trace due to unrelated timers. Are you
> saying the worst case, or is it always 2x more? We do queue a timer
> for lazy RCU to flush the RCU work, but it is set to 10 seconds and
> should be canceled most of the time (it's just a guard rail). It is
> possible there is lock contention on ->nocb_gp_lock which is causing
> the timer handler execution to be slow. We have several
> trace_rcu_nocb* trace points, including for the timer. Perhaps you
> could enable those and we dig deeper?

So, my plan was to (try to) measure the potential overhead that we may
introduce by enabling the new config options in the tick handler,
therefore I tried to trace the time distribution of the different
hrtimer_interrupt() invocations, using this:

---
#!/usr/bin/bpftrace

kprobe:hrtimer_interrupt
{
        @start[tid] = nsecs;
}

kretprobe:hrtimer_interrupt
/@start[tid]/
{
        @elapsed = hist(nsecs - @start[tid]);
        delete(@start[tid]);
}
---

The time distribution is always ~2x with lazy RCU enabled, but
honestly I don't know if this metric is really relevant. In terms of
performance, it is probably more relevant that we move the RCU
callbacks from softirq to kthread context and process them in batches;
that surely has a bigger impact than any overhead we may add to the
tick handler. But I'll try to enable the trace_rcu_nocb* tracepoints
and see if we can get some interesting details about what is
happening.

> Further, it is interesting to see if it is only the
> hrtimer_interrupt() instance that also results in a call to
> do_nocb_deferred_wakeup_timer() via, say, function tracing. That will
> confirm that it is the lazy timer that is slow for you.

OK.

> The actual number of callbacks should not be causing specifically the
> hrtimer_interrupt() to take too long to run, AFAICS. But RCU's lazy
> feature does increase the number of timer interrupts.
>
> Further still, it depends on how much hrtimer_interrupt() takes with
> lazy RCU to call it a problem IMO. Some numbers with units will be
> nice.

This is what I see (this is a single run, but the other runs are
similar); the unit is nanoseconds. With lazy RCU enabled,
hrtimer_interrupt() takes around 4K-16K ns; with lazy RCU off it takes
2K-4K ns most of the time:

- lazy rcu off:

[1K, 2K)     88307 |@@@@@@@@@@@@                                        |
[2K, 4K)    380695 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4K, 8K)       194 |                                                    |

- lazy rcu on:

[2K, 4K)      3094 |                                                    |
[4K, 8K)    265763 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K)   182341 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                 |
[16K, 32K)    3422 |                                                    |

Again, I'm not sure if this is really a problem, or if it is even a
relevant metric for the overall performance; I was just curious to
understand why it is different.

Thanks,
-Andrea