Re: [RFT PATCH 04/13] kprobes: Make optimizer delay to 1 second

"Paul E. McKenney" <paulmck@xxxxxxxxxx> · Thu, 23 Jan 2020 02:03:18 -0800

On Thu, Jan 23, 2020 at 03:13:28PM +0900, Masami Hiramatsu wrote:
> On Wed, 22 Jan 2020 18:26:47 -0800
> "Paul E. McKenney" <paulmck@xxxxxxxxxx> wrote:
> 
> > > Anyway, without this update, I added a printk to count optimizer
> > > queue-length and found that all optimizer call with a single kprobe
> > > on the quenes. I think this depends on the machine, but as far as
> > > I tested on 8-threads qemu x86, shows this result...
> > > 
> > > Probes: 256 kprobe_events
> > > Enable events
> > > real	0m 0.02s
> > > user	0m 0.00s
> > > sys	0m 0.02s
> > > [   17.730548] Queue-update: 180, skipped, Total Queued: 180
> > > [   17.739445] Queue-update: 1, go, Total Queued: 180
> > > Disable events
> [...]
> > > [   41.135594] Queue-update: 1, go, Total Queued: 1
> > > real	0m 21.40s
> > > user	0m 0.00s
> > > sys	0m 0.04s
> > 
> > So 21.4s/256 = 84 milliseconds per event disable, correct?
> 
> Actually, it seems only 172 probes are on the unoptimized list, so
> the number will be a bit different.
> 
> Anyway, that above elapsed time is including non-batch optimizer
> working time as below.
> 
> (1) start looping on probe events
>   (2) disabling-kprobe
>     (2.1) wait kprobe_mutex if optimizer is running
>     (2.2) if the kprobe is on optimized kprobe, queue it to unoptimizing
>           list and kick optimizer with 5 jiffies delay
>   (4) unlink enabled event
>   (5) wait synchronize_rcu()
>   (6) optimizer start optimization before finishing (5)
> (7) goto (1)
> 
> I think the disabling performance issue came from (6) (and (2.1)).
> Thus, if we change (2.2) to 1 HZ jiffies, the optimizer will start
> after some loops are done. (and the optimizer detects "active"
> queuing, postpone the process)
> 
> > 
> > It might be worth trying synchronize_rcu_expedited() just as an experiment
> > to see if it speeds things up significantly.
> 
> Would you mean replacing synchronize_rcu() in disabling loop, or
> replacing synchronize_rcu_tasks() in optimizer?

The former.  (There is no synchronize_rcu_tasks_expedited().)

> I think that is not a root cause of this behavior, since if we
> make the optimizer delay to 1 sec, it seems enough for making
> it a batch operation. See below, this is the result with patched
> kernel (1 HZ delay).
> 
> Probes: 256 kprobe_events
> Enable events
> real	0m 0.07s
> user	0m 0.00s
> sys	0m 0.07s
> [   19.191181] Queue-update: 180, skipped, Total Queued: 180
> Disable events
> [   20.214966] Queue-update: 1, go, Total Queued: 172
> [   21.302924] Queue-update: 86, skipped, Total Queued: 86
> real	0m 2.11s
> user	0m 0.00s
> sys	0m 0.03s
> [   22.327173] Queue-update: 87, skipped, Total Queued: 172
> [   23.350933] Queue-update: 1, go, Total Queued: 172
> Remove events
> real	0m 2.13s
> user	0m 0.02s
> sys	0m 0.02s
> 
> As you can see, the optimizer ran outside of the disabling loop.
> In that case, it is OK to synchronize RCU tasks in the optimizer
> because it just runs *once* per multiple probe events.

Even better!

> >From above result, 86 probes are disabled per 1 sec delay.
> Each probe disabling took 11-12 msec in average. So 
> (HZ / 10) can also be good. (But note that the optimizer
> will retry to run each time until the disabling loop is
> finished.)
> 
> BTW, testing kernel was build with HZ=1000, if HZ=250 or HZ=100,
> the result will be different in the current code. So I think
> we should use HZ-based delay instead of fixed number.

Agreed, the HZ-based delay seems worth a try.

							Thanx, Paul