Re: [RFT PATCH 04/13] kprobes: Make optimizer delay to 1 second

Masami Hiramatsu <mhiramat@xxxxxxxxxx> · Thu, 23 Jan 2020 15:13:28 +0900

On Wed, 22 Jan 2020 18:26:47 -0800
"Paul E. McKenney" <paulmck@xxxxxxxxxx> wrote:

> > Anyway, without this update, I added a printk to count optimizer
> > queue-length and found that all optimizer call with a single kprobe
> > on the quenes. I think this depends on the machine, but as far as
> > I tested on 8-threads qemu x86, shows this result...
> > 
> > Probes: 256 kprobe_events
> > Enable events
> > real	0m 0.02s
> > user	0m 0.00s
> > sys	0m 0.02s
> > [   17.730548] Queue-update: 180, skipped, Total Queued: 180
> > [   17.739445] Queue-update: 1, go, Total Queued: 180
> > Disable events
[...]
> > [   41.135594] Queue-update: 1, go, Total Queued: 1
> > real	0m 21.40s
> > user	0m 0.00s
> > sys	0m 0.04s
> 
> So 21.4s/256 = 84 milliseconds per event disable, correct?

Actually, it seems only 172 probes are on the unoptimized list, so
the number will be a bit different.

Anyway, that above elapsed time is including non-batch optimizer
working time as below.

(1) start looping on probe events
  (2) disabling-kprobe
    (2.1) wait kprobe_mutex if optimizer is running
    (2.2) if the kprobe is on optimized kprobe, queue it to unoptimizing
          list and kick optimizer with 5 jiffies delay
  (4) unlink enabled event
  (5) wait synchronize_rcu()
  (6) optimizer start optimization before finishing (5)
(7) goto (1)

I think the disabling performance issue came from (6) (and (2.1)).
Thus, if we change (2.2) to 1 HZ jiffies, the optimizer will start
after some loops are done. (and the optimizer detects "active"
queuing, postpone the process)

> 
> It might be worth trying synchronize_rcu_expedited() just as an experiment
> to see if it speeds things up significantly.

Would you mean replacing synchronize_rcu() in disabling loop, or
replacing synchronize_rcu_tasks() in optimizer?

I think that is not a root cause of this behavior, since if we
make the optimizer delay to 1 sec, it seems enough for making
it a batch operation. See below, this is the result with patched
kernel (1 HZ delay).

Probes: 256 kprobe_events
Enable events
real	0m 0.07s
user	0m 0.00s
sys	0m 0.07s
[   19.191181] Queue-update: 180, skipped, Total Queued: 180
Disable events
[   20.214966] Queue-update: 1, go, Total Queued: 172
[   21.302924] Queue-update: 86, skipped, Total Queued: 86
real	0m 2.11s
user	0m 0.00s
sys	0m 0.03s
[   22.327173] Queue-update: 87, skipped, Total Queued: 172
[   23.350933] Queue-update: 1, go, Total Queued: 172
Remove events
real	0m 2.13s
user	0m 0.02s
sys	0m 0.02s

As you can see, the optimizer ran outside of the disabling loop.
In that case, it is OK to synchronize RCU tasks in the optimizer
because it just runs *once* per multiple probe events.

>From above result, 86 probes are disabled per 1 sec delay.
Each probe disabling took 11-12 msec in average. So 
(HZ / 10) can also be good. (But note that the optimizer
will retry to run each time until the disabling loop is
finished.)

BTW, testing kernel was build with HZ=1000, if HZ=250 or HZ=100,
the result will be different in the current code. So I think
we should use HZ-based delay instead of fixed number.

Thank you,

-- 
Masami Hiramatsu <mhiramat@xxxxxxxxxx>