On Wed, 22 Jan 2020 18:26:47 -0800 "Paul E. McKenney" <paulmck@xxxxxxxxxx> wrote: > > Anyway, without this update, I added a printk to count optimizer > > queue-length and found that all optimizer call with a single kprobe > > on the quenes. I think this depends on the machine, but as far as > > I tested on 8-threads qemu x86, shows this result... > > > > Probes: 256 kprobe_events > > Enable events > > real 0m 0.02s > > user 0m 0.00s > > sys 0m 0.02s > > [ 17.730548] Queue-update: 180, skipped, Total Queued: 180 > > [ 17.739445] Queue-update: 1, go, Total Queued: 180 > > Disable events [...] > > [ 41.135594] Queue-update: 1, go, Total Queued: 1 > > real 0m 21.40s > > user 0m 0.00s > > sys 0m 0.04s > > So 21.4s/256 = 84 milliseconds per event disable, correct? Actually, it seems only 172 probes are on the unoptimized list, so the number will be a bit different. Anyway, that above elapsed time is including non-batch optimizer working time as below. (1) start looping on probe events (2) disabling-kprobe (2.1) wait kprobe_mutex if optimizer is running (2.2) if the kprobe is on optimized kprobe, queue it to unoptimizing list and kick optimizer with 5 jiffies delay (4) unlink enabled event (5) wait synchronize_rcu() (6) optimizer start optimization before finishing (5) (7) goto (1) I think the disabling performance issue came from (6) (and (2.1)). Thus, if we change (2.2) to 1 HZ jiffies, the optimizer will start after some loops are done. (and the optimizer detects "active" queuing, postpone the process) > > It might be worth trying synchronize_rcu_expedited() just as an experiment > to see if it speeds things up significantly. Would you mean replacing synchronize_rcu() in disabling loop, or replacing synchronize_rcu_tasks() in optimizer? I think that is not a root cause of this behavior, since if we make the optimizer delay to 1 sec, it seems enough for making it a batch operation. See below, this is the result with patched kernel (1 HZ delay). Probes: 256 kprobe_events Enable events real 0m 0.07s user 0m 0.00s sys 0m 0.07s [ 19.191181] Queue-update: 180, skipped, Total Queued: 180 Disable events [ 20.214966] Queue-update: 1, go, Total Queued: 172 [ 21.302924] Queue-update: 86, skipped, Total Queued: 86 real 0m 2.11s user 0m 0.00s sys 0m 0.03s [ 22.327173] Queue-update: 87, skipped, Total Queued: 172 [ 23.350933] Queue-update: 1, go, Total Queued: 172 Remove events real 0m 2.13s user 0m 0.02s sys 0m 0.02s As you can see, the optimizer ran outside of the disabling loop. In that case, it is OK to synchronize RCU tasks in the optimizer because it just runs *once* per multiple probe events. >From above result, 86 probes are disabled per 1 sec delay. Each probe disabling took 11-12 msec in average. So (HZ / 10) can also be good. (But note that the optimizer will retry to run each time until the disabling loop is finished.) BTW, testing kernel was build with HZ=1000, if HZ=250 or HZ=100, the result will be different in the current code. So I think we should use HZ-based delay instead of fixed number. Thank you, -- Masami Hiramatsu <mhiramat@xxxxxxxxxx>