On Thu, Jan 23, 2020 at 03:13:28PM +0900, Masami Hiramatsu wrote: > On Wed, 22 Jan 2020 18:26:47 -0800 > "Paul E. McKenney" <paulmck@xxxxxxxxxx> wrote: > > > > Anyway, without this update, I added a printk to count optimizer > > > queue-length and found that all optimizer call with a single kprobe > > > on the quenes. I think this depends on the machine, but as far as > > > I tested on 8-threads qemu x86, shows this result... > > > > > > Probes: 256 kprobe_events > > > Enable events > > > real 0m 0.02s > > > user 0m 0.00s > > > sys 0m 0.02s > > > [ 17.730548] Queue-update: 180, skipped, Total Queued: 180 > > > [ 17.739445] Queue-update: 1, go, Total Queued: 180 > > > Disable events > [...] > > > [ 41.135594] Queue-update: 1, go, Total Queued: 1 > > > real 0m 21.40s > > > user 0m 0.00s > > > sys 0m 0.04s > > > > So 21.4s/256 = 84 milliseconds per event disable, correct? > > Actually, it seems only 172 probes are on the unoptimized list, so > the number will be a bit different. > > Anyway, that above elapsed time is including non-batch optimizer > working time as below. > > (1) start looping on probe events > (2) disabling-kprobe > (2.1) wait kprobe_mutex if optimizer is running > (2.2) if the kprobe is on optimized kprobe, queue it to unoptimizing > list and kick optimizer with 5 jiffies delay > (4) unlink enabled event > (5) wait synchronize_rcu() > (6) optimizer start optimization before finishing (5) > (7) goto (1) > > I think the disabling performance issue came from (6) (and (2.1)). > Thus, if we change (2.2) to 1 HZ jiffies, the optimizer will start > after some loops are done. (and the optimizer detects "active" > queuing, postpone the process) > > > > > It might be worth trying synchronize_rcu_expedited() just as an experiment > > to see if it speeds things up significantly. > > Would you mean replacing synchronize_rcu() in disabling loop, or > replacing synchronize_rcu_tasks() in optimizer? The former. (There is no synchronize_rcu_tasks_expedited().) > I think that is not a root cause of this behavior, since if we > make the optimizer delay to 1 sec, it seems enough for making > it a batch operation. See below, this is the result with patched > kernel (1 HZ delay). > > Probes: 256 kprobe_events > Enable events > real 0m 0.07s > user 0m 0.00s > sys 0m 0.07s > [ 19.191181] Queue-update: 180, skipped, Total Queued: 180 > Disable events > [ 20.214966] Queue-update: 1, go, Total Queued: 172 > [ 21.302924] Queue-update: 86, skipped, Total Queued: 86 > real 0m 2.11s > user 0m 0.00s > sys 0m 0.03s > [ 22.327173] Queue-update: 87, skipped, Total Queued: 172 > [ 23.350933] Queue-update: 1, go, Total Queued: 172 > Remove events > real 0m 2.13s > user 0m 0.02s > sys 0m 0.02s > > As you can see, the optimizer ran outside of the disabling loop. > In that case, it is OK to synchronize RCU tasks in the optimizer > because it just runs *once* per multiple probe events. Even better! > >From above result, 86 probes are disabled per 1 sec delay. > Each probe disabling took 11-12 msec in average. So > (HZ / 10) can also be good. (But note that the optimizer > will retry to run each time until the disabling loop is > finished.) > > BTW, testing kernel was build with HZ=1000, if HZ=250 or HZ=100, > the result will be different in the current code. So I think > we should use HZ-based delay instead of fixed number. Agreed, the HZ-based delay seems worth a try. Thanx, Paul