Re: slow sync rcu_tasks_trace

On Wed, Sep 09, 2020 at 10:35:12AM -0700, Paul E. McKenney wrote:
> On Wed, Sep 09, 2020 at 10:12:28AM -0700, Alexei Starovoitov wrote:
> > On Wed, Sep 09, 2020 at 04:38:58AM -0700, Paul E. McKenney wrote:
> > > On Tue, Sep 08, 2020 at 07:34:20PM -0700, Alexei Starovoitov wrote:
> > > > Hi Paul,
> > > > 
> > > > Looks like sync rcu_tasks_trace got slower or we simply didn't notice
> > > > it earlier.
> > > > 
> > > > In selftests/bpf try:
> > > > time ./test_progs -t trampoline_count
> > > > #101 trampoline_count:OK
> > > > Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
> > > > 
> > > > real    1m17.082s
> > > > user    0m0.145s
> > > > sys    0m1.369s
> > > > 
> > > > so it's really something going on with sync rcu_tasks_trace.
> > > > Could you please take a look?
> > > 
> > > I am guessing that your .config has CONFIG_TASKS_TRACE_RCU_READ_MB=n.
> > > If I am wrong, please try CONFIG_TASKS_TRACE_RCU_READ_MB=y.
> > 
> > I've added
> > CONFIG_RCU_EXPERT=y
> > CONFIG_TASKS_TRACE_RCU_READ_MB=y
> > 
> > and it helped:
> > 
> > time ./test_progs -t trampoline_count
> > #101 trampoline_count:OK
> > Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
> > 
> > real	0m8.924s
> > user	0m0.138s
> > sys	0m1.408s
> > 
> > But this is still bad. It's 4 times slower than rcu_tasks
> > and isn't really usable for bpf, since it adds memory barriers exactly
> > where we need them removed.
> > 
> > In the default configuration rcu_tasks_trace is 40 (!) times slower than rcu_tasks.
> > This huge difference in sync times concerns me a lot.
> > If bpf has to use memory barriers in rcu_read_lock_trace
> > and is still 4 times slower than rcu_tasks in the best case,
> > then there is not much point in rcu_tasks_trace.
> > Converting everything to srcu would be better, but I really hope
> > you can find a solution to this tasks_trace issue.
> > 
> > > Otherwise (or alternatively), could you please try booting with
> > > rcupdate.rcu_task_ipi_delay=50?  The default value is 500, or half a
> > > second on a HZ=1000 system, which on a busy system could easily result
> > > in the grace-period delays that you are seeing.  The value of this
> > > kernel boot parameter interacts with the tasklist-scan backoffs,
> > > so its effect will likely not be linear.
> > 
> > The tests were run on freshly booted VM with 4 cpus. The VM is idle.
> > The host is idle too.
> > 
> > Adding rcupdate.rcu_task_ipi_delay=50 boot param sort-of helped:
> > time ./test_progs -t trampoline_count
> > #101 trampoline_count:OK
> > Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
> > 
> > real	0m25.890s
> > user	0m0.124s
> > sys	0m1.507s
> > It is still awful.
> > 
> > From "perf report" there is little time spent in the kernel. The kernel is
> > waiting on something. I thought the rcu_tasks_trace update side should in
> > theory have been faster than rcu_tasks? Could it be a bug somewhere, such as
> > a missing wakeup? It doesn't feel like it works as intended. Whatever it is,
> > please try to reproduce it to remove me as a middle man.
> 
> On it.
> 
> To be fair, I was designing for a nominal one-second grace period,
> which was also the rough goal for rcu_tasks.
> 
> When do you need this by?
> 
> Left to myself, I will aim for the merge window after the upcoming one,
> and then backport to the prior -stable versions having RCU tasks trace.

That would be too late.
We would have to disable sleepable bpf progs or convert them to srcu.
bcc/bpftrace have a limit of 1000 probes for regexes to make sure
these tools don't add too many kprobes to the kernel at once.
Right now fentry/fexit/freplace are using a trampoline which does
synchronize_rcu_tasks(). My measurements show that it's roughly
equal to synchronize_rcu() on an idle box and perfectly capable of
being a replacement for kprobe based attaching.
It's not uncommon to attach a hundred kprobes or fentry probes at
start time, so the bpf trampoline has to be able to do 1000 in a second.
And that was the case before sleepable support got added to the trampoline.
Now it's doing:
synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace);
and that's causing this massive slowdown, which makes the bpf trampoline
pretty much unusable, and everything that builds on top suffers.
I can add a counter of sleepable progs to the trampoline and do
either sync rcu_tasks or sync_mult(tasks, tasks_trace),
but we discussed exactly that idea a few months back and concluded that
rcu_tasks is likely to be heavier than rcu_tasks_trace, so I didn't
bother with the counter. I can still add it, but slow rcu_tasks_trace
means that sleepable progs are not usable due to slow startup time,
so we have to do something about sleepable anyway.
So the answer to "when do you need this by?" is: asap.
I consider such changes to be a bugfix, not a feature.


