On 9/10/20 11:51 AM, Paul E. McKenney wrote:
On Thu, Sep 10, 2020 at 11:33:58AM -0700, Alexei Starovoitov wrote:
On 9/9/20 10:27 PM, Paul E. McKenney wrote:
On Wed, Sep 09, 2020 at 02:22:12PM -0700, Paul E. McKenney wrote:
On Wed, Sep 09, 2020 at 02:04:47PM -0700, Paul E. McKenney wrote:
On Wed, Sep 09, 2020 at 12:48:28PM -0700, Alexei Starovoitov wrote:
On Wed, Sep 09, 2020 at 12:39:00PM -0700, Paul E. McKenney wrote:
[ . . . ]
My plan is to try the following:
1. Parameterize the backoff sequence so that RCU Tasks Trace
uses faster rechecking than does RCU Tasks. Experiment as
needed to arrive at a good backoff value (see the sketch below).
2. If the tasks-list scan turns out to be a tighter bottleneck
than the backoff waits, look into parallelizing this scan.
(This seems unlikely, but the fact remains that RCU Tasks
Trace must do a bit more work per task than RCU Tasks.)
3. If these two approaches still don't get the update-side
latency where it needs to be, improvise.
The exact path into mainline will of course depend on how far down this
list I must go, but the first step is to get a solution.
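
For concreteness, a minimal sketch of the per-flavor backoff
parameterization that item 1 above describes; every name and value below
is an illustrative stand-in, not the actual patches:

#include <linux/list.h>
#include <linux/sched.h>

struct tasks_gp_params_sketch {
	unsigned long init_delay;	/* First recheck delay (jiffies). */
	unsigned long max_delay;	/* Backoff ceiling (jiffies). */
};

/* Hypothetical helper that rescans the holdout list. */
static void recheck_holdouts_sketch(struct list_head *holdouts);

static void wait_for_holdouts_sketch(struct tasks_gp_params_sketch *p,
				     struct list_head *holdouts)
{
	unsigned long delay = p->init_delay;

	while (!list_empty(holdouts)) {
		schedule_timeout_idle(delay);		/* Back off... */
		recheck_holdouts_sketch(holdouts);	/* ...then recheck. */
		if (delay < p->max_delay)
			delay *= 2;	/* Exponential backoff, capped. */
	}
}

/* RCU Tasks Trace would start rechecking sooner than RCU Tasks would: */
static struct tasks_gp_params_sketch tasks_trace_params_sketch = {
	.init_delay = HZ / 200,
	.max_delay = HZ / 10,
};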
I think there is also a case 4: nothing is inside an rcu_trace critical section.
I would expect a single IPI would confirm that.
Unless the task moves, yes. So a single IPI should suffice in the
common case.
And what I am doing now is checking code paths.
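
A hedged sketch of the single-IPI check being discussed, assuming a
hypothetical per-task reader-nesting counter; the ->trc_reader_nesting
field name and the *_sketch helpers here are illustrative, not the
actual RCU Tasks Trace code:

#include <linux/sched.h>
#include <linux/smp.h>

struct reader_check_sketch {
	struct task_struct *task;
	bool in_reader;		/* Conservatively assume "yes" if unsure. */
};

static void reader_check_ipi_sketch(void *arg)
{
	struct reader_check_sketch *res = arg;

	/*
	 * Runs on what was the task's CPU.  If the task is still running
	 * here and its (illustrative) nesting count is zero, it is not
	 * inside a read-side critical section.  If the task has moved,
	 * leave in_reader set so the grace period still waits on it.
	 */
	if (current == res->task)
		res->in_reader = READ_ONCE(current->trc_reader_nesting) != 0;
}

static bool task_in_reader_sketch(struct task_struct *t)
{
	struct reader_check_sketch res = { .task = t, .in_reader = true };

	/* Final argument 1: wait for the handler so that res is stable. */
	smp_call_function_single(task_cpu(t), reader_check_ipi_sketch, &res, 1);
	return res.in_reader;
}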
And the following diff from a set of three patches gets my average
RCU Tasks Trace grace-period latencies down to about 20 milliseconds,
almost a 50x improvement from earlier today.
These are still quite rough and not yet suited for production use, but
I will be testing. If that goes well, I hope to send a more polished
set of patches by end of day tomorrow, Pacific Time. But if you get a
chance to test them, I would value any feedback that you might have.
These patches do not require hand-tuning; instead, they adjust their
behavior according to CONFIG_TASKS_TRACE_RCU_READ_MB, which in turn
adjusts according to CONFIG_PREEMPT_RT. So you should get the desired
latency reductions "out of the box", again without tuning.
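
A rough illustration of what "adjusting according to
CONFIG_TASKS_TRACE_RCU_READ_MB" could look like; the specific delays and
the helper name are made up, but the idea is that the default keys off
the Kconfig option rather than a hand-set knob:

#include <linux/kernel.h>

static unsigned long tasks_trace_recheck_delay_sketch(void)
{
	if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB))
		return HZ / 10;	/* Favor low CPU overhead (PREEMPT_RT case). */
	return HZ / 200;	/* Favor short grace-period latency. */
}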
Great. Confirming improvement :)
time ./test_progs -t trampoline_count
#101 trampoline_count:OK
Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
real 0m2.897s
user 0m0.128s
sys 0m1.527s
This is without CONFIG_TASKS_TRACE_RCU_READ_MB, of course.
Good to hear, thank you!
Or is more required? I can tweak to get more. There is never a free
lunch, though, and in this case the downside of further tweaking would
be greater CPU overhead. Alternatively, I could just as easily tweak
it to be slower, thereby reducing the CPU overhead.
If I don't hear otherwise, I will assume that the current settings
work fine.
Now it looks like sync rcu_tasks_trace is not slower than
rcu_tasks, so it would only make sense to accelerate both at the
same time.
I think for now it's good.
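
If both flavors do need to be waited on together, one way to overlap the
two grace periods rather than paying for them back to back is
synchronize_rcu_mult(); a sketch, assuming both flavors are configured
in (this is not necessarily what the BPF trampoline code does):

#include <linux/rcupdate_wait.h>

static void wait_for_both_flavors_sketch(void)
{
	/*
	 * Wait for an RCU Tasks and an RCU Tasks Trace grace period
	 * concurrently, so the total wait is roughly the longer of the
	 * two rather than their sum.
	 */
	synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace);
}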
Of course, if people start removing thousands of BPF programs at one go,
I suspect that it will be necessary to provide a bulk-removal operation,
similar to some of the bulk-configuration-change operations provided by
networking. The idea is to have a single RCU Tasks Trace grace period
cover all of the thousands of BPF removal operations.
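
As a rough sketch of the bulk idea (every *_sketch name here is
hypothetical): unlink all of the to-be-removed programs first, wait for
a single RCU Tasks Trace grace period, and only then free them all, so
that one grace period is amortized over the whole batch:

#include <linux/bpf.h>
#include <linux/rcupdate.h>

static void unlink_prog_sketch(struct bpf_prog *prog);	/* Hypothetical. */
static void free_prog_sketch(struct bpf_prog *prog);	/* Hypothetical. */

static void bulk_remove_sketch(struct bpf_prog **progs, int n)
{
	int i;

	for (i = 0; i < n; i++)
		unlink_prog_sketch(progs[i]);	/* Make them all unreachable. */

	synchronize_rcu_tasks_trace();		/* One grace period for all N. */

	for (i = 0; i < n; i++)
		free_prog_sketch(progs[i]);	/* Now safe to free. */
}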
A bulk API won't really work for user space.
There is no good way to coordinate attaching different progs (or the
same prog) to many different places.