RCU ideas discussed at LPC

Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> · Wed, 25 Dec 2019 17:41:04 -0500

Hi Paul,
We were discussing some ideas on Facebook, so I wanted to post them
here as well. This is in the context of the RCU session of the RT MC
(Real-Time Microconference):
https://www.youtube.com/watch?v=bpyFQJV5gCI

Detecting high kfree_rcu() load
----------
You brought this up. As I understand it, we did the kfree_rcu()
batching so that the system does nothing RCU-related until a batch
has filled up enough or a timeout has occurred. This makes the GP
thread and the system do less work.
The problem you raised in our Facebook thread is that, under heavy
load, the "batch" can be large and is eventually dumped into
call_rcu(). Wouldn't this be better handled generically within
call_rcu() itself, for the benefit of other non-kfree_rcu workloads?
That is, if a large number of callbacks is dumped, try to end the GP
more quickly. This likely doesn't need a signal from kfree_rcu(),
since call_rcu() knows that it is being hammered.
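To illustrate the idea, here is a minimal userspace sketch, not kernel
code: struct cb_queue, model_call_rcu(), model_gp_end() and
FLOOD_THRESHOLD are all hypothetical names, and the threshold value is
arbitrary. The point is only that the enqueue path sees every
callback, so it can flag a flood without any hint from kfree_rcu().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical threshold: pending callbacks beyond which we ask for a
 * faster grace period. Not a real kernel parameter. */
#define FLOOD_THRESHOLD 1000

/* Userspace model of one CPU's callback queue state. */
struct cb_queue {
	size_t pending;      /* callbacks queued but not yet invoked */
	bool   want_fast_gp; /* set when the queue looks flooded */
};

/* Model of the enqueue path. Every callback passes through here, so a
 * flood is visible whether it came from a kfree_rcu() batch dump or
 * from any other call_rcu() user. */
static void model_call_rcu(struct cb_queue *q, size_t ncbs)
{
	assert(ncbs > 0); /* a dump always carries at least one callback */
	q->pending += ncbs;
	if (q->pending > FLOOD_THRESHOLD)
		q->want_fast_gp = true;
}

/* Model of a grace period ending and all pending callbacks running. */
static void model_gp_end(struct cb_queue *q)
{
	q->pending = 0;
	q->want_fast_gp = false;
}
```

What "end the GP more quickly" would mean in practice (for example,
requesting an expedited grace period or forcing quiescent states
sooner) is left open; the sketch only sets a flag.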

Detecting recursive call_rcu() within call_rcu()
---------
We could use a per-CPU variable to detect this scenario, though I am
not sure whether preemption during call_rcu() itself would cause
false positives.
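A minimal userspace sketch of such a recursion guard, assuming a C11
_Thread_local counter as a stand-in for the per-CPU variable;
model_call_rcu(), inner_hook, recursive_hook() and the other names are
hypothetical. One observation from the sketch: a per-task (here
per-thread) counter sidesteps the preemption question, because a task
that preempts us sees its own copy. A true per-CPU counter would need
preemption disabled across the guarded region, otherwise a second
task on the same CPU could see the first task's increment and report
a false positive.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the per-CPU variable: thread-local storage, so each
 * thread (task) has its own nesting depth. */
static _Thread_local int call_rcu_depth;
static _Thread_local int recursion_hits;

typedef void (*rcu_cb_t)(void);

/* Hypothetical hook run while "inside" call_rcu(), e.g. debug or
 * tracing code that might itself end up calling call_rcu(). */
static rcu_cb_t inner_hook;

static void model_call_rcu(rcu_cb_t cb)
{
	(void)cb; /* actual enqueueing is not modeled */
	if (call_rcu_depth > 0) {
		recursion_hits++; /* re-entered from inside call_rcu() */
		return;           /* a kernel version might WARN_ONCE() here */
	}
	call_rcu_depth++;
	if (inner_hook)
		inner_hook(); /* code running inside the guarded region */
	call_rcu_depth--;
	assert(call_rcu_depth == 0); /* only the outermost call gets here */
}

static void noop_cb(void) { }

/* Example hook that recursively calls model_call_rcu(). */
static void recursive_hook(void)
{
	model_call_rcu(noop_cb);
}
```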

All rcuogp and rcuop threads tied to a housekeeping CPU
---
At LPC you mentioned the problem of OOM if all rcuo* threads,
including the GP one, are not able to keep up with heavy load. On
Facebook I had proposed something like this: what about making the
affinity setting a "soft affinity", that is, always respecting it
except in the uncommon case? In the uncommon case of heavy load, let
the threads run anywhere to prevent OOM. Sure, that might make the
system a little more disruptive, but if we are approaching OOM we
have bigger problems, right?
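A sketch of the soft-affinity policy, assuming CPU masks are modeled
as 64-bit bitmasks and that "heavy load" is measured by a
pending-callback count compared against a hypothetical
SOFT_AFFINITY_HIGH_WATER constant; rcuo_allowed_cpus() is likewise a
made-up name.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: pending-callback count above which the rcuo* kthreads
 * are allowed off the housekeeping CPUs. */
#define SOFT_AFFINITY_HIGH_WATER 100000UL

/* Pick the CPU mask (bit i = CPU i) the rcuo* kthreads may run on.
 * Common case: honor the housekeeping mask strictly.
 * Uncommon heavy-load case: widen to all online CPUs to avoid OOM,
 * accepting some disruption of the isolated CPUs. */
static uint64_t rcuo_allowed_cpus(uint64_t housekeeping, uint64_t online,
				  unsigned long pending_cbs)
{
	/* At least one housekeeping CPU must be online. */
	assert((housekeeping & online) != 0);
	if (pending_cbs > SOFT_AFFINITY_HIGH_WATER)
		return online;
	return housekeeping & online;
}
```

A real policy would probably want hysteresis (separate high and low
watermarks) so the kthreads do not bounce between the two masks; in
the kernel the widened mask would presumably be applied with
something like set_cpus_allowed_ptr().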

Peter mentioned that rcuogp0 should have slightly higher priority than rcuop0
---------
You mentioned this is something to look into, but I am not sure
whether we have looked into it yet.

A "heavy" call_rcu() caller using synchronize_rcu() if too many
callbacks are dumped
---------
How about doing this kind of call_rcu() to synchronize_rcu()
transition automatically if the context allows it? That is, detect
the context and, if sleeping is allowed, wait for the grace period
synchronously in call_rcu(). I am not sure about deadlocks and the
like from this kind of waiting, and I have to think about it more.
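A userspace model of that fallback, assuming a can_sleep flag stands
in for whatever context check the kernel would actually need, and
that model_synchronize_rcu() only counts waits instead of really
blocking. SYNC_FALLBACK_THRESHOLD, struct model_rcu and the other
names are hypothetical, and the deadlock question above is not
addressed by it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical threshold above which a sleepable caller waits for a
 * grace period instead of queueing its callback. */
#define SYNC_FALLBACK_THRESHOLD 10000UL

typedef void (*rcu_func_t)(void *arg);

struct model_rcu {
	unsigned long pending;  /* callbacks queued, not yet invoked */
	unsigned long gp_waits; /* synchronous grace-period waits done */
};

/* Stand-in for synchronize_rcu(); here it only counts the wait. */
static void model_synchronize_rcu(struct model_rcu *r)
{
	r->gp_waits++;
}

/* Returns true if cb was invoked synchronously, false if queued. */
static bool model_call_rcu(struct model_rcu *r, rcu_func_t cb, void *arg,
			   bool can_sleep)
{
	assert(cb != NULL);
	if (can_sleep && r->pending > SYNC_FALLBACK_THRESHOLD) {
		model_synchronize_rcu(r); /* wait for a full grace period */
		cb(arg);                  /* pre-existing readers are done */
		return true;
	}
	r->pending++; /* normal path: queue for later invocation */
	return false;
}

/* Example callback: counts how many times it ran. */
static void count_cb(void *arg)
{
	(*(int *)arg)++;
}
```

A side effect of the synchronous path is throttling: a caller that
keeps flooding is slowed to roughly one callback per grace period.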

Is the square root of N the right number of rcuogp threads?
---------
The question raised was whether we can do with fewer threads, or even
just one. You mentioned the square root might not be the right
choice. How do we test how well the system is doing? Are you running
rcutorture with a certain tree configuration and monitoring memory
footprint / performance?
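For discussing alternatives, a simplified model of the grouping may
help. It assumes CPUs are split into groups of "stride" consecutive
CPUs with one GP kthread per group and a default stride of about
sqrt(N); the kernel's exact formula may differ, and isqrt(),
gp_kthreads() and default_stride() are hypothetical names.

```c
#include <assert.h>

/* Largest r with r*r <= n (userspace stand-in for an integer sqrt). */
static unsigned int isqrt(unsigned int n)
{
	unsigned int r = 0;

	while ((unsigned long long)(r + 1) * (r + 1) <= n)
		r++;
	return r;
}

/* Number of GP kthreads when each one serves a group of `stride`
 * consecutive CPUs (ceiling division covers a short last group). */
static unsigned int gp_kthreads(unsigned int ncpus, unsigned int stride)
{
	assert(stride > 0);
	return (ncpus + stride - 1) / stride;
}

/* Default stride in this model: about sqrt(N), at least 1. */
static unsigned int default_stride(unsigned int ncpus)
{
	unsigned int s = isqrt(ncpus);

	return s ? s : 1;
}
```

With 64 CPUs the default stride of 8 gives 8 GP kthreads, each
watching 8 CPUs, while a stride of 64 gives the "just one" option: a
single GP kthread watching all 64 CPUs. If the cost were simply the
number of GP kthreads (N/s) plus the number of CPUs each one watches
(s), the sum s + N/s is minimized at s = sqrt(N), which is one way to
read the square-root choice; whether that cost model matches real
memory footprint and latency is what testing would have to show.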

BTW, I have 2 interns working on RCU (Amol and Madupharna, also on
CC). They were selected from among several others as part of the
Linux Foundation mentorship program. They are familiar with RCU. I
have asked them to look at some RCU-list work and RCU sparse work.
However, I can also have them look into a few other things as time
permits, depending on what interests them.

Thanks, Merry Christmas!

 - Joel