Re: unexpected result with rcu_nocbs option

On Thu, 2024-08-01 at 08:28 -0700, Paul E. McKenney wrote:
> These handle grace periods, each for a group of CPUs.  You should have
> one rcuog kthread for each group of roughly sqrt(nr_cpu_ids) that
> contains at least one offloaded CPU, in your case, sqrt(4), which is 2.
> You could use the rcutree.rcu_nocb_gp_stride kernel boot parameter to
> override this default, for example, you might want
> rcutree.rcu_nocb_gp_stride=4 in your case.

oh, thanks. I will look into it to see if it changes anything in my
situation and, if it does, I will report back here.

with your explanation, what I am seeing makes more sense.
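
if I end up trying the stride override, I guess the boot line would
look roughly like this (the rcu_nocbs value is just illustrative; all
four CPUs seem to be offloaded given the rcuos/0-3 kthreads below):

rcu_nocbs=0-3 rcutree.rcu_nocb_gp_stride=4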

> 
> > > I do have a
> > >      31     3 rcuos/1
> > > 
> > > I am not familiar enough with rcu to know what rcuos is for.
> 
> This is the kthread that invokes the callbacks for CPU 1, assuming you
> have a non-preemptible kernel (otherwise rcuop/1 for historical
> reasons that seemed like a good idea at the time).  Do you also have
> an rcuos/2?  (See the help text for CONFIG_RCU_NOCB_CPU.)

yes I do.

$ ps -eo pid,cpuid,comm | grep rcu
      4     0 kworker/R-rcu_gp
      8     0 kworker/0:0-rcu_gp
     14     0 rcu_tasks_rude_kthread
     15     0 rcu_tasks_trace_kthread
     17     3 rcu_sched
     18     3 rcuog/0
     19     0 rcuos/0
     20     0 rcu_exp_par_gp_kthread_worker/0
     21     3 rcu_exp_gp_kthread_worker
     31     3 rcuos/1
     38     3 rcuog/2
     39     3 rcuos/2
     46     0 rcuos/3

and yes, my kernel is built without CONFIG_PREEMPT. Since my system
consists of 3 isolated CPUs out of 4, I figured there was not much to
preempt to justify the overhead that comes with the feature.

but here again, I am thankful for the cue... If all else fails, running
the setup with CONFIG_PREEMPT enabled to see if it changes anything
might be something interesting to try.

> 
> > > the absence of rcuog/1 is causing rcu_irq_work_resched() to raise
> > > an interrupt every 2-3 seconds on cpu1.
> 
> Did you build with CONFIG_RCU_LAZY=y?

no. I was not even aware that it existed. I left the default setting
alone!
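
(for what it's worth, the option can be checked with something like
grep RCU_LAZY /boot/config-$(uname -r), or in /proc/config.gz if
CONFIG_IKCONFIG_PROC is enabled.)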
> 
> Did you use something like taskset to confine the rcuog and rcuos
> kthreads to CPUs 0 and 3 (you seem to have 4 CPUs)?

I have:
tuna move -t rcu* -c 3

in some bash script launched at startup by systemctl.

I was a bit suspicious about some of these kernel threads still being
on CPU0...

I did try to move them manually with taskset, i.e.:

19     0 rcuos/0

$ sudo taskset -pc 3 19
pid 19's current affinity list: 3
pid 19's new affinity list: 3

so it appears that tuna does its job correctly. I guess that the
setting only becomes visible once the scheduler has something to do
with that task.

> 
> Might that interrupt be due to a call_rcu() on CPU 1?  If so, can the
> work causing that call_rcu() be placed on some other CPU?
I am really shooting in the dark with this glitch. I don't know for
sure how to find out where these interrupts are originating from.

2 threads from the same process are assigned to cpu1 and cpu2.

It is very satisfying to see zero interrupts occurring on cpu2!
I have been struggling very hard for about a week to reach this state.

I have tried to find out what could be happening on cpu1 that makes the
IWI and LOC counts slowly grow there...
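
for reference, IWI and LOC are rows in /proc/interrupts (one column per
CPU); a quick way to watch them is something like:

$ watch -n1 "grep -E 'IWI:|LOC:' /proc/interrupts"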

the only task running on CPU1 is my thread. I have enabled
/sys/kernel/tracing/events/syscalls
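
(roughly like this, as root; the cpumask value 2 restricts tracing to
CPU1 only:)

# echo 2 > /sys/kernel/tracing/tracing_cpumask
# echo 1 > /sys/kernel/tracing/events/syscalls/enable
# cat /sys/kernel/tracing/per_cpu/cpu1/trace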

the only 3 syscalls happening on this cpu are:
1. gettid()
2. futex()

from a glibc pthread_mutex used with a pthread_cond

3. madvise(MADV_DONTNEED)
(from tcmalloc, I think. I am not even sure whether madvise(MADV_DONTNEED)
can trigger RCU. Some parts of mm do, but I could not tell if this
specific call does.)

all the memory allocation, I think, is coming from OpenSSL processing
about 20 TCP data streams. (Network I/O is done on cpu0.)

besides that, I use clock_gettime() to benchmark the execution time of
a function of mine, but it does not show up in the syscall trace. I
guess this has something to do with the vDSO.
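
one way I could double-check that the vDSO is what hides it would be
something like this (no clock_gettime syscalls should show up if the
vDSO fast path is used; ./myapp is just a placeholder for my binary):

$ strace -f -e trace=clock_gettime ./myapp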

here are my measurements (in nanosecs):

avg:125.080, max:27188

The function is simple: a C++ virtual function call + a 16-byte
boost::atomic load(memory_order_acquire) that sits alone on a cache
line to avoid any chance of false sharing.
(Boost.Atomic supports DWCAS; gcc's std::atomic does not.)

I think that the 27 µs monstrosity is because of those pesky
interrupts...
