Re: unexpected result with rcu_nocbs option

On Thu, Aug 01, 2024 at 12:32:52PM -0400, Olivier Langlois wrote:
> On Thu, 2024-08-01 at 08:28 -0700, Paul E. McKenney wrote:
> > These handle grace periods, each for a group of CPUs.  You should
> > have one rcuog kthread for each group of roughly sqrt(nr_cpu_ids)
> > CPUs that contains at least one offloaded CPU, in your case sqrt(4),
> > which is 2.  You could use the rcutree.rcu_nocb_gp_stride kernel boot
> > parameter to override this default, for example, you might want
> > rcutree.rcu_nocb_gp_stride=4 in your case.
> 
> oh, thanks. I will look into it to see if it changes anything in my
> situation, and if it does, I will report back here.
> 
> with your explanation, what I am seeing makes more sense.
> 
> > > > I do have a
> > > >      31     3 rcuos/1
> > > > 
> > > > I am not familiar enough with rcu to know what rcuos is for.
> > 
> > This is the kthread that invokes the callbacks for CPU 1, assuming you
> > have a non-preemptible kernel (otherwise rcuop/1, for historical reasons
> > that seemed like a good idea at the time).  Do you also have an rcuos/2?
> > (See the help text for CONFIG_RCU_NOCB_CPU.)
> 
> yes I do.
> 
> $ ps -eo pid,cpuid,comm | grep rcu
>       4     0 kworker/R-rcu_gp
>       8     0 kworker/0:0-rcu_gp
>      14     0 rcu_tasks_rude_kthread
>      15     0 rcu_tasks_trace_kthread
>      17     3 rcu_sched
>      18     3 rcuog/0
>      19     0 rcuos/0
>      20     0 rcu_exp_par_gp_kthread_worker/0
>      21     3 rcu_exp_gp_kthread_worker
>      31     3 rcuos/1
>      38     3 rcuog/2
>      39     3 rcuos/2
>      46     0 rcuos/3

This looks like you had either nohz_full=0-3 or rcu_nocbs=0-3, given
that you have rcuos kthreads for all four of your CPUs.  Or perhaps some
other setting that implied one or the other of these.
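
If it is useful, here is a rough sketch of how to double-check what the
running kernel actually ended up with (paths assume a reasonably recent
kernel with the usual sysfs layout):

$ cat /proc/cmdline                        # the nohz_full=/rcu_nocbs= options as booted
$ cat /sys/devices/system/cpu/nohz_full    # CPUs currently running tick-offloaded
$ cat /sys/devices/system/cpu/isolated     # CPUs isolated from the general scheduler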

> and yes, my kernel is built without CONFIG_PREEMPT. Since my system
> consists of 3 isolated cpus out of 4, I figured that there was not much
> to preempt to justify the overhead that comes along with the feature.
> 
> but here again, I am thankful for the cue... If all else fails, running
> the setup with CONFIG_PREEMPT enabled to see if it changes anything
> might be something interesting to try.

To be honest, I am not sure that CONFIG_PREEMPT will help.  I was just
checking my understanding of your setup.  But it cannot hurt to try it.

> > > > the absence of rcuog/1 is causing rcu_irq_work_resched() to raise
> > > > an interrupt every 2-3 seconds on cpu1.
> > 
> > Did you build with CONFIG_RCU_LAZY=y?
> 
> no. I was not even aware that it existed. I left the default setting
> alone!

Worth a try, as this is what it is designed for.
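
A rough sketch of one way to check whether your kernel already has it,
assuming your distro ships the build config in the usual places:

$ grep RCU_LAZY /boot/config-$(uname -r)
$ zgrep RCU_LAZY /proc/config.gz    # if CONFIG_IKCONFIG_PROC is enabled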

> > Did you use something like taskset to confine the rcuog and rcuos
> > kthreads to CPUs 0 and 3 (you seem to have 4 CPUs)?
> 
> I have:
> tuna move -t rcu* -c 3
> 
> in some bash script launched at startup by systemctl.
> 
> I was a bit suspicious of some of these kernel threads still on CPU0...
> 
> I did try to move them manually with taskset, i.e.:
> 
> 19     0 rcuos/0
> 
> $ sudo taskset -pc 3 19
> pid 19's current affinity list: 3
> pid 19's new affinity list: 3
> 
> so it appears that tuna does its job correctly. I guess that the
> setting is effective only when the scheduler has something to do with
> that task.

OK.
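
One thing that might be worth doing is sweeping all of the RCU kthreads
in one pass, just to confirm nothing was missed by the startup script (a
rough sketch, assuming pgrep and taskset are available; it only reports,
it does not change anything):

$ for pid in $(pgrep '^rcu'); do taskset -pc "$pid"; done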

> > Might that interrupt be due to a call_rcu() on CPU 1?  If so, can the
> > work causing that call_rcu() be placed on some other CPU?
> 
> I am really shooting in the dark with this glitch. I don't know for
> sure how to find out where these interrupts originate.
> 
> 2 threads from the same process are assigned to cpu1 and cpu2.
> 
> It is very satisfying to see zero interrupts occurring on cpu2!
> I have been struggling very hard for about a week to reach this state.
> 
> I have been trying to find out what could be happening on cpu1 that
> makes the IWI and LOC counts slowly grow there...
> 
> the only task running on CPU1 is my thread. I have enabled
> /sys/kernel/tracing/events/syscalls
> 
> the only 3 syscalls happening on this cpu are:
> 1. gettid()
> 2. futex()
> 
> from a glibc pthread_mutex used with a pthread_cond
> 
> 3. madvise(MADV_DONTNEED)
> (from tcmalloc, I think. I am not even sure if madvise(MADV_DONTNEED)
> can trigger RCU. Some parts of mm do, but I could not tell if this
> specific call could.)
> 
> all the memory allocation, I think, is coming from openssl processing
> about 20 TCP data streams. (Network I/O is done on cpu0.)
> 
> besides that, I use clock_gettime() to benchmark the execution time of
> a function of mine, but it does not show up in the syscalls trace. I
> guess this has something to do with vDSO.
> 
> here are my measurements (in nanosecs):
> 
> avg:125.080, max:27188
> 
> The function is simple. A C++ virtual function + a 16-byte
> Boost::atomic::load(std::memory_order::acquire) that is alone on a
> cacheline to avoid any chance of false sharing.
> (Boost supports DWCAS; gcc's std::atomic does not.)
> 
> I think that the 27 µs monstrosity is because of those pesky
> interrupts...

It is also worth using something like ftrace or bpftrace to track
down what is instigating that extra interrupt, if you have not already
done so.  (Brendan Gregg's blog and other publications are a great source
of information on doing this sort of thing.)
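
For example, a rough sketch (assuming bpftrace and kernel symbols are
available on your box) that would show which code paths are invoking
call_rcu() on CPU 1, which is the possibility I asked about above:

$ sudo bpftrace -e 'kprobe:call_rcu /cpu == 1/ { @[kstack] = count(); }'

Let it run for a while, then Ctrl-C prints each unique kernel stack with
a count, which should make it fairly obvious whether (say) that
madvise(MADV_DONTNEED) call is the one queueing callbacks.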

There are also quite a few people who are much more experienced and
knowledgeable about reducing unwanted activity on worker CPUs than I am.
RCU I can help you with, usually, anyway.  ;-)

							Thanx, Paul



