Re: RCU_BOOST not working for me

"Paul E. McKenney" <paulmck@xxxxxxxxxx> · Sat, 18 Jan 2020 21:49:24 -0800

On Sat, Jan 18, 2020 at 08:58:12PM -0500, Joel Fernandes wrote:
> On Sat, Jan 18, 2020 at 02:47:08PM -0800, Paul E. McKenney wrote:
> > On Sat, Jan 18, 2020 at 03:19:37PM -0500, Joel Fernandes wrote:
> > > On Fri, Jan 17, 2020 at 08:34:58PM -0800, Paul E. McKenney wrote:
> > > > On Fri, Jan 17, 2020 at 09:34:34PM -0500, Joel Fernandes wrote:
> > > > > On Fri, Jan 17, 2020 at 03:17:56PM -0800, Paul E. McKenney wrote:
> > > > > [...] 
> > > > > > But rcutorture already has tests for RCU priority boosting.  Or are those
> > > > > > failing in some way?
> > > > > 
> > > > > Yes there are tests, but I thought of just a simple experiment to study this.
> > > > > Purely since it is existing RCU kernel code that I'd like to understand. And
> > > > > Daniel and I are also looking into possibly using run-time / trace-based
> > > > > verification of some of these behaviors.
> > > > 
> > > > The functionality of rcu_state.cbovld should make that more entertaining.
> > > > 
> > > > But I would guess that the initial model would ignore memory footprint
> > > > and just model RCU priority boosting as kicking in a fixed time after
> > > > the beginning of the grace period.
> > > > 
> > > > Or do you guys have something else in mind?
> > > 
> > > Yes, that is the idea. And then turn the model into a unit test (for the
> > > measurement). Though I am also personally trying to convince myself that a
> > > unit test based on a model is better than the test in the kernel module I
> > > just posted. We're just looking at applying Daniel's modeling work to
> > > verification of behaviors like these.
> > > 
> > > A poor-man's alternative to a model-based test is just making sure that
> > > synchronize_rcu() finishes in a bounded period of time (basically test by
> > > observation rather than test by model), similar to what my kernel module did.  But I
> > > guess a model based test would be more accurate and more strict about what is
> > > considered a pass vs fail.
> > 
> > In one sense, fair enough.
> > 
> > But in a more practical sense, why would anyone put synchronize_rcu()
> > anywhere near their real-time fastpaths?  Even synchronize_rcu_expedited()
> > would be a rather brave choice for such a fastpath.
> 
> Oh, I was just talking in the context of a unit test for boost, such as the
> one I wrote. By measuring synchronize_rcu() time in the previous test I
> wrote, we can get a sense of whether the BOOST worked or not, since the point of
> BOOST is to shorten the otherwise lengthy grace period.

OK, agreed, for testing this makes much more sense.  ;-)
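
For concreteness, the pass/fail check under discussion might look roughly
like the sketch below, assuming the simple model mentioned earlier in which
boosting kicks in a fixed time after the start of the grace period.  The
struct, the function names, and the slack term are hypothetical and are not
taken from rcutorture or from the posted module.  In a kernel test, the
timestamps could be taken with something like ktime_get_ns() around the
synchronize_rcu() call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model parameters, not taken from the kernel. */
struct boost_model {
	uint64_t boost_delay_ns; /* assumed fixed delay from GP start to boost */
	uint64_t slack_ns;       /* assumed time for a boosted reader to finish */
};

/* Latest time by which the model expects the grace period to have ended. */
static uint64_t gp_deadline_ns(const struct boost_model *m,
			       uint64_t gp_start_ns)
{
	return gp_start_ns + m->boost_delay_ns + m->slack_ns;
}

/* Verdict for one observed grace period: true means "within the bound". */
static bool gp_within_bound(const struct boost_model *m,
			    uint64_t gp_start_ns, uint64_t gp_end_ns)
{
	assert(gp_end_ns >= gp_start_ns); /* timestamps must be monotonic */
	return gp_end_ns <= gp_deadline_ns(m, gp_start_ns);
}
```

A stricter model-based test would presumably replace the single slack term
with whatever the model says about the boosted reader's remaining work.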

> > > I was also studying SRCU and could not find tracepoints so I am thinking of
> > > adding some to aid the study. I know for Tree-SRCU you are using timers and
> > > workqueues, but the concept hasn't changed much since [1] was written,
> > > right?
> > 
> > At one point I had tracepoints for SRCU on my list, but the discussions
> > of tracepoints possibly being user API scared me off.
> 
> I find it hard to imagine why any sane userspace tooling would want to depend
> on RCU tracepoints. If they are just debug scripts like the one we've been
> thinking of writing, then that's fine.
> 
> The latest on "tracepoints as user API" that I learnt from the last few
> conferences is that if a tracepoint is so popular that userspace tools are
> known to use it, then it becomes ABI/API. If tracepoints are unused or
> unpopular, then they are not really API. I believe the existing RCU
> tracepoints haven't been shown to be an issue, so maybe it is OK for RCU?

Still, caution seems warranted, at least until the code accumulates
a bit more time.

> > SRCU has been rewritten from scratch something like three times since
> > that article was published.  The current version is only a few years old.
> > And there is some motivation for more modifications due to the size of
> > the srcu_struct structure.  (Maybe dynamically allocating the srcu_node
> > array or some such, though this is not free of hazard and hassle, either.)
> > Thus far, all complaints about the large size have been handled by other
> > means, but there have been several such complaints.  In addition, the
> > use of workqueues is still a bit on the experimental side.  Looking good
> > thus far, but the code is yet young.
> > 
> > But yes, had the code remained unchanged for 14 years, there wouldn't
> > be much downside to adding tracepoints.  But the code is less than three
> > years old.
> 
> Interesting! Thanks for the history.

History I can provide in abundance.  ;-)

							Thanx, Paul