Re: [GIT PULL] RCU changes for v6.7

"Paul E. McKenney" <paulmck@xxxxxxxxxx> · Wed, 1 Nov 2023 10:40:14 -0700

On Wed, Nov 01, 2023 at 07:11:54AM -1000, Linus Torvalds wrote:
> On Tue, 31 Oct 2023 at 15:08, Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> >
> > Here are the ways forward I can see:
> >
> > 1.      Status quo.  This has all the issues that you call out.
> >         People will hurt themselves with it and consume time and effort.
> >         So let's not do this.
> 
> Well, at a *minimum*, I really want that notifier chain call to be
> done *after* the core printk's.
> 
> That way, if it deadlocks or does something else stupid, at least the
> core printouts make it out.
> 
> IOW, I think the notifier should be done perhaps just before the
> "panic_on_rcu_stall()" call, not at the top before you've even
> reported any stall conditions at all.

Understood.  But my problem is that the core printk()s destroy the state
that the notifier is trying to output.

> And yes, I think the trace_rcu_stall_warning() might be better off
> later too, but at least trace events are things that get regular
> testing in nasty conditions (including NMI etc), so I'm *much* less
> worried about those than about "random developers who think they know
> what they do and add a notifier".

Agreed, this is a special debug facility, not something that anyone
should use in production.  And also not something that should be used
where gdb would do the job.

> And yes, I do think the notifier should be narrowed down a lot, if you
> actually want to keep it.

Understood, thus a new default-disabled Kconfig option that depends on
RCU_EXPERT and DEBUG_KERNEL, along with a default-disabled kernel
boot parameter, both of which have to be selected to make anything
happen.
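
A rough userspace sketch of that double gate follows.  It is not the
fixup patch, and every name in it (SKETCH_STALL_NOTIFIER, the
"sketch_stall_notifier=1" string, and the sketch_*() helpers) is
hypothetical and made up for illustration.  The macro stands in for the
default-n Kconfig option and the flag stands in for the default-disabled
boot parameter; notifiers would be allowed to run only if both are set.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical stand-in for the default-n Kconfig option.  Build with
 * -DSKETCH_STALL_NOTIFIER=1 to model a specially built debug kernel.
 */
#ifndef SKETCH_STALL_NOTIFIER
#define SKETCH_STALL_NOTIFIER 0
#endif

/* Hypothetical stand-in for the boot parameter, which defaults to off. */
static bool sketch_boot_param_enabled;

/* Model of command-line parsing: only an exact match enables the flag. */
static void sketch_parse_cmdline(const char *arg)
{
	if (arg && strcmp(arg, "sketch_stall_notifier=1") == 0)
		sketch_boot_param_enabled = true;
}

/* Pure form of the gate: both the build option and the boot flag are needed. */
static bool sketch_gate(bool built_in, bool boot_param)
{
	return built_in && boot_param;
}

/* The check a stall-warning path would make before calling any notifier. */
static bool sketch_stall_notifier_allowed(void)
{
	return sketch_gate(SKETCH_STALL_NOTIFIER != 0, sketch_boot_param_enabled);
}
```

With the default build (macro left at 0), sketch_stall_notifier_allowed()
stays false no matter what is passed on the command line, which is the
point of requiring both selections.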

> I did not actually hear you say that there is a good use-case for it.
> I only saw you say "Those of us who need this", without showing *any*
> kind of indication of why anybody would use it in reality.
> 
> Why the secrecy? There is certainly no current user, nor any
> description of what a user would be and what makes that notifier
> useful.
> 
> The commit message also just says "It is sometimes helpful" and some
> strange reference to "the subsystem causing the stall to dump its
> state". It all sounds very fishy. Why would anybody ever have a known
> subsystem causing RCU stalls? Except, of course, for the rcutorture
> testing.

One use case is dumping out the qspinlock state for an extremely
rare lockup.  If you even look at the system cross-eyed, the lockup
goes away.  And yes, I should have mentioned this in the commit
log, and I apologize for having failed to do so.  I do not expect
that the state-dump code would ever be appropriate for mainline.

> Anyway, that all absolutely SCREAMS to me "this is not something
> useful in any normal kernel", and so yes:

Agreed, definitely not for any normal kernel!

> > 3.      Add a default-n Kconfig option that depends on RCU_EXPERT
> >         and DEBUG_KERNEL, so that these problems can only arise in
> >         specially built kernels.
> >
> > 4.      Same as #3, but use a kernel boot parameter instead of a
> >         Kconfig option.
> 
> let's make it clear that this is *not* something that any upstream
> kernel would ever do, and the *only* possible use for it is some kind
> of external temporary debug patch.
> 
> See why I so hate things like this? Let's head off any crazy use long
> *long* before somebody decides that "Oh, I want to use this".

You are absolutely right, a debug tool with this many sharp edges should
definitely not be default-enabled.  And needs some scary words in the
Kconfig help text.  And a boot-time splat to make people think twice
before using it.

Apologies for not having thought this through!

I will send a fixup patch before the end of today.

							Thanx, Paul