On Wed, Nov 01, 2023 at 07:11:54AM -1000, Linus Torvalds wrote:
> On Tue, 31 Oct 2023 at 15:08, Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> >
> > Here are the ways forward I can see:
> >
> > 1. Status quo. This has all the issues that you call out.
> > People will hurt themselves with it and consume time and effort.
> > So let's not do this.
>
> Well, at a *minimum*, I really want that notifier chain call to be
> done *after* the core printk's.
>
> That way, if it deadlocks or does something else stupid, at least the
> core printouts make it out.
>
> IOW, I think the notifier should be done perhaps just before the
> "panic_on_rcu_stall()" call, not at the top before you've even
> reported any stall conditions at all.

Understood. But my problem is that the core printk()s destroy the
state that the notifier is trying to output.

> And yes, I think the trace_rcu_stall_warning() might be better off
> later too, but at least trace events are things that get regular
> testing in nasty conditions (including NMI etc), so I'm *much* less
> worried about those than about "random developers who think they know
> what they do and add a notifier".

Agreed, this is a special debug facility, not something that anyone
should use in production. And also not something that should be used
where gdb would do the job.

> And yes, I do think the notifier should be narrowed down a lot, if you
> actually want to keep it.

Understood, thus a new default-disabled Kconfig option that depends on
RCU_EXPERT and DEBUG_KERNEL, along with a default-disabled kernel boot
parameter, both of which have to be selected to make anything happen.

> I did not actually hear you say that there is a good use-case for it.
> I only saw you say "Those of us who need this", without showing *any*
> kind of indication of why anybody would use it in reality.
>
> Why the secrecy? There is certainly no current user, nor any
> description of what a user would be and what makes that notifier
> useful.
>
> The commit message also just says "It is sometimes helpful" and some
> strange reference to "the subsystem causing the stall to dump its
> state". It all sounds very fishy. Why would anybody ever have a known
> subsystem causing RCU stalls? Except, of course, for the rcutorture
> testing.

One use case is dumping out the qspinlock state for an extremely rare
lockup. If you even look at the system cross-eyed, the lockup goes
away. And yes, I should have mentioned this in the commit log, and I
apologize for having failed to do so. I do not expect that the
state-dump code would ever be appropriate for mainline.

> Anyway, that all absolutely SCREAMS to me "this is not something
> useful in any normal kernel", and so yes:

Agreed, definitely not for any normal kernel!

> > 3. Add a default-n Kconfig option that depends on RCU_EXPERT
> > and DEBUG_KERNEL, so that these problems can only arise in
> > specially built kernels.
> >
> > 4. Same as #3, but use a kernel boot parameter instead of a
> > Kconfig option.
>
> let's make it clear that this is *not* something that any upstream
> kernel would ever do, and the *only* possible use for it is some kind
> of external temporary debug patch.
>
> See why I so hate things like this? Let's head off any crazy use long
> *long* before somebody decides that "Oh, I want to use this".

You are absolutely right, a debug tool with this many sharp edges
should definitely not be default-enabled. And needs some scary words in
the Kconfig help text. And a boot-time splat to make people think twice
before using it.

Apologies for not having thought this through! I will send a fixup
patch before the end of today.

							Thanx, Paul
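The thread turns on two mechanisms: running the stall notifier chain only after the core stall printk()s (just before the panic_on_rcu_stall() check), and refusing to do anything unless both a default-n Kconfig option and a default-disabled boot parameter are set. What follows is a minimal userspace sketch in C of that ordering and gating, offered for illustration only. Every name in it (stall_notifier_register(), report_stall(), config_stall_notifier, boot_param_stall_notifiers, dump_debug_state()) is hypothetical. It is not the kernel's notifier API and not the fixup patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical userspace model of an RCU CPU stall notifier chain.
 * None of these names are kernel APIs; they only illustrate the
 * ordering and the double opt-in discussed in the thread.
 */
#define MAX_STALL_NOTIFIERS 4

typedef void (*stall_notifier_fn)(unsigned long stall_jiffies);

/* Models a default-n Kconfig option (depends on RCU_EXPERT && DEBUG_KERNEL). */
static bool config_stall_notifier;
/* Models a default-disabled kernel boot parameter. */
static bool boot_param_stall_notifiers;

static stall_notifier_fn stall_chain[MAX_STALL_NOTIFIERS];
static int n_stall_notifiers;

/* Console stand-in: records messages in the order they are emitted. */
static char console[512];

static void console_puts(const char *msg)
{
	size_t room = sizeof(console) - strlen(console) - 1;

	strncat(console, msg, room);
	room = sizeof(console) - strlen(console) - 1;
	strncat(console, "\n", room);
}

/* Registration fails unless BOTH opt-in knobs have been set. */
static int stall_notifier_register(stall_notifier_fn fn)
{
	if (!config_stall_notifier || !boot_param_stall_notifiers)
		return -EOPNOTSUPP;
	if (n_stall_notifiers >= MAX_STALL_NOTIFIERS)
		return -ENOSPC;
	stall_chain[n_stall_notifiers++] = fn;
	return 0;
}

/* Stand-in for the panic_on_rcu_stall() check. */
static void panic_on_stall_check(void)
{
	console_puts("panic check");
}

/*
 * Emit the core stall report first, so it reaches the console even if
 * a notifier later deadlocks; run notifiers just before the panic check.
 */
static void report_stall(unsigned long stall_jiffies)
{
	int i;

	console_puts("core stall report");
	for (i = 0; i < n_stall_notifiers; i++)
		stall_chain[i](stall_jiffies);
	panic_on_stall_check();
}

/* Example out-of-tree debug notifier, e.g. dumping some lock state. */
static void dump_debug_state(unsigned long stall_jiffies)
{
	(void)stall_jiffies;
	console_puts("debug state dump");
}
```

In this model, stall_notifier_register(dump_debug_state) returns -EOPNOTSUPP until both config_stall_notifier and boot_param_stall_notifiers are true, and report_stall() always emits the core report before any notifier runs. The trade-off raised above still applies: if the core printk()s perturb the state a notifier is trying to capture, this ordering gives up that state in exchange for a guaranteed core report.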