On Mon, Sep 11, 2023 at 07:54:52PM -0400, Liam R. Howlett wrote: > * Paul E. McKenney <paulmck@xxxxxxxxxx> [230906 14:03]: > > On Wed, Sep 06, 2023 at 01:29:54PM -0400, Liam R. Howlett wrote: > > > * Paul E. McKenney <paulmck@xxxxxxxxxx> [230906 13:24]: > > > > On Wed, Sep 06, 2023 at 11:23:25AM -0400, Liam R. Howlett wrote: > > > > > (Adding Paul & Shanker to Cc list.. please see below for why) > > > > > > > > > > Apologies on the late response, I was away and have been struggling to > > > > > get a working PPC32 test environment. > > > > > > > > > > * Geert Uytterhoeven <geert@xxxxxxxxxxxxxx> [230829 12:42]: > > > > > > Hi Liam, > > > > > > > > > > > > On Fri, 18 Aug 2023, Liam R. Howlett wrote: > > > > > > > The current implementation of append may cause duplicate data and/or > > > > > > > incorrect ranges to be returned to a reader during an update. Although > > > > > > > this has not been reported or seen, disable the append write operation > > > > > > > while the tree is in rcu mode out of an abundance of caution. > > > > > > > > > > ... > > > > > > > > > ... > > > > > > > RCU-related configs: > > > > > > > > > > > > $ grep RCU .config > > > > > > # RCU Subsystem > > > > > > CONFIG_TINY_RCU=y I must have been asleep last time I looked at this. I was looking at Tree RCU. Please accept my apologies for my lapse. :-/ However, Tiny RCU's call_rcu() also avoids enabling IRQs, so I would have said the same thing, albeit after looking at a lot less RCU code. TL;DR: 1. Try making the __setup_irq() function's call to mutex_lock() instead be as follows: if (!mutex_trylock(&desc->request_mutex)) mutex_lock(&desc->request_mutex); This might fail if __setup_irq() has other dependencies on a fully operational scheduler. 2. Move that ppc32 call to __setup_irq() much later, most definitely after interrupts have been enabled and the scheduler is fully operational. Invoking mutex_lock() before that time is not a good idea. ;-) For more detail, please read on! > > > > > > # CONFIG_RCU_EXPERT is not set > > > > > > CONFIG_TINY_SRCU=y > > > > > > # end of RCU Subsystem > > > > > > # RCU Debugging > > > > > > # CONFIG_RCU_SCALE_TEST is not set > > > > > > # CONFIG_RCU_TORTURE_TEST is not set > > > > > > # CONFIG_RCU_REF_SCALE_TEST is not set > > > > > > # CONFIG_RCU_TRACE is not set > > > > > > # CONFIG_RCU_EQS_DEBUG is not set > > > > > > # end of RCU Debugging > > > > > > > > > > I used the configuration from debian 8 and ran 'make oldconfig' to build > > > > > my kernel. I have attached the configuration. > > ... > > > > > > It appears to be something to do with struct maple_tree sparse_irqs. If > > > > > you drop the rcu flag from that maple tree, then my configuration boots > > > > > without the warning. > > > > > > > > > > I *think* this is because we will reuse a lot more nodes. And I *think* > > > > > the rcu flag is not needed, since there is a comment about reading the > > > > > tree being protected by the mutex sparse_irq_lock within the > > > > > kernel/irq/irqdesc.c file. Shanker, can you comment on that? > > > > > > > > > > I wonder if there is a limit to the number of RCU free events before > > > > > something is triggered to flush them out which could trigger IRQ > > > > > enabling? Paul, could this be the case? > > > > > > > > Are you asking if call_rcu() will re-enable interrupts in the following > > > > use case? > > > > > > > > local_irq_disable(); > > > > call_rcu(&p->rh, my_cb_func); > > > > local_irq_enable(); > > I am not. > > ... > > > > > > > > > Or am I missing your point? > > > > > > This is very early in the boot sequence when interrupts have not been > > > enabled. What we are seeing is a WARN_ON() that is triggered by > > > interrupts being enabled before they should be enabled. > > > > > > I was wondering if, for example, I called call_rcu() a lot *before* > > > interrupts were enabled, that something could trigger that would either > > > enable interrupts or indicate the task needs rescheduling? > > > > You aren't doing call_rcu() enough to hit OOM, are you? The actual RCU > > callback invocations won't happen until some time after the scheduler > > starts up. > > I am not, it's just a detection of IRQs being enabled early. > > > > > > Specifically the rescheduling part is suspect. I tracked down the call > > > to a mutex_lock() which calls cond_resched(), so could rcu be > > > 'encouraging' the rcu window by a reschedule request? > > > > During boot before interrupts are enabled, RCU has not yet spawned any of > > its kthreads. Therefore, all of its attempts to do wakeups would notice > > a NULL task_struct pointer and refrain from actually doing the wakeup. > > If it did do the wakeup, you would see a NULL-pointer exception. See > > for example, invoke_rcu_core_kthread(), though that won't happen unless > > you booted with rcutree.use_softirq=0. > > > > Besides, since when did doing a wakeup enable interrupts? That would > > make it hard to do wakeups from hardware interrupt handlers, not? > > Taking the mutex lock in kernel/irq/manage.c __setup_irq() is calling a > cond_resched(). > > >From what Michael said [1] in this thread, since something has already > set TIF_NEED_RESCHED, it will eventually enable interrupts on us. > > I've traced this to running call_rcu() in kernel/rcu/tiny.c and > is_idle_task(current) is true, which means rcu runs: > /* force scheduling for rcu_qs() */ > resched_cpu(0); > > the task is set idle in sched_init() -> init_idle() and never changed, > afaict. Yes, because RCU eventually needs a context switch in order to make a grace period happen. And Maple Tree isn't the only thing invoking call_rcu() early, so this has been in place for a very long time. > Removing the RCU option from the maple tree in kernel/irq/irqdesc.c > fixes the issue by avoiding the maple tree running call_rcu(). I am not > sure on the locking of the tree so I feel this change may cause other > issues...also it's before lockdep_init(), so any issue I introduce may > not be detected. > > When CONFIG_DEBUG_ATOMIC_SLEEP is configured, it seems that rcu does the > same thing, but the IRQs are not enabled on return. So, resched_cpu(0) > is called, but the IRQs warning of enabled isn't triggered. I failed to > find a reason why. Here you mean IRQs being enabled upon return from __setup_irq(), correct? But yes, __setup_irq() does call mutex_lock(). Which will call preempt_schedule_common() via might_sleep() and __cond_resched(), even though that is clearly a very bad thing this early. And let's face it, the whole purpose of mutex_lock() is to block when needed. And a big purpose of that might_sleep() is to yell at people invoking this with interrupts disabled. And most of the wrappers around __setup_irq() look to be intended for much later, after interrupts have been enabled. One exception is setup_percpu_irq(), which says that it is for "the early boot process", whenever that might be. But this is only invoked from mips in v6.5. The __request_percpu_irq() function is wrappered by request_percpu_irq(), and its header comment suggests that it is to be called after there are multiple CPUs. I am not seeing a call that is obviously from ppc32, but there are a number of drivers using this with which I am unfamiliar. The request_percpu_nmi() has to be followed up on each CPU using prepare_percpu_nmi() and enable_percpu_nmi(), so it is not clear that it is useful to invoke this before interrupts are enabled. But this is used by ARM, not ppc32 from what I can see. So even though I am not seeing how ppc32 invokes __setup_irq() early, your testing clearly indicates that it is doing so. > I am not entirely sure what makes ppc32 different than other platforms > in that the initial task is configured to an idle task and the first > call to call_rcu (tiny!) would cause the observed behaviour. Maybe something like this in __setup_irq(), right before the mutex_lock()? WARN_ON_ONCE(irqs_disabled()); This will dump the stack trace showing how __setup_irq() is being invoked in early boot on ppc32. Again, given that __setup_irq() invokes mutex_lock(), invoking this function in its current form before interrupts have been enabled is a bad idea. > Non-tiny rcu calls (as I am sure you know, but others may not) > kernel/rcu/tree.c which in turn calls __call_rcu_common(). That > function is far more complex than the tiny version. Maybe it's part of > why we see different behaviour based on platforms? I don't see an idle > check in that version of call_rcu(). Tree RCU can rely on the grace-period kthread starting up later in boot. This kthread will handle any need for a grace period from early boot invocation of call_rcu(). Tiny RCU does not have such a kthread, hence the resched_cpu(). Which is fine, as long as nothing (incorrectly!) invokes a blocking operation before the scheduler has been initialized. > Or maybe PPC32 has something set incorrectly to cause this failure in > early boot and I've just found something that needs to be set > differently? This might be a failure of imagination on my part, but I certainly don't see why ppc32 should expect to get away with invoking __setup_irq() anywhere near that early! ;-) But if ppc32 cannot be adjusted so as to invoke __setup_irq() later, the following change to __setup_irq() might work: if (!mutex_trylock(&desc->request_mutex)) mutex_lock(&desc->request_mutex); The reason for "might" rather than "will" is that __setup_irq() might well have other dependencies on the scheduler being up and running. It would be cleaner for that ppc32 call to __setup_irq() to be moved later, after the scheduler is up and running. > > But why not put some WARN_ON_ONCE(!irqs_disabled()) calls in the areas > > of greatest suspicion, starting from the stack trace generated by that > > mutex_lock()? A stray interrupt-enable could be pretty much anywhere. > > > > But where are those call_rcu() invocations? Before rcu_init()? > > During init_IRQ(), which is after rcu_init() but before rcu_init_nohz(), > srcu_init(), and softirq_init() in init/main.c start_kernel(). That is a perfectly legal time and place for a call_rcu() invocation. It is also legal before rcu_init(), but it has to be late enough that things like per-CPU variables are accessible. Which is still very early. > > Presumably before init is spawned and the early_init() calls. > > > > And what is the RCU-related Kconfig and boot-parameter setup? > > The .config was attached to the email I sent, and it matches what was > quoted above in the "RCU-related configs" section. > > [1] https://lore.kernel.org/linux-mm/87v8cv22jh.fsf@mail.lhotse/ Again, apologies for having misread that .config snippet!!! Thanx, Paul