Le 12/09/2023 à 10:14, Paul E. McKenney a écrit : > On Mon, Sep 11, 2023 at 07:54:52PM -0400, Liam R. Howlett wrote: >> * Paul E. McKenney <paulmck@xxxxxxxxxx> [230906 14:03]: >>> On Wed, Sep 06, 2023 at 01:29:54PM -0400, Liam R. Howlett wrote: >>>> * Paul E. McKenney <paulmck@xxxxxxxxxx> [230906 13:24]: >>>>> On Wed, Sep 06, 2023 at 11:23:25AM -0400, Liam R. Howlett wrote: >>>>>> (Adding Paul & Shanker to Cc list.. please see below for why) >>>>>> >>>>>> Apologies on the late response, I was away and have been struggling to >>>>>> get a working PPC32 test environment. >>>>>> >>>>>> * Geert Uytterhoeven <geert@xxxxxxxxxxxxxx> [230829 12:42]: >>>>>>> Hi Liam, >>>>>>> >>>>>>> On Fri, 18 Aug 2023, Liam R. Howlett wrote: >>>>>>>> The current implementation of append may cause duplicate data and/or >>>>>>>> incorrect ranges to be returned to a reader during an update. Although >>>>>>>> this has not been reported or seen, disable the append write operation >>>>>>>> while the tree is in rcu mode out of an abundance of caution. >>>>>> >>>>>> ... >>>>>>>> >> >> ... >> >>>>>>> RCU-related configs: >>>>>>> >>>>>>> $ grep RCU .config >>>>>>> # RCU Subsystem >>>>>>> CONFIG_TINY_RCU=y > > I must have been asleep last time I looked at this. I was looking at > Tree RCU. Please accept my apologies for my lapse. :-/ > > However, Tiny RCU's call_rcu() also avoids enabling IRQs, so I would > have said the same thing, albeit after looking at a lot less RCU code. > > TL;DR: > > 1. Try making the __setup_irq() function's call to mutex_lock() > instead be as follows: > > if (!mutex_trylock(&desc->request_mutex)) > mutex_lock(&desc->request_mutex); > > This might fail if __setup_irq() has other dependencies on a > fully operational scheduler. > > 2. Move that ppc32 call to __setup_irq() much later, most definitely > after interrupts have been enabled and the scheduler is fully > operational. Invoking mutex_lock() before that time is not a > good idea. ;-) > > For more detail, please read on! > >>>>>>> # CONFIG_RCU_EXPERT is not set >>>>>>> CONFIG_TINY_SRCU=y >>>>>>> # end of RCU Subsystem >>>>>>> # RCU Debugging >>>>>>> # CONFIG_RCU_SCALE_TEST is not set >>>>>>> # CONFIG_RCU_TORTURE_TEST is not set >>>>>>> # CONFIG_RCU_REF_SCALE_TEST is not set >>>>>>> # CONFIG_RCU_TRACE is not set >>>>>>> # CONFIG_RCU_EQS_DEBUG is not set >>>>>>> # end of RCU Debugging >>>>>> >>>>>> I used the configuration from debian 8 and ran 'make oldconfig' to build >>>>>> my kernel. I have attached the configuration. >> >> ... >> >>>>>> It appears to be something to do with struct maple_tree sparse_irqs. If >>>>>> you drop the rcu flag from that maple tree, then my configuration boots >>>>>> without the warning. >>>>>> >>>>>> I *think* this is because we will reuse a lot more nodes. And I *think* >>>>>> the rcu flag is not needed, since there is a comment about reading the >>>>>> tree being protected by the mutex sparse_irq_lock within the >>>>>> kernel/irq/irqdesc.c file. Shanker, can you comment on that? >>>>>> >>>>>> I wonder if there is a limit to the number of RCU free events before >>>>>> something is triggered to flush them out which could trigger IRQ >>>>>> enabling? Paul, could this be the case? >>>>> >>>>> Are you asking if call_rcu() will re-enable interrupts in the following >>>>> use case? >>>>> >>>>> local_irq_disable(); >>>>> call_rcu(&p->rh, my_cb_func); >>>>> local_irq_enable(); >> >> I am not. >> >> ... >> >>>>> >>>>> Or am I missing your point? >>>> >>>> This is very early in the boot sequence when interrupts have not been >>>> enabled. What we are seeing is a WARN_ON() that is triggered by >>>> interrupts being enabled before they should be enabled. >>>> >>>> I was wondering if, for example, I called call_rcu() a lot *before* >>>> interrupts were enabled, that something could trigger that would either >>>> enable interrupts or indicate the task needs rescheduling? >>> >>> You aren't doing call_rcu() enough to hit OOM, are you? The actual RCU >>> callback invocations won't happen until some time after the scheduler >>> starts up. >> >> I am not, it's just a detection of IRQs being enabled early. >> >>> >>>> Specifically the rescheduling part is suspect. I tracked down the call >>>> to a mutex_lock() which calls cond_resched(), so could rcu be >>>> 'encouraging' the rcu window by a reschedule request? >>> >>> During boot before interrupts are enabled, RCU has not yet spawned any of >>> its kthreads. Therefore, all of its attempts to do wakeups would notice >>> a NULL task_struct pointer and refrain from actually doing the wakeup. >>> If it did do the wakeup, you would see a NULL-pointer exception. See >>> for example, invoke_rcu_core_kthread(), though that won't happen unless >>> you booted with rcutree.use_softirq=0. >>> >>> Besides, since when did doing a wakeup enable interrupts? That would >>> make it hard to do wakeups from hardware interrupt handlers, not? >> >> Taking the mutex lock in kernel/irq/manage.c __setup_irq() is calling a >> cond_resched(). >> >> >From what Michael said [1] in this thread, since something has already >> set TIF_NEED_RESCHED, it will eventually enable interrupts on us. >> >> I've traced this to running call_rcu() in kernel/rcu/tiny.c and >> is_idle_task(current) is true, which means rcu runs: >> /* force scheduling for rcu_qs() */ >> resched_cpu(0); >> >> the task is set idle in sched_init() -> init_idle() and never changed, >> afaict. > > Yes, because RCU eventually needs a context switch in order to make > a grace period happen. And Maple Tree isn't the only thing invoking > call_rcu() early, so this has been in place for a very long time. > >> Removing the RCU option from the maple tree in kernel/irq/irqdesc.c >> fixes the issue by avoiding the maple tree running call_rcu(). I am not >> sure on the locking of the tree so I feel this change may cause other >> issues...also it's before lockdep_init(), so any issue I introduce may >> not be detected. >> >> When CONFIG_DEBUG_ATOMIC_SLEEP is configured, it seems that rcu does the >> same thing, but the IRQs are not enabled on return. So, resched_cpu(0) >> is called, but the IRQs warning of enabled isn't triggered. I failed to >> find a reason why. > > Here you mean IRQs being enabled upon return from __setup_irq(), correct? > > But yes, __setup_irq() does call mutex_lock(). Which will call > preempt_schedule_common() via might_sleep() and __cond_resched(), even > though that is clearly a very bad thing this early. And let's face it, > the whole purpose of mutex_lock() is to block when needed. And a big > purpose of that might_sleep() is to yell at people invoking this with > interrupts disabled. > > And most of the wrappers around __setup_irq() look to be intended > for much later, after interrupts have been enabled. One exception is > setup_percpu_irq(), which says that it is for "the early boot process", > whenever that might be. But this is only invoked from mips in v6.5. > > The __request_percpu_irq() function is wrappered by request_percpu_irq(), > and its header comment suggests that it is to be called after there are > multiple CPUs. I am not seeing a call that is obviously from ppc32, > but there are a number of drivers using this with which I am unfamiliar. > > The request_percpu_nmi() has to be followed up on each CPU using > prepare_percpu_nmi() and enable_percpu_nmi(), so it is not clear that > it is useful to invoke this before interrupts are enabled. But this is > used by ARM, not ppc32 from what I can see. > > So even though I am not seeing how ppc32 invokes __setup_irq() early, > your testing clearly indicates that it is doing so. > >> I am not entirely sure what makes ppc32 different than other platforms >> in that the initial task is configured to an idle task and the first >> call to call_rcu (tiny!) would cause the observed behaviour. > > Maybe something like this in __setup_irq(), right before the > mutex_lock()? > > WARN_ON_ONCE(irqs_disabled()); > > This will dump the stack trace showing how __setup_irq() is being invoked > in early boot on ppc32. > > Again, given that __setup_irq() invokes mutex_lock(), invoking this > function in its current form before interrupts have been enabled is a > bad idea. > No trigger of that WARN_ON() I added in __setup_irq() as instructed above, still getting (pmac32_defconfig on MAC99 QEMU) ------------[ cut here ]------------ Interrupts were enabled early WARNING: CPU: 0 PID: 0 at init/main.c:992 start_kernel+0x4d8/0x5c0 Modules linked in: CPU: 0 PID: 0 Comm: swapper Not tainted 6.6.0-rc1-dirty #481 Hardware name: PowerMac3,1 7400 0xc0209 PowerMac NIP: c0a6052c LR: c0a6052c CTR: 00000000 REGS: c0c4dee0 TRAP: 0700 Not tainted (6.6.0-rc1-dirty) MSR: 00029032 <EE,ME,IR,DR,RI> CR: 24000282 XER: 20000000 GPR00: c0a6052c c0c4dfa0 c0b92580 0000001d c0b9d128 00000001 c0b9d148 3ffffdff GPR08: c0ba80f0 00000000 00000000 3ffffe00 44000282 00000000 00000000 0199abfc GPR16: 0199b0a4 7fde7fa4 7fc5ac0c 000000bb 41000000 01c690c8 c0b92014 c09b4bdc GPR24: c0c55220 c0ac0000 00000000 efff9109 efff9100 0000000a c0c6d000 c0b920a0 NIP [c0a6052c] start_kernel+0x4d8/0x5c0 LR [c0a6052c] start_kernel+0x4d8/0x5c0 Call Trace: [c0c4dfa0] [c0a6052c] start_kernel+0x4d8/0x5c0 (unreliable) [c0c4dff0] [00003540] 0x3540 Code: 480037b1 48023c05 4bab88ed 90620260 480139e9 4b657ced 7d2000a6 71298000 41a20014 3c60c09a 3863b78c 4b5e9ccd <0fe00000> 39200000 99380008 7d2000a6 ---[ end trace 0000000000000000 ]---