On Fri, Jun 3, 2022 at 11:32 AM Arnd Bergmann <arnd@xxxxxxxx> wrote: > > On Fri, Jun 3, 2022 at 10:54 AM Yegor Yefremov > <yegorslists@xxxxxxxxxxxxxx> wrote: > > On Thu, Jun 2, 2022 at 2:27 PM Yegor Yefremov <yegorslists@xxxxxxxxxxxxxx> wrote: > > > On Thu, Jun 2, 2022 at 12:37 PM Ard Biesheuvel <ardb@xxxxxxxxxx> wrote: > > > > > That confirms my statement that smp.c cannot be the culprit, and > > > > appears to exonerate the pure asm pieces. I wonder if this is related > > > > to insufficient asm constraints on the C helpers, or just the cost > > > > model taking different decisions because the inline asm string is much > > > > longer. In any case, this opens up a couple of avenues we could > > > > explore to narrow this down further. > > > > > > > > As a quick check, can you try the below snippet applied onto the > > > > broken current.h build? > > > > > > > > --- a/arch/arm/include/asm/current.h > > > > +++ b/arch/arm/include/asm/current.h > > > > @@ -53,7 +53,8 @@ static __always_inline __attribute_const__ struct > > > > task_struct *get_current(void) > > > > " b . + (2b - 0b) \n\t" > > > > " .popsection \n\t" > > > > #endif > > > > - : "=r"(cur)); > > > > + : "=r"(cur) > > > > + : "Q" (*(const unsigned long *)current_stack_pointer)); > > > > > > Where is the current_stack_pointer defined? > > > > > > > #elif __LINUX_ARM_ARCH__>= 7 || \ > > > > !defined(CONFIG_ARM_HAS_GROUP_RELOCS) || \ > > > > (defined(MODULE) && defined(CONFIG_ARM_MODULE_PLTS)) > > > > > > > > Given that the problematic sequence appears to be in C code, could you > > > > please confirm whether or not the stall is reproducible when all the > > > > pieces that are used by the CAN stack (musb, slcan, ftdio-sio, etc) > > > > are built into the kernel rather than built as modules? Also, which > > > > GCC version are you using? > > > > > > For now, the CAN stack parts are built as modules. I'll try to compile them in. > > > > > > I'm using GCC 10.x > > > > I have tried your patch (see the attachment) and the system stalls. > > This is with only get_current() patched on top of the working > f0191ea5c2e5 ("[PART 1] ARM: implement THREAD_INFO_IN_TASK for > uniprocessor systems"), right? > > My best theory right now is that something in get_currnent() is wrong that > causes it to return the wrong task pointer, which in turn leads to > current->preempt_count to get out of sync. This may be related to the cppi41 > dmaengine tasklet and effectively disables further softirqs including the timer > that triggers the RCU grace period. > > When we finally switch tasks to the cpufreq worker thread, softirqs > can happen again because of the task switch, and at the next IRQ > the timer detects the stall. > > > Will try GCC 11.x and also compiled-in drivers. > > Ok. Maybe make sure all drivers are built-in here. I see both the CAN > layer and the cppi41 driver use softirqs, so to be on the safe side, > try to get to a running kernel that has no modules loaded at all at > the time you expect the stall. > > > One thing that could possibly go wrong with get_current() would be that > it fails to get patched for some reason, or it gets patched only after > it was already called. Since you run on an ARMv7 CPU as opposed to > an actual OMAP2410/ARM1136r0, it would then try to load the > variable from the uninitialized TPIDRURO register. If that happens, > the one-liner below should tell you exactly where, by triggering an > Oops. You can apply the patch on top for testing, it should have no > other effects if the patching part works correctly. > > Anrd > > 8<--- > > diff --git a/arch/arm/include/asm/current.h b/arch/arm/include/asm/current.h > index 2f9d79214b25..2a15832793c4 100644 > --- a/arch/arm/include/asm/current.h > +++ b/arch/arm/include/asm/current.h > @@ -33,7 +33,7 @@ static inline __attribute_const__ struct task_struct > *get_current(void) > */ > cur = __builtin_thread_pointer(); > #elif defined(CONFIG_CURRENT_POINTER_IN_TPIDRURO) || defined(CONFIG_SMP) > - asm("0: mrc p15, 0, %0, c13, c0, 3 \n\t" > + asm("0: .long 0xe7f001f2 \n\t" // BUG() trap > #ifdef CONFIG_CPU_V6 > "1: \n\t" > " .subsection 1 \n\t" With compiled-in drivers the system doesn't stall. All other tests and related outputs will come next week. Have a nice weekend. Yegor