On Fri, Jun 3, 2022 at 10:54 AM Yegor Yefremov <yegorslists@xxxxxxxxxxxxxx> wrote: > On Thu, Jun 2, 2022 at 2:27 PM Yegor Yefremov <yegorslists@xxxxxxxxxxxxxx> wrote: > > On Thu, Jun 2, 2022 at 12:37 PM Ard Biesheuvel <ardb@xxxxxxxxxx> wrote: > > > That confirms my statement that smp.c cannot be the culprit, and > > > appears to exonerate the pure asm pieces. I wonder if this is related > > > to insufficient asm constraints on the C helpers, or just the cost > > > model taking different decisions because the inline asm string is much > > > longer. In any case, this opens up a couple of avenues we could > > > explore to narrow this down further. > > > > > > As a quick check, can you try the below snippet applied onto the > > > broken current.h build? > > > > > > --- a/arch/arm/include/asm/current.h > > > +++ b/arch/arm/include/asm/current.h > > > @@ -53,7 +53,8 @@ static __always_inline __attribute_const__ struct > > > task_struct *get_current(void) > > > " b . + (2b - 0b) \n\t" > > > " .popsection \n\t" > > > #endif > > > - : "=r"(cur)); > > > + : "=r"(cur) > > > + : "Q" (*(const unsigned long *)current_stack_pointer)); > > > > Where is the current_stack_pointer defined? > > > > > #elif __LINUX_ARM_ARCH__>= 7 || \ > > > !defined(CONFIG_ARM_HAS_GROUP_RELOCS) || \ > > > (defined(MODULE) && defined(CONFIG_ARM_MODULE_PLTS)) > > > > > > Given that the problematic sequence appears to be in C code, could you > > > please confirm whether or not the stall is reproducible when all the > > > pieces that are used by the CAN stack (musb, slcan, ftdio-sio, etc) > > > are built into the kernel rather than built as modules? Also, which > > > GCC version are you using? > > > > For now, the CAN stack parts are built as modules. I'll try to compile them in. > > > > I'm using GCC 10.x > > I have tried your patch (see the attachment) and the system stalls. This is with only get_current() patched on top of the working f0191ea5c2e5 ("[PART 1] ARM: implement THREAD_INFO_IN_TASK for uniprocessor systems"), right? My best theory right now is that something in get_currnent() is wrong that causes it to return the wrong task pointer, which in turn leads to current->preempt_count to get out of sync. This may be related to the cppi41 dmaengine tasklet and effectively disables further softirqs including the timer that triggers the RCU grace period. When we finally switch tasks to the cpufreq worker thread, softirqs can happen again because of the task switch, and at the next IRQ the timer detects the stall. > Will try GCC 11.x and also compiled-in drivers. Ok. Maybe make sure all drivers are built-in here. I see both the CAN layer and the cppi41 driver use softirqs, so to be on the safe side, try to get to a running kernel that has no modules loaded at all at the time you expect the stall. One thing that could possibly go wrong with get_current() would be that it fails to get patched for some reason, or it gets patched only after it was already called. Since you run on an ARMv7 CPU as opposed to an actual OMAP2410/ARM1136r0, it would then try to load the variable from the uninitialized TPIDRURO register. If that happens, the one-liner below should tell you exactly where, by triggering an Oops. You can apply the patch on top for testing, it should have no other effects if the patching part works correctly. Anrd 8<--- diff --git a/arch/arm/include/asm/current.h b/arch/arm/include/asm/current.h index 2f9d79214b25..2a15832793c4 100644 --- a/arch/arm/include/asm/current.h +++ b/arch/arm/include/asm/current.h @@ -33,7 +33,7 @@ static inline __attribute_const__ struct task_struct *get_current(void) */ cur = __builtin_thread_pointer(); #elif defined(CONFIG_CURRENT_POINTER_IN_TPIDRURO) || defined(CONFIG_SMP) - asm("0: mrc p15, 0, %0, c13, c0, 3 \n\t" + asm("0: .long 0xe7f001f2 \n\t" // BUG() trap #ifdef CONFIG_CPU_V6 "1: \n\t" " .subsection 1 \n\t"