On Sun, Jun 5, 2022 at 4:59 PM Ard Biesheuvel <ardb@xxxxxxxxxx> wrote:
>
> On Fri, 3 Jun 2022 at 22:47, Arnd Bergmann <arnd@xxxxxxxx> wrote:
> >
> > On Fri, Jun 3, 2022 at 9:11 PM Yegor Yefremov
> > <yegorslists@xxxxxxxxxxxxxx> wrote:
> > >
> > > With compiled-in drivers the system doesn't stall. All other tests and
> > > related outputs will come next week.
> >
> > Ah, nice!
> >
> > It's probably a reasonable assumption that the smp-patched get_current()
> > is (at least sometimes) broken in modules but working in the kernel itself.
> > I suppose that means in the worst case we can hot-fix the issue by
> > having an 'extern' version of get_current() for the case of
> > armv6+smp+module ;-)
> >
>
> I've coded something up along those lines, and pushed it to my
> am335x-stall-test branch.
>
> > Maybe start with the ".long 0xe7f001f2" hack I suggested in my last
> > mail. If that gives you an oops for the module case, then we know
> > that the patching doesn't work at all and you don't have to try anything
> > else, otherwise it's more likely that an incorrect instruction sequence
> > is patched in.
> >
>
> Yeah, I'd be really surprised if the patching misses some occurrences,
> so I have no clue what is going on here.
>
> Yegor, can you please try my branch with the original config (i.e.,
> slcan and ftdio as modules)
>
> https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=am335x-stall-test

@Arnd: I have applied your patch with this change:

asm("0: .long 0xe7f001f2 \n\t" // BUG() trap

But it revealed nothing new:

[ 50.754130] rcu: INFO: rcu_sched self-detected stall on CPU
[ 50.760834] rcu: 0-...!: (2600 ticks this GP) idle=ec9/1/0x40000004 softirq=1852/1852 fqs=0
[ 50.770407] (t=2600 jiffies g=2577 q=17)
[ 50.775046] rcu: rcu_sched kthread timer wakeup didn't happen for 2599 jiffies! g2577 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[ 50.786961] rcu: Possible timer handling issue on cpu=0 timer-softirq=872
[ 50.794429] rcu: rcu_sched kthread starved for 2600 jiffies! g2577 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
[ 50.805403] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 50.814927] rcu: RCU grace-period kthread stack dump:
[ 50.820464] task:rcu_sched state:I stack: 0 pid: 10 ppid: 2 flags:0x00000000
[ 50.830019] [<c0b683d4>] (__schedule) from [<c0b68d18>] (schedule+0x54/0xe8)
[ 50.838470] [<c0b68d18>] (schedule) from [<c0b6f51c>] (schedule_timeout+0xa8/0x210)
[ 50.847208] [<c0b6f51c>] (schedule_timeout) from [<c01d85b4>] (rcu_gp_fqs_loop+0x118/0x6b4)
[ 50.856631] [<c01d85b4>] (rcu_gp_fqs_loop) from [<c01dc4e4>] (rcu_gp_kthread+0x138/0x30c)
[ 50.865832] [<c01dc4e4>] (rcu_gp_kthread) from [<c0164df8>] (kthread+0x13c/0x164)
[ 50.874315] [<c0164df8>] (kthread) from [<c0100140>] (ret_from_fork+0x14/0x34)
[ 50.882477] rcu: Stack dump where RCU GP kthread last ran:
[ 50.888512] NMI backtrace for cpu 0
[ 50.892575] CPU: 0 PID: 62 Comm: kworker/0:12 Not tainted 5.16.0-rc1 #1
[ 50.899912] Hardware name: Generic AM33XX (Flattened Device Tree)
[ 50.906610] Workqueue: events dbs_work_handler
[ 50.912202] [<c0111600>] (unwind_backtrace) from [<c010bff4>] (show_stack+0x10/0x14)
[ 50.921035] [<c010bff4>] (show_stack) from [<d03919f0>] (0xd03919f0)
[ 50.928943] NMI backtrace for cpu 0
[ 50.933084] CPU: 0 PID: 62 Comm: kworker/0:12 Not tainted 5.16.0-rc1 #1
[ 50.940419] Hardware name: Generic AM33XX (Flattened Device Tree)
[ 50.947083] Workqueue: events dbs_work_handler
[ 50.952574] [<c0111600>] (unwind_backtrace) from [<c010bff4>] (show_stack+0x10/0x14)
[ 50.961334] [<c010bff4>] (show_stack) from [<d03919f0>] (0xd03919f0)

@Ard: I have tried your branch (21b6671c82d4df52ea0c7837705331acb375c5c8).
The system still stalls.

Yegor