Re: Kernel instability using Marvell mv88e6xxx DSA

Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx> · Fri, 5 Jul 2024 16:56:17 +0200

On 2024-07-01 08:23:26 [+0200], Riccardo Laiolo wrote:
> Hi, sorry for my late reply,
Hi,

> Enabling CONFIG_DEBUG_ATOMIC_SLEEP and CONFIG_PROVE_LOCKING the kernel image became too big and overlapped the DT once loaded.
> So I went through menuconfig and disabled many unused drivers and features.
> 
> Then for some days, I couldn't get any kernel panics at all (I think I hadn't left the system running long enough).
> In the past week, I've collected the three attached panic logs.
> I can't see any correlation among the logs. I'd say there are some hardware issues,
> but it couldn't be the case since I know the same board works fine with a non-RT image.
> 
> [ 2857.996307] ------------[ cut here ]------------
> [ 2857.996316] Current state: 0
> [ 2857.996336] WARNING: CPU: 0 PID: 0 at kernel/time/clockevents.c:319 clockevents_program_event+0x124/0x130

This is odd. According to this warning, the clockevent device is unused
(CLOCK_EVT_STATE_DETACHED).

> [ 2857.996479] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff800009d41500
> [ 2857.996488] Call trace:
> [ 2857.996489]  clockevents_program_event+0x124/0x130
> [ 2857.996493]  tick_program_event+0x58/0xa4
> [ 2857.996499]  hrtimer_start_range_ns+0x304/0x34c
> [ 2857.996506]  tick_nohz_stop_tick+0x108/0x1d0
> [ 2857.996511]  tick_nohz_idle_stop_tick+0x78/0xd4
> [ 2857.996516]  do_idle+0x244/0x310
…
> [ 2857.996592] Unable to handle kernel execute from non-executable memory at virtual address ffff80000aa4bc20
…
> [ 2857.996666] pstate: a00000c5 (NzCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [ 2857.996671] pc : __key.3+0x0/0x10
> [ 2857.996678] lr : clockevents_program_event+0xa8/0x130
…
> [ 2857.996747] Call trace:
> [ 2857.996748]  __key.3+0x0/0x10
> [ 2857.996753]  tick_program_event+0x58/0xa4
> [ 2857.996757]  hrtimer_start_range_ns+0x304/0x34c
> [ 2857.996763]  tick_nohz_stop_tick+0x108/0x1d0
> [ 2857.996768]  tick_nohz_idle_stop_tick+0x78/0xd4
> [ 2857.996773]  do_idle+0x244/0x310
And this occurred right after. That `__key' lives in the data section,
not in .text, so the CPU tried to execute data. I guess it jumped to the
wrong thing via a bogus callback pointer, in which case the whole
struct clock_event_device is probably garbage.
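
To spell out why the warning and the fault can come back to back: as far
as I remember the 6.1 code, the state check in
clockevents_program_event() is a WARN_ONCE(), which reports and carries
on, and the function then calls dev->set_next_event() (or
set_next_ktime()) through a pointer taken from the same struct. A rough
userspace sketch of that flow, where the fake_* and demo_* names are made
up for illustration and only the enum ordering is taken from the kernel:

```c
#include <stdbool.h>
#include <stdio.h>

/* Same ordering as enum clock_event_state in include/linux/clockchips.h. */
enum clock_event_state {
	CLOCK_EVT_STATE_DETACHED,	/* 0: device not used by the core */
	CLOCK_EVT_STATE_SHUTDOWN,
	CLOCK_EVT_STATE_PERIODIC,
	CLOCK_EVT_STATE_ONESHOT,
	CLOCK_EVT_STATE_ONESHOT_STOPPED,
};

/* Hypothetical, cut-down stand-in for struct clock_event_device. */
struct fake_clock_event_device {
	int (*set_next_event)(unsigned long evt,
			      struct fake_clock_event_device *dev);
	enum clock_event_state state;
};

static bool warned;		/* set when the state check fires */
static unsigned long last_evt;	/* written by the demo callback */

/* Hypothetical callback, stands in for the timer driver's handler. */
static int demo_set_next_event(unsigned long evt,
			       struct fake_clock_event_device *dev)
{
	(void)dev;
	last_evt = evt;
	return 0;
}

/* Simplified model of the tail of clockevents_program_event(). */
static int fake_program_event(struct fake_clock_event_device *dev,
			      unsigned long evt)
{
	if (dev->state == CLOCK_EVT_STATE_SHUTDOWN)
		return 0;

	/* Like WARN_ONCE(): report the odd state, but do not bail out. */
	if (dev->state != CLOCK_EVT_STATE_ONESHOT) {
		warned = true;
		fprintf(stderr, "Current state: %d\n", dev->state);
	}

	/* The indirect call happens regardless of the warning above. */
	return dev->set_next_event(evt, dev);
}
```

With a sane callback the call simply goes through after the warning. If
the struct has been overwritten, the same call branches to whatever the
pointer field happens to contain.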

> [ 4386.503700] Unable to handle kernel paging request at virtual address 0000000000003fb8
Okay. NULL pointer…
> [ 4386.503733] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000b0
another…
…
> [ 4386.504061] Insufficient stack space to handle exception!
finally an end

> [ 4386.504174] Kernel panic - not syncing: kernel stack overflow
…
> [ 4386.504178] SMP: stopping secondary CPUs
> [ 4387.504180] SMP: failed to stop secondary CPUs 0-2
> [ 4387.504188] Kernel Offset: disabled
> [ 4387.504190] CPU features: 0x00000,00800084,0000420b
> [ 4387.504192] Memory Limit: none
> [ 4387.504197] 
> [ 4387.504198] ================================
> [ 4387.504199] WARNING: inconsistent lock state
> [ 4387.504201] 6.1.55-rt16 #1 Tainted: G        W         
> [ 4387.504203] --------------------------------
This report does not make sense…
…

> [ 1073.126275] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
let me ignore this one
…
> [ 1247.450882] BUG: spinlock bad magic on CPU#1, ktimers/1/25
> [ 1247.450894]  lock: 0xffff0001f6fa8a08, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0

magic zero? Not initialized?
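
For reference, if I read kernel/locking/spinlock_debug.c correctly, an
initialized raw spinlock carries SPINLOCK_MAGIC (0xdead4ead) and, while
unowned, an owner_cpu of -1. The dump above shows .magic 00000000 and
.owner_cpu 0, which is what zeroed memory would look like. A minimal
userspace sketch of that check, with the fake_* names and the cut-down
struct being assumptions made for illustration:

```c
#include <stdbool.h>
#include <string.h>

#define SPINLOCK_MAGIC		0xdead4ead
#define SPINLOCK_OWNER_INIT	((void *)-1L)

/* Hypothetical stand-in for the debug fields of raw_spinlock_t. */
struct fake_raw_spinlock {
	unsigned int magic;
	int owner_cpu;
	void *owner;
};

/* Mirrors what the debug variant of raw_spin_lock_init() sets up. */
static void fake_raw_spin_lock_init(struct fake_raw_spinlock *lock)
{
	lock->magic = SPINLOCK_MAGIC;
	lock->owner = SPINLOCK_OWNER_INIT;
	lock->owner_cpu = -1;
}

/* The "bad magic" report fires when this returns false. */
static bool fake_magic_ok(const struct fake_raw_spinlock *lock)
{
	return lock->magic == SPINLOCK_MAGIC;
}
```

A lock that was initialized and later overwritten with zeros fails the
check in the same way as one that was never initialized, so the dump
alone does not tell the two cases apart.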

…
> [ 1247.450917] Call trace:
…
> [ 1247.450952]  do_raw_spin_lock+0x108/0x130
> [ 1247.450959]  _raw_spin_lock_irqsave+0x78/0xb0
> [ 1247.450965]  rt_spin_lock+0x64/0x10c
> [ 1247.450970]  __run_timers+0x60/0x3c0

This is likely the timer_base::lock, and it is unlikely that this lock
was never initialized.