On 2024-07-01 08:23:26 [+0200], Riccardo Laiolo wrote:
> Hi, sorry for my late reply,

Hi,

> Enabling CONFIG_DEBUG_ATOMIC_SLEEP and CONFIG_PROVE_LOCKING the kernel image became too big and overlapped the DT once loaded.
> So I went through menuconfig and disabled many unused drivers and features.
>
> Then for some days, I couldn't get any kernel panics at all (I think I hadn't left the system running long enough).
> In the past week, I've collected the three attached panic logs.
> I can't see any correlation among the logs. I'd say there are some hardware issues,
> but it couldn't be the case since I know the same board works fine with a non-RT image.
>
> [ 2857.996307] ------------[ cut here ]------------
> [ 2857.996316] Current state: 0
> [ 2857.996336] WARNING: CPU: 0 PID: 0 at kernel/time/clockevents.c:319 clockevents_program_event+0x124/0x130

This is odd. According to this warning, the clockevent device is unused
(CLOCK_EVT_STATE_DETACHED).

> [ 2857.996479] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff800009d41500
> [ 2857.996488] Call trace:
> [ 2857.996489] clockevents_program_event+0x124/0x130
> [ 2857.996493] tick_program_event+0x58/0xa4
> [ 2857.996499] hrtimer_start_range_ns+0x304/0x34c
> [ 2857.996506] tick_nohz_stop_tick+0x108/0x1d0
> [ 2857.996511] tick_nohz_idle_stop_tick+0x78/0xd4
> [ 2857.996516] do_idle+0x244/0x310
…
> [ 2857.996592] Unable to handle kernel execute from non-executable memory at virtual address ffff80000aa4bc20
…
> [ 2857.996666] pstate: a00000c5 (NzCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [ 2857.996671] pc : __key.3+0x0/0x10
> [ 2857.996678] lr : clockevents_program_event+0xa8/0x130
…
> [ 2857.996747] Call trace:
> [ 2857.996748] __key.3+0x0/0x10
> [ 2857.996753] tick_program_event+0x58/0xa4
> [ 2857.996757] hrtimer_start_range_ns+0x304/0x34c
> [ 2857.996763] tick_nohz_stop_tick+0x108/0x1d0
> [ 2857.996768] tick_nohz_idle_stop_tick+0x78/0xd4
> [ 2857.996773] do_idle+0x244/0x310

And this occurred right after.
That `__key' should be in the data section, not .text. I guess it jumped to
the wrong thing, but then the whole struct clock_event_device is probably
garbage.

> [ 4386.503700] Unable to handle kernel paging request at virtual address 0000000000003fb8

Okay. NULL pointer…

> [ 4386.503733] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000b0

another…

…
> [ 4386.504061] Insufficient stack space to handle exception!

finally an end

> [ 4386.504174] Kernel panic - not syncing: kernel stack overflow
…
> [ 4386.504178] SMP: stopping secondary CPUs
> [ 4387.504180] SMP: failed to stop secondary CPUs 0-2
> [ 4387.504188] Kernel Offset: disabled
> [ 4387.504190] CPU features: 0x00000,00800084,0000420b
> [ 4387.504192] Memory Limit: none
> [ 4387.504197]
> [ 4387.504198] ================================
> [ 4387.504199] WARNING: inconsistent lock state
> [ 4387.504201] 6.1.55-rt16 #1 Tainted: G W
> [ 4387.504203] --------------------------------

This report does not make sense…

…
> [ 1073.126275] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:

let me ignore this one

…
> [ 1247.450882] BUG: spinlock bad magic on CPU#1, ktimers/1/25
> [ 1247.450894] lock: 0xffff0001f6fa8a08, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0

magic zero? Not initialized?

…
> [ 1247.450917] Call trace:
…
> [ 1247.450952] do_raw_spin_lock+0x108/0x130
> [ 1247.450959] _raw_spin_lock_irqsave+0x78/0xb0
> [ 1247.450965] rt_spin_lock+0x64/0x10c
> [ 1247.450970] __run_timers+0x60/0x3c0

This is likely the timer_base::lock, which is unlikely to be uninitialized.
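
For reference, the two readings above ("Current state: 0" meaning
CLOCK_EVT_STATE_DETACHED, and ".magic: 00000000" not matching the spinlock
debug magic) can be reproduced with a small log helper. This is only a sketch:
the helper names are made up for illustration, and the enum order and the
SPINLOCK_MAGIC value are copied from my understanding of the 6.1 sources, so
check them against your tree.

```python
import re

# Assumption: value of SPINLOCK_MAGIC used by the spinlock debug code.
SPINLOCK_MAGIC = 0xDEAD4EAD

# Assumption: order of enum clock_event_state in include/linux/clockchips.h.
CLOCK_EVT_STATES = [
    "CLOCK_EVT_STATE_DETACHED",
    "CLOCK_EVT_STATE_SHUTDOWN",
    "CLOCK_EVT_STATE_PERIODIC",
    "CLOCK_EVT_STATE_ONESHOT",
    "CLOCK_EVT_STATE_ONESHOT_STOPPED",
]


def decode_clockevent_state(line):
    """Map a 'Current state: N' log line to the enum name, or None."""
    m = re.search(r"Current state: (\d+)", line)
    if not m:
        return None
    n = int(m.group(1))
    if n < len(CLOCK_EVT_STATES):
        return CLOCK_EVT_STATES[n]
    return "unknown(%d)" % n


def decode_lock_magic(line):
    """Classify the .magic field of a 'spinlock bad magic' lock line."""
    m = re.search(r"\.magic: ([0-9a-fA-F]+)", line)
    if not m:
        return None
    magic = int(m.group(1), 16)
    if magic == SPINLOCK_MAGIC:
        return "ok"
    if magic == 0:
        # Either never initialized or zeroed later; the log cannot tell which.
        return "zero"
    return "corrupted"
```

Fed the lines from the logs, decode_clockevent_state() returns
"CLOCK_EVT_STATE_DETACHED" for "Current state: 0" and decode_lock_magic()
returns "zero" for the ktimers/1 lock line. A zero magic alone does not say
whether the lock was never initialized or was overwritten afterwards.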