Update:
The cause of the CPU idle issue was found.
It was related to our SoC changes in the timer interrupt connectivity (which makes sense :-)).
Merry Xmas, all.

> -----Original Message-----
> From: Lior Weintraub
> Sent: Sunday, December 24, 2023 9:12 PM
> To: hs@xxxxxxx; Dirk Behme <dirk.behme@xxxxxxxxx>
> Cc: linux-embedded@xxxxxxxxxxxxxxx
> Subject: RE: Debugging early SError exception
>
> Update:
> The UART issue ("unable to open an initial console") was resolved.
> I was missing CONFIG_SERIAL_8250_DW=y in my config.
>
> The only issue left now is that the CPU idles in "wfi" and no interrupts arrive.
>
> > -----Original Message-----
> > From: Lior Weintraub
> > Sent: Sunday, December 24, 2023 5:42 PM
> > To: hs@xxxxxxx; Dirk Behme <dirk.behme@xxxxxxxxx>
> > Cc: linux-embedded@xxxxxxxxxxxxxxx
> > Subject: RE: Debugging early SError exception
> >
> > Hi,
> >
> > The GICv3 issue was resolved after:
> > 1. Setting bit 0 and bit 3 on ICC_SRE_EL3 (we don't have virtualization support
> > and hence ICC_SRE_EL2 is not supported).
> > 2. Powering up the GICR at EL3.
> >
> > The earlycon issue was resolved after:
> > 1. Adding "earlycon=uart8250,mmio32,0xd000307000,115200n8" to the boot args.
> > 2. Adding "CONFIG_SERIAL_8250_CONSOLE=y" to the config (previously it had only
> > CONFIG_SERIAL_8250=y).
> >
> > Now I face a new issue:
> > Linux boot hangs on "wait for interrupt" at cpu_do_idle.
> >
> > The program counter is stuck at 0xffff8000805ae45c.
> > ffff8000805ae454 <cpu_do_idle>:
> > ffff8000805ae454: d5033f9f dsb sy
> > ffff8000805ae458: d503207f wfi
> > ffff8000805ae45c: d65f03c0 ret
> >
> > I think that something is wrong with the timer or GIC settings and as a result
> > the scheduler doesn't get the interrupts (timer ticks).
> >
> > Additional info that might be relevant to this issue:
> > The emulation platform runs at about 2.8MHz.
> > The CNTFRQ_EL0 is set to 2M (because the emulation platform running freq
> > varies between 1.9-2.8MHz).
> > The reason for those settings is to allow Linux to run as it would on the "real" > > world. > > > > It is my understanding that there are 2 issues here: > > 1. Something is wrong with Timers\Interrupt setting (note that same > > configuration runs correctly on QEMU) > > 2. Something is wrong with initramfs - according kernel source it seems to > fail > > to open "/dev/console" > > > > The full Linux boot log: > > Booting Linux on physical CPU 0x0000000000 [0x410fd034] > > Linux version 6.5.0 (pliops@dev-liorw) (aarch64-buildroot-linux-gnu- > > gcc.br_real (Buildroot 2023.02.1-95-g8391404e23) 11.3.0, GNU ld (GNU > > Binuti) 2.38) #112 SMP Sun Dec 24 15:44:56 IST 2023 > > Machine model: Pliops Spider MK-I EVK > > earlycon: uart8250 at MMIO32 0x000000d000307000 (options > '115200n8') > > printk: bootconsole [uart8250] enabled > > efi: UEFI not found. > > Zone ranges: > > DMA [mem 0x0000000000000000-0x000000002fffffff] > > DMA32 empty > > Normal empty > > Movable zone start for each node > > Early memory node ranges > > node 0: [mem 0x0000000000000000-0x000000002fffffff] > > Initmem setup node 0 [mem 0x0000000000000000-0x000000002fffffff] > > percpu: Embedded 25 pages/cpu s64800 r8192 d29408 u102400 > > Detected VIPT I-cache on CPU0 > > CPU features: detected: GIC system register CPU interface > > CPU features: detected: ARM erratum 845719 > > alternatives: applying boot alternatives > > Kernel command line: console=ttyS0,115200n8 > > earlycon=uart8250,mmio32,0xd000307000,115200n8 > > Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes, linear) > > Inode-cache hash table entries: 65536 (order: 7, 524288 bytes, linear) > > Built 1 zonelists, mobility grouping on. Total pages: 193536 > > mem auto-init: stack:off, heap alloc:off, heap free:off > > software IO TLB: area num 1. 
> > software IO TLB: mapped [mem 0x000000002b080000- > > 0x000000002f080000] (64MB) > > Memory: 689240K/786432K available (5824K kernel code, 1186K rwdata, > > 1612K rodata, 1600K init, 400K bss, 97192K reserved, 0K cma-reserved) > > SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1 > > trace event string verifier disabled > > rcu: Hierarchical RCU implementation. > > rcu: RCU event tracing is enabled. > > rcu: RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=1. > > rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies. > > rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1 > > NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0 > > GICv3: 96 SPIs implemented > > GICv3: 0 Extended SPIs implemented > > Root IRQ handler: gic_handle_irq > > GICv3: GICv3 features: 16 PPIs > > GICv3: CPU0: found redistributor 0 region 0:0x000000e000060000 > > ITS [mem 0xe000040000-0xe00005ffff] > > ITS@0x000000e000040000: allocated 8192 Devices @a0000 (indirect, esz > 8, > > psz 64K, shr 1) > > ITS@0x000000e000040000: allocated 32768 Interrupt Collections @b0000 > > (flat, esz 2, psz 64K, shr 1) > > GICv3: Expected reserved range > > [0x00000000000c0000:0x00000000000cffff], not found > > GICv3: using LPI property table @0x00000000000c0000 > > GICv3: CPU0: Booted with LPIs enabled, memory probably corrupted > > CPU0: Failed to disable LPIs > > rcu: srcu_init: Setting srcu_struct sizes based on contention. > > arch_timer: cp15 timer(s) running at 62.50MHz (virt). > > clocksource: arch_sys_counter: mask: 0x1ffffffffffffff max_cycles: > > 0x1cd42e208c, max_idle_ns: 881590405314 ns > > sched_clock: 57 bits at 63MHz, resolution 16ns, wraps every > > 4398046511096ns > > Console: colour dummy device 80x25 > > Calibrating delay loop (skipped), value calculated using timer frequency.. 
> > 125.00 BogoMIPS (lpj=250000) > > pid_max: default: 32768 minimum: 301 > > Mount-cache hash table entries: 2048 (order: 2, 16384 bytes, linear) > > Mountpoint-cache hash table entries: 2048 (order: 2, 16384 bytes, linear) > > cacheinfo: Unable to detect cache hierarchy for CPU 0 > > rcu: Hierarchical SRCU implementation. > > rcu: Max phase no-delay instances is 1000. > > Platform MSI: gic-its@E000040000 domain created > > PCI/MSI: /soc/interrupt-controller@E000000000/gic-its@E000040000 > > domain created > > EFI services will not be available. > > smp: Bringing up secondary CPUs ... > > smp: Brought up 1 node, 1 CPU > > SMP: Total of 1 processors activated. > > CPU features: detected: 32-bit EL0 Support > > CPU features: detected: CRC32 instructions > > CPU: All CPU(s) started at EL1 > > alternatives: applying system-wide alternatives > > devtmpfs: initialized > > clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: > > 7645041785100000 ns > > futex hash table entries: 256 (order: 2, 16384 bytes, linear) > > DMI not present or invalid. > > DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations > > DMA: preallocated 128 KiB GFP_KERNEL|GFP_DMA pool for atomic > > allocations > > DMA: preallocated 128 KiB GFP_KERNEL|GFP_DMA32 pool for atomic > > allocations > > hw-breakpoint: found 6 breakpoint and 4 watchpoint registers. 
> > ASID allocator initialised with 65536 entries > > Serial: AMBA PL011 UART driver > > Modules: 30080 pages in range for non-PLT usage > > Modules: 521600 pages in range for PLT usage > > iommu: Default domain type: Translated > > iommu: DMA domain TLB invalidation policy: strict mode > > SCSI subsystem initialized > > vgaarb: loaded > > clocksource: Switched to clocksource arch_sys_counter > > PCI: CLS 0 bytes, default 64 > > workingset: timestamp_bits=46 max_order=18 bucket_order=0 > > fuse: init (API version 7.38) > > Block layer SCSI generic (bsg) driver version 0.4 loaded (major 251) > > io scheduler mq-deadline registered > > io scheduler kyber registered > > Unpacking initramfs... > > Freeing initrd memory: 4596K > > Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled > > hw perfevents: enabled with armv8_cortex_a53 PMU driver, 7 counters > > available > > clk: Disabling unused clocks > > Warning: unable to open an initial console. > > Freeing unused kernel memory: 1600K > > > > Thanks in advance for your great advice and support, > > Cheers, > > Lior. > > > > > -----Original Message----- > > > From: Heiko Schocher <hs@xxxxxxx> > > > Sent: Friday, December 22, 2023 10:04 AM > > > To: Dirk Behme <dirk.behme@xxxxxxxxx>; Lior Weintraub > > > <liorw@xxxxxxxxxx> > > > Cc: linux-embedded@xxxxxxxxxxxxxxx > > > Subject: Re: Debugging early SError exception > > > > > > [You don't often get email from hs@xxxxxxx. Learn why this is important > at > > > https://aka.ms/LearnAboutSenderIdentification ] > > > > > > CAUTION: External Sender > > > > > > Hello Dirk, Lior, > > > > > > On 22.12.23 08:48, Dirk Behme wrote: > > > > Am 22.12.23 um 08:03 schrieb Lior Weintraub: > > > >> Hi, > > > >> > > > >> I managed to dump the __log_buf but for some reason the UART is still > > not > > > working. > > > >> Please note that UART printed all the U-BOOT traces so AFAIU, the > device > > > tree is set correctly. > > > >> (Barebox is passing it's DTB into kernel). 
> > > >> > > > >> To enable the earlyprintk I have: > > > >> 1. Compiled the kernel with CONFIG_EARLY_PRINTK=y and > > > CONFIG_DEBUG_LL=y > > > >> 2. Modified the boot args to include: "console=ttyS0,115200n8 > > > earlycon=dw-apb-uart,0xd000307000" > > > >> 3. Verified that dw-apb-uart driver (8250_early.c) supports earlycon: > > > >> OF_EARLYCON_DECLARE(uart, "snps,dw-apb-uart", > > > early_serial8250_setup); > > > >> > > > >> From __log_buf dump: > > > >> Booting Linux on physical CPU 0x0000000000 [0x410fd034]4] > > > >> Linux version 6.5.0 (pliops@dev-liorw) (aarch64-buildroot-linux-gnu- > > > gcc.br_real (Buildroot > > > >> 2023.02.1-95-g8391404e23) 11.3.0, GNU ld (GNU Binutils) 2.38) #107 > > > SMP Thu Dec 21 17:33:12 IST 202323 > > > >> Machine model: Pliops Spider MK-I EVKVK > > > >> efi: UEFI not found.d. > > > >> Zone ranges:s: > > > >> DMA [mem 0x0000000000000000-0x000000002fffffff]f] > > > >> DMA32 emptyty > > > >> Normal emptyty > > > >> Movable zone start for each nodede > > > >> Early memory node rangeses > > > >> node 0: [mem 0x0000000000000000-0x000000002fffffff]f] > > > >> Initmem setup node 0 [mem 0x0000000000000000- > > > 0x000000002fffffff]f] > > > >> percpu: Embedded 25 pages/cpu s64800 r8192 d29408 u10240000 > > > >> pcpu-alloc: s64800 r8192 d29408 u102400 alloc=25*4096 > > > >> pcpu-alloc: [0] 0 > > > >> Detected VIPT I-cache on CPU0U0 > > > >> CPU features: GIC system register CPU interface present but disabled by > > > higher exception levelel > > > >> CPU features: detected: ARM erratum 84571919 > > > >> alternatives: applying boot alternativeses > > > >> Kernel command line: console=ttyS0,115200n8 earlycon=dw-apb- > > > uart,0xd00030700000 > > > >> Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes, > > linear)r) > > > >> Inode-cache hash table entries: 65536 (order: 7, 524288 bytes, linear)r) > > > >> Built 1 zonelists, mobility grouping on. 
Total pages: 19353636 > > > >> mem auto-init: stack:off, heap alloc:off, heap free:offff > > > >> software IO TLB: area num 1.1. > > > >> software IO TLB: mapped [mem 0x000000002b080000- > > > 0x000000002f080000] (64MB)B) > > > >> Memory: 689240K/786432K available (5824K kernel code, 1186K > > rwdata, > > > 1612K rodata, 1600K init, 400K > > > >> bss, 97192K reserved, 0K cma-reserved)d) > > > >> SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1=1 > > > >> trace event string verifier disableded > > > >> rcu: Hierarchical RCU implementation.n. > > > >> rcu: RCU event tracing is enabled.d. > > > >> rcu: RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=1.1. > > > >> rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.s. > > > >> rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1=1 > > > >> NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0 0 > > > >> GICv3: 96 SPIs implementeded > > > >> GICv3: 0 Extended SPIs implementeded > > > >> Root IRQ handler: gic_handle_irqrq > > > >> GICv3: GICv3 features: 16 PPIsIs > > > >> GICv3: CPU0: found redistributor 0 region 0:0x000000e00006000000 > > > >> GICv3: redistributor failed to wakeup..... > > > >> GICv3: GIC: unable to set SRE (disabled at EL2), panic aheadad > > > > > > > > I think the two messages above are the essential ones. 
> > >
> > > +1
> > >
> > > > Maybe it helps to check
> > > >
> > > > https://www.kernel.org/doc/html/v5.3/arm64/booting.html
> > > >
> > > > In the middle of that page, the "Call the kernel image" section has something
> > > > about the GIC:
> > > >
> > > > -- cut --
> > > > If the kernel is entered at EL1:
> > > >
> > > > ICC.SRE_EL2.Enable (bit 3) must be initialised to 0b1
> > > > ICC_SRE_EL2.SRE (bit 0) must be initialised to 0b1.
> > > > -- cut --
> > >
> > > It may also make sense to check your firmware (bootloader, ATF?) ... maybe
> > > there is some setting missing for your SoC/board?
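The bit positions quoted from booting.html can be sketched in C as follows. This is only an illustration under assumptions, not code from this thread: the helper names and the SPIDER_EL3_FIRMWARE macro are hypothetical, and the register write is shown for ICC_SRE_EL3, which uses the same positions for SRE (bit 0) and Enable (bit 3) and is the register to program when EL2 is not implemented. The write itself only makes sense in firmware executing at EL3, so it is compiled out by default.

```c
#include <stdint.h>

/* Bit positions shared by ICC_SRE_EL2 and ICC_SRE_EL3. */
#define ICC_SRE_SRE     (UINT64_C(1) << 0)  /* use the system register interface */
#define ICC_SRE_ENABLE  (UINT64_C(1) << 3)  /* let lower ELs access their ICC_SRE */

/* Pure helper: returns the old value with SRE and Enable set. */
static uint64_t icc_sre_enable_bits(uint64_t old)
{
    return old | ICC_SRE_SRE | ICC_SRE_ENABLE;
}

/*
 * Hypothetical firmware hook. SPIDER_EL3_FIRMWARE is an assumed build
 * macro, not something defined by Barebox, ATF or Linux.
 * S3_6_C12_C12_5 is the generic encoding of ICC_SRE_EL3.
 */
#if defined(__aarch64__) && defined(SPIDER_EL3_FIRMWARE)
static void gic_enable_sre_at_el3(void)
{
    uint64_t v;

    __asm__ volatile("mrs %0, S3_6_C12_C12_5" : "=r"(v));
    v = icc_sre_enable_bits(v);
    __asm__ volatile("msr S3_6_C12_C12_5, %0\n\tisb" : : "r"(v) : "memory");
}
#endif
```

With both bits clear, the kernel's attempt to read ICC_SRE_EL1 traps as an undefined instruction, which matches the "unable to set SRE" message and the Oops in gic_cpu_sys_reg_init quoted in this thread.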
> > > > > > bye, > > > Heiko > > > > > > > > > > >> Internal error: Oops - Undefined instruction: 0000000062383019 [#1] > > > SMPMP > > > >> Modules linked in: > > > >> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.5.0 #107 > > > >> Hardware name: Pliops Spider MK-I EVK (DT) > > > >> pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--) > > > >> pc : gic_cpu_sys_reg_init+0x58/0x2e4 > > > >> lr : gic_cpu_sys_reg_init+0x2a4/0x2e4 > > > >> sp : ffff8000808f3b40 > > > >> x29: ffff8000808f3b40 x28: 0000000000000000 x27: > > > 0000000000000001 > > > >> x26: ffff000000016040 x25: 0000000000000000 x24: > > ffff800080a6b000 > > > >> x23: ffff8000808fc320 x22: ffff8000809cc000 x21: ffff00002fe74670 > > > >> x20: ffff800080a90000 x19: 0000000000000000 x18: fffffffffffe0b10 > > > >> x17: ffff8000809f9480 x16: fffffc0000002248 x15: ffff80008090af28 > > > >> x14: fffffffffffc0b0f x13: 6461656861206369 x12: 6e6170202c29324c > > > >> x11: 452074612064656c x10: 6261736964282045 x9 : > > > 6428204552532074 > > > >> x8 : ffff80008090af28 x7 : ffff8000808f3970 x6 : 000000000000000c > > > >> x5 : 000000000000002a x4 : 0000000000000000 x3 : > > > 0000000000000000 > > > >> x2 : 0000000000000000 x1 : ffff8000808fd0c0 x0 : > 000000000000003c > > > >> Call trace: > > > >> gic_cpu_sys_reg_init+0x58/0x2e4 > > > >> gic_cpu_init.part.0+0xa8/0x114 > > > >> gic_init_bases+0x408/0x684 > > > >> gic_of_init+0x298/0x300 > > > >> of_irq_init+0x1c8/0x368 > > > >> irqchip_init+0x14/0x1c > > > >> init_IRQ+0x98/0xac > > > >> start_kernel+0x250/0x5b8 > > > >> __primary_switched+0xb4/0xbc > > > >> Code: 9260df39 d3441f33 d538cca0 36001180 (d538cc80) ) > > > >> ---[ end trace 0000000000000000 ]----- > > > >> Kernel panic - not syncing: Attempted to kill the idle task!k! > > > >> ---[ end Kernel panic - not syncing: Attempted to kill the idle task! 
]----- > > > >> > > > >> > > > >> The kernel panic is related to GIC distributor (currently under debug) but > > > AFAIU, > > > >> this has nothing to do with the UART not working on early stages. > > > > > > > > > > > > Yes, I agree. GIC issue and UART (at least the polling mode) should be > > > indendent. > > > > > > > > Best regards > > > > > > > > Dirk > > > > > > > > > > > >> Thanks in advanced for your advice, > > > >> Cheers, > > > >> Lior. > > > >> > > > >> > > > >>> -----Original Message----- > > > >>> From: Heiko Schocher <hs@xxxxxxx> > > > >>> Sent: Thursday, December 21, 2023 1:37 PM > > > >>> To: Lior Weintraub <liorw@xxxxxxxxxx> > > > >>> Cc: Dirk Behme <dirk.behme@xxxxxxxxx>; linux- > > > embedded@xxxxxxxxxxxxxxx > > > >>> Subject: Re: Debugging early SError exception > > > >>> > > > >>> [You don't often get email from hs@xxxxxxx. Learn why this is > important > > > at > > > >>> https://aka.ms/LearnAboutSenderIdentification ] > > > >>> > > > >>> CAUTION: External Sender > > > >>> > > > >>> Hi Lior, > > > >>> > > > >>> On 21.12.23 12:19, Dirk Behme wrote: > > > >>>> Am 21.12.23 um 11:04 schrieb Lior Weintraub: > > > >>>>> Thanks Dirk, > > > >>>>> > > > >>>>> Regarding the earlyprintk, not sure I know how to make it work. > > > >>>>> I have defined CONFIG_EARLY_PRINTK=y and CONFIG_DEBUG_LL=y > > on > > > my > > > >>> config but it doesn't seem to work. > > > >>>>> Do I need to pass something in the bootargs from the U-BOOT? > > > >>>>> Do I need to add that into my device tree? > > > >>>>> (Tried to set bootargs = "console=ttyS0,115200 earlyprintk"; under > > > "chosen" > > > >>> on my DT but it didn't > > > >>>>> work) > > > >>>> > > > >>>> Yes, what has to be enabled and what not and what has to be set > how > > is > > > often > > > >>> confusing. I think this > > > >>>> is not common for all systems, so I think to be on the safe side you > > have > > > to look > > > >>> into the code for > > > >>>> you system. 
Or short; The code is the documentation ;) > > > >>>> > > > >>>> > > > >>>>> The UART I am using is "snps,dw-apb-uart". > > > >>>>> > > > >>>>> Last week, to output the early logs I have implemented this hack: > > > >>>>> 1. Modify printk macro to run my print_func > > > >>>>> 2. This print_func wrote the characters into a single global variable > > (u32 > > > >>> simul_uart;) > > > >>>>> 3. Get the address location of this global variable and extract all > writes > > to > > > it > > > >>> from the Tarmac > > > >>>>> logs. > > > >>>>> > > > >>>>> This is a very slow and tedious process but it helped me identify the > > > initial > > > >>> SError. > > > >>>>> Initially I thought I can write directly into the UART FIFO register > > (which I > > > know > > > >>> the address) > > > >>>>> but this didn't work because Linux already setup the MMU so I guess > I > > > need to > > > >>> know the virtual > > > >>>>> address of this FIFO. > > > >>>>> Do I need to use __phys_to_virt of some sort? > > > >>>> > > > >>>> Yes, I think so. Have a look to the existing serial driver, too. It should > do > > > whats > > > >>> needed, and you > > > >>>> can borrow that, then. > > > >>> > > > >>> If you have access to the RAM after the crash (through a debugger or in > > > >>> your bootloader) and your mem is stable, find out the address of > > > __log_buf > > > >>> in System.map. Thats the buffer where printk writes into it, and so > > > dumping > > > >>> the content is what you would see in case uart works... > > > >>> > > > >>> Hope it helps! > > > >>> > > > >>> bye, > > > >>> Heiko > > > >>>> > > > >>>> Best regards > > > >>>> > > > >>>> Dirk > > > >>>> > > > >>>> > > > >>>>> Cheers, > > > >>>>> Lior. 
> > > >>>>>
> > > >>>>>> -----Original Message-----
> > > >>>>>> From: Dirk Behme <dirk.behme@xxxxxxxxx>
> > > >>>>>> Sent: Thursday, December 21, 2023 10:30 AM
> > > >>>>>> To: Lior Weintraub <liorw@xxxxxxxxxx>; linux-embedded@xxxxxxxxxxxxxxx
> > > >>>>>> Subject: Re: Debugging early SError exception
> > > >>>>>>
> > > >>>>>> On 21.12.23 at 08:43, Lior Weintraub wrote:
> > > >>>>>>> Hi Dirk,
> > > >>>>>>>
> > > >>>>>>> We found that the issue was in the early stages of Barebox (a.k.a. U-Boot v2).
> > > >>>>>>
> > > >>>>>> Glad to hear that! :)
> > > >>>>>>
> > > >>>>>>> Our implementation of putc_ll (in debug_ll) was writing into the UART Tx
> > > >>>>>>> FIFO without checking whether the FIFO was full.
> > > >>>>>>> Once the FIFO got full, it caused this SError, probably because the UART IP
> > > >>>>>>> generated an apberror signal.
> > > >>>>>>
> > > >>>>>> Thanks for the report!
> > > >>>>>>
> > > >>>>>>> Linux now runs and no longer reports the SError, but we face another issue.
> > > >>>>>>> We see that the PC ends up in a "report_bug" function.
> > > >>>>>>> Linux doesn't print anything to the UART (probably because it hasn't reached
> > > >>>>>>> the point where the console is configured?).
> > > >>>>>>
> > > >>>>>> For cases like this, using earlyprintk is usually a good option. Check
> > > >>>>>> whether the Linux kernel serial console (UART) driver of your SoC
> > > >>>>>> supports it. In the end it should be "just" a function in the serial
> > > >>>>>> console driver which outputs the console data via polling before
> > > >>>>>> (later) the interrupt-driven console part takes over.
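A minimal sketch of such a polled output function, which is also the shape of the putc_ll fix described above (wait until the transmitter can take a byte, then write it). It assumes a 16550-compatible register layout with 32-bit register spacing, as implied by the "mmio32" earlycon option; the function and macro names are illustrative and are not taken from Barebox or the kernel.

```c
#include <stdint.h>

/* 16550-style registers at a 4-byte stride (register index << 2). */
#define UART_THR       0x00u  /* transmit holding register (write) */
#define UART_LSR       0x14u  /* line status register (index 5) */
#define UART_LSR_THRE  0x20u  /* transmit holding register empty */

static inline uint32_t uart_read(volatile uint32_t *base, uint32_t off)
{
    return base[off / 4];
}

static inline void uart_write(volatile uint32_t *base, uint32_t off, uint32_t val)
{
    base[off / 4] = val;
}

/*
 * Polled character output: spin until the UART reports that the
 * transmit holding register (or Tx FIFO) is empty, then write one
 * byte. Writing without this check is what overflowed the FIFO.
 */
static void putc_ll_polled(volatile uint32_t *base, char c)
{
    while (!(uart_read(base, UART_LSR) & UART_LSR_THRE))
        ;  /* busy-wait, no interrupts needed */
    uart_write(base, UART_THR, (uint8_t)c);
}
```

Here base has to be an address the CPU can actually use at that point: the physical UART base before the MMU is on, or a mapped virtual address once Linux has set up its page tables.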
> > > >>>>>> > > > >>>>>> Best regards > > > >>>>>> > > > >>>>>> Dirk > > > >>>>>> > > > >>>>>> > > > >>>>>>> Since our debug means are limited it can take some time to find > the > > > root > > > >>>>>> cause. > > > >>>>>>> > > > >>>>>>> I will keep you posted and update our findings. > > > >>>>>>> Love to hear your thoughts, > > > >>>>>>> > > > >>>>>>> Cheers, > > > >>>>>>> Lior. > > > >>>>>>> > > > >>>>>>> > > > >>>>>>>> -----Original Message----- > > > >>>>>>>> From: Dirk Behme <dirk.behme@xxxxxxxxx> > > > >>>>>>>> Sent: Tuesday, December 19, 2023 3:37 PM > > > >>>>>>>> To: Lior Weintraub <liorw@xxxxxxxxxx>; linux- > > > embedded@xxxxxxxxxxxxxxx > > > >>>>>>>> Subject: Re: Debugging early SError exception > > > >>>>>>>> > > > >>>>>>>> [You don't often get email from dirk.behme@xxxxxxxxx. Learn > > why > > > this is > > > >>>>>>>> important at https://aka.ms/LearnAboutSenderIdentification ] > > > >>>>>>>> > > > >>>>>>>> CAUTION: External Sender > > > >>>>>>>> > > > >>>>>>>> Am 19.12.23 um 14:23 schrieb Lior Weintraub: > > > >>>>>>>>> Thanks Dirk, > > > >>>>>>>> > > > >>>>>>>> Welcome :) > > > >>>>>>>> > > > >>>>>>>> In case you find the root cause it would be nice to get some > generic > > > >>>>>>>> description of it so that we can learn something :) > > > >>>>>>>> > > > >>>>>>>> Best regards > > > >>>>>>>> > > > >>>>>>>> Dirk > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>>>> -----Original Message----- > > > >>>>>>>>>> From: Dirk Behme <dirk.behme@xxxxxxxxx> > > > >>>>>>>>>> Sent: Tuesday, December 19, 2023 9:09 AM > > > >>>>>>>>>> To: Lior Weintraub <liorw@xxxxxxxxxx>; linux- > > > >>>>>> embedded@xxxxxxxxxxxxxxx > > > >>>>>>>>>> Subject: Re: Debugging early SError exception > > > >>>>>>>>>> > > > >>>>>>>>>> [You don't often get email from dirk.behme@xxxxxxxxx. 
Learn > > > why this > > > >>>>>> is > > > >>>>>>>>>> important at https://aka.ms/LearnAboutSenderIdentification ] > > > >>>>>>>>>> > > > >>>>>>>>>> CAUTION: External Sender > > > >>>>>>>>>> > > > >>>>>>>>>> Am 17.12.23 um 22:32 schrieb Lior Weintraub: > > > >>>>>>>>>>> Hi, > > > >>>>>>>>>>> > > > >>>>>>>>>>> We have a new SoC with eLinux porting (kernel v6.5). > > > >>>>>>>>>>> This SoC is ARM64 (A53) single core based device. > > > >>>>>>>>>>> It runs correctly on QEMU but fails with SError on emulation > > > platform > > > >>>>>>>>>> (Synopsys Zebu running our SoC model). > > > >>>>>>>>>>> There is no debugger connected to this emulation but there > are > > > several > > > >>>>>>>>>> debug capabilities we can use: > > > >>>>>>>>>>> 1. Generating wave dump of CPU signals > > > >>>>>>>>>>> 2. Generate a Tarmac log > > > >>>>>>>>>>> 3. UART > > > >>>>>>>>>>> > > > >>>>>>>>>>> Since the SError happens at early stages of Linux boot the > UART > > > is not > > > >>>>>>>>>> enabled yet. > > > >>>>>>>>>>> From the Tarmac log we can see: > > > >>>>>>>>>>> 3824884521 ps ES (ffff800080760888:d65f03c0) O > > > el1h_ns: ret > > > >>>>>>>>>> (parse_early_param) > > > >>>>>>>>>>> 3824884522 ps ES (ffff800080763a60:d2801800) O > > > el1h_ns: mov > > > >>>>>>>> x0, > > > >>>>>>>>>> #0xc0 // #192 (setup_arch) > > > >>>>>>>>>>> R X0 (AARCH64) 00000000 000000c0 > > > >>>>>>>>>>> 3824884523 ps ES (ffff800080763a64:d51b4220) O > > > el1h_ns: msr > > > >>>>>>>>>> daif, x0 (setup_arch) > > > >>>>>>>>>>> R CPSR 600000c5 > > > >>>>>>>>>>> 3824884529 ps ES System Error (Abort) > > > >>>>>>>>>>> EXC [0x380] SError/vSError Current EL with > SP_ELx > > > >>>>>>>>>>> R ESR_EL1 (AARCH64) bf000002 > > > >>>>>>>>>>> R CPSR 600003c5 > > > >>>>>>>>>>> R SPSR_EL1 (AARCH64) 600000c5 > > > >>>>>>>>>>> R ELR_EL1 (AARCH64) ffff8000 80763a68 > > > >>>>>>>>>>> 3824884925 ps ES (ffff800080010b80:d10543ff) O > > > el1h_ns: sub > > > >>>>>>>> sp, > > > >>>>>>>>>> sp, #0x150 (vectors) > > > >>>>>>>>>>> R 
SP_EL1 (AARCH64) ffff8000 808f3c50 > > > >>>>>>>>>>> 3824884925 ps ES (ffff800080010b84:8b2063ff) O > > > el1h_ns: add > > > >>>>>>>> sp, > > > >>>>>>>>>> sp, x0 (vectors) > > > >>>>>>>>>>> R SP_EL1 (AARCH64) ffff8000 808f3d10 > > > >>>>>>>>>>> 3824884926 ps ES (ffff800080010b88:cb2063e0) O > > > el1h_ns: sub > > > >>>>>>>> x0, > > > >>>>>>>>>> sp, x0 (vectors) > > > >>>>>>>>>>> R X0 (AARCH64) ffff8000 808f3c50 > > > >>>>>>>>>>> 3824884927 ps ES (ffff800080010b8c:37700080) O > > > el1h_ns: tbnz > > > >>>>>>>> w0, > > > >>>>>>>>>> #14, ffff800080010b9c <vectors+0x39c> (vectors) > > > >>>>>>>>>>> 3824884935 ps ES (ffff800080010b90:cb2063e0) O > > > el1h_ns: sub > > > >>>>>>>> x0, > > > >>>>>>>>>> sp, x0 (vectors) > > > >>>>>>>>>>> R X0 (AARCH64) 00000000 000000c0 > > > >>>>>>>>>>> 3824884937 ps ES (ffff800080010b94:cb2063ff) O > > > el1h_ns: sub > > > >>>>>> sp, > > > >>>>>>>>>> sp, x0 (vectors) > > > >>>>>>>>>>> R SP_EL1 (AARCH64) ffff8000 808f3c50 > > > >>>>>>>>>>> 3824884938 ps ES (ffff800080010b98:140001ef) O > > > el1h_ns: b > > > >>>>>>>>>> ffff800080011354 <el1h_64_error> (vectors) > > > >>>>>>>>>>> > > > >>>>>>>>>>> If I understand correctly, the exception happened sometime > > > earlier > > > >>> and > > > >>>>>>>> only > > > >>>>>>>>>> now Linux boot code (setup_arch) opened the exception > > handling > > > and as > > > >>>>>> a > > > >>>>>>>>>> result we immediately jump to the SError exception handler. > > > >>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>>>> Yes, that sounds reasonable. If I understood correctly, you are > > > >>>>>>>>>> running something "quite new" on some software (QEMU) > and > > > >>>>>> hardware > > > >>>>>>>>>> (Synopsis) simulators. > > > >>>>>>>>>> > > > >>>>>>>>>> That would mean that you have new hardware with e.g. new > > > memory > > > >>>>>> map > > > >>>>>>>>>> not used before. What you describe might sound like in the > code > > > before > > > >>>>>>>>>> Linux (boot loader) there is anything resulting in the SError. 
> > > >>>>>>>>>> This might be an access to non-existing or non-enabled hardware, i.e. you
> > > >>>>>>>>>> may be trying to access (read/write) an address that is not available
> > > >>>>>>>>>> yet (or is just invalid). That is hard to debug. If you are able to
> > > >>>>>>>>>> modify the code that runs before Linux (the boot loader?), you might
> > > >>>>>>>>>> try to enable SError exceptions there, too, to get the exception earlier
> > > >>>>>>>>>> and make the search window smaller. I'm not that familiar with
> > > >>>>>>>>>> QEMU, but could you try to trace which (all?) hardware accesses your
> > > >>>>>>>>>> code makes, and then check whether all of these accesses are also
> > > >>>>>>>>>> valid on the hardware (Synopsys) emulation system? That should be
> > > >>>>>>>>>> checked from both the valid-address and the hardware-subsystem-enablement
> > > >>>>>>>>>> points of view.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Hth,
> > > >>>>>>>>>>
> > > >>>>>>>>>> Dirk
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>> From the Linux source:
> > > >>>>>>>>>>> parse_early_param();
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> dynamic_scs_init();
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> /*
> > > >>>>>>>>>>>  * Unmask asynchronous aborts and fiq after bringing up possible
> > > >>>>>>>>>>>  * earlycon. (Report possible System Errors once we can report this
> > > >>>>>>>>>>>  * occurred).
> > > >>>>>>>>>>>  */
> > > >>>>>>>>>>> local_daif_restore(DAIF_PROCCTX_NOIRQ); <---- This is when we get the exception.
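For reference, the ESR_EL1 value from the Tarmac log (bf000002) can be split into its architectural fields with a few shifts. The sketch below is only an illustration; the struct and function names are made up for this example. EC is bits [31:26], IL is bit 25 and ISS is bits [24:0]. For an SError, ISS bit 24 (IDS) indicates that the remaining syndrome bits are implementation defined.

```c
#include <stdint.h>
#include <stdio.h>

#define ESR_EC_SERROR       0x2fu       /* exception class: SError interrupt */
#define ESR_ISS_SERROR_IDS  (1u << 24)  /* syndrome is implementation defined */

struct esr_fields {
    unsigned ec;   /* exception class, ESR[31:26] */
    unsigned il;   /* instruction length bit, ESR[25] */
    uint32_t iss;  /* instruction specific syndrome, ESR[24:0] */
};

static struct esr_fields esr_decode(uint64_t esr)
{
    struct esr_fields f;

    f.ec  = (unsigned)((esr >> 26) & 0x3f);
    f.il  = (unsigned)((esr >> 25) & 0x1);
    f.iss = (uint32_t)(esr & 0x1ffffff);
    return f;
}

static void esr_print(uint64_t esr)
{
    struct esr_fields f = esr_decode(esr);

    printf("EC=0x%02x IL=%u ISS=0x%07x%s\n", f.ec, f.il, (unsigned)f.iss,
           (f.ec == ESR_EC_SERROR && (f.iss & ESR_ISS_SERROR_IDS))
               ? " (implementation-defined SError syndrome)" : "");
}
```

For 0xbf000002 this gives EC=0x2f, IL=1 and ISS=0x1000002 with IDS set, so the low syndrome bits have to be interpreted using the CPU documentation rather than the generic architecture encoding.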
> > > >>>>>>>>>>> > > > >>>>>>>>>>> After some kernel hacking (replacing printk) we could extract > > the > > > logs: > > > >>>>>>>>>>> 6Booting Linux on physical CPU 0x0000000000 > [0x410fd034] > > > >>>>>>>>>>> 5Linux version 6.5.0 (pliops@dev-liorw) (aarch64-buildroot- > > > linux-gnu- > > > >>>>>>>>>> gcc.br_real (Buildroot 2023.02.1-95-g8391404e23) 11.3.0, > > GNU > > > ld > > > >>>>>> (GNU > > > >>>>>>>>>> Binutils) 2.38) #101 SMP Sun Dec 17 20:09:06 IST 2023 > > > >>>>>>>>>>> 6Machine model: Pliops Spider MK-I EVK > > > >>>>>>>>>>> 2SError Interrupt on CPU0, code 0x00000000bf000002 -- > > SError > > > >>>>>>>>>>> CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0 #101 > > > >>>>>>>>>>> Hardware name: Pliops Spider MK-I EVK (DT) > > > >>>>>>>>>>> pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS > > > BTYPE=--) > > > >>>>>>>>>>> pc : setup_arch+0x13c/0x5ac > > > >>>>>>>>>>> lr : setup_arch+0x134/0x5ac > > > >>>>>>>>>>> sp : ffff8000808f3da0 > > > >>>>>>>>>>> x29: ffff8000808f3da0c x28: 0000000008758074c x27: > > > >>>>>>>>>> 0000000005e31b58c > > > >>>>>>>>>>> x26: 0000000000000001c x25: 0000000007e5f728c x24: > > > >>>>>>>>>> ffff8000808f8000c > > > >>>>>>>>>>> x23: ffff8000808f8600c x22: ffff8000807b6000c x21: > > > >>>>>>>> ffff800080010000c > > > >>>>>>>>>>> x20: ffff800080a1e000c x19: fffffbfffddfe190c x18: > > > >>>>>> 000000002266684ac > > > >>>>>>>>>>> x17: 00000000fcad60bbc x16: 0000000000001800c x15: > > > >>>>>>>>>> 0000000000000008c > > > >>>>>>>>>>> x14: ffffffffffffffffc x13: 0000000000000000c x12: > > > >>>>>> 0000000000000003c > > > >>>>>>>>>>> x11: 0101010101010101c x10: ffffffffffee87dfc x9 : > > > >>>>>>>> 0000000000000038c > > > >>>>>>>>>>> x8 : 0101010101010101c x7 : 7f7f7f7f7f7f7f7fc x6 : > > > >>>>>>>> 0000000000000001c > > > >>>>>>>>>>> x5 : 0000000000000000c x4 : 8000000000000000c x3 : > > > >>>>>>>>>> 0000000000000065c > > > >>>>>>>>>>> x2 : 0000000000000000c x1 : 0000000000000000c x0 : > > > >>>>>>>>>> 00000000000000c0c > > > 
>>>>>>>>>>> 0Kernel panic - not syncing: Asynchronous SError Interrupt > > > >>>>>>>>>>> CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0 #101 > > > >>>>>>>>>>> Hardware name: Pliops Spider MK-I EVK (DT) > > > >>>>>>>>>>> Call trace: > > > >>>>>>>>>>> dump_backtrace+0x9c/0xd0 > > > >>>>>>>>>>> show_stack+0x14/0x1c > > > >>>>>>>>>>> dump_stack_lvl+0x44/0x58 > > > >>>>>>>>>>> dump_stack+0x14/0x1c > > > >>>>>>>>>>> panic+0x2e0/0x33c > > > >>>>>>>>>>> nmi_panic+0x68/0x6c > > > >>>>>>>>>>> arm64_serror_panic+0x68/0x78 > > > >>>>>>>>>>> do_serror+0x24/0x54 > > > >>>>>>>>>>> el1h_64_error_handler+0x2c/0x40 > > > >>>>>>>>>>> el1h_64_error+0x64/0x68 > > > >>>>>>>>>>> setup_arch+0x13c/0x5ac > > > >>>>>>>>>>> start_kernel+0x5c/0x5b8 > > > >>>>>>>>>>> __primary_switched+0xb4/0xbc > > > >>>>>>>>>>> 0---[ end Kernel panic - not syncing: Asynchronous SError > > > Interrupt ]--- > > > >>>>>>>>>>> > > > >>>>>>>>>>> Can you please advice how to proceed with debugging? > > > >>>>>>>>>>> > > > >>>>>>>>>>> Thanks in advanced, > > > >>>>>>>>>>> Cheers, > > > >>>>>>>>>>> Lior. > > > >>>>>>>>>>> > > > >>>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>>> > > > >>>>>>> > > > >>>>> > > > >>>> > > > >>> > > > >>> -- > > > >>> DENX Software Engineering GmbH, Managing Director: Erika Unter > > > >>> HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, > > Germany > > > >>> Phone: +49-8142-66989-52 Fax: +49-8142-66989-80 Email: > > > hs@xxxxxxx > > > > > > > > > > -- > > > DENX Software Engineering GmbH, Managing Director: Erika Unter > > > HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany > > > Phone: +49-8142-66989-52 Fax: +49-8142-66989-80 Email: > hs@xxxxxxx