RE: Debugging early SError exception

Update:
The UART issue ("unable to open an initial console") is resolved.
I was missing CONFIG_SERIAL_8250_DW=y in my config.

The only issue left now is that the CPU sits idle in "wfi" and no interrupts arrive.

> -----Original Message-----
> From: Lior Weintraub
> Sent: Sunday, December 24, 2023 5:42 PM
> To: hs@xxxxxxx; Dirk Behme <dirk.behme@xxxxxxxxx>
> Cc: linux-embedded@xxxxxxxxxxxxxxx
> Subject: RE: Debugging early SError exception
> 
> Hi,
> 
> The GICv3 issue was resolved after:
> 1. Setting bit 0 (SRE) and bit 3 (Enable) in ICC_SRE_EL3 (we don't have
> virtualization support and hence ICC_SRE_EL2 is not supported).
> 2. Powering up the GICR at EL3.
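
For reference, a minimal sketch of the register values involved in the two steps above. It only models the bit fields in Python and is not firmware code. The function names are made up for illustration, and the redistributor part assumes that "power up" corresponds to the standard GICR_WAKER handshake; a GIC implementation with its own power register may need something different.

```python
# Bit positions defined by the GICv3 architecture for ICC_SRE_ELx.
ICC_SRE_SRE = 1 << 0      # enable the system register interface
ICC_SRE_ENABLE = 1 << 3   # let lower exception levels access their ICC_SRE_ELx

# GICR_WAKER fields.
GICR_WAKER_PROCESSOR_SLEEP = 1 << 1
GICR_WAKER_CHILDREN_ASLEEP = 1 << 2


def icc_sre_el3_value(current: int = 0) -> int:
    """Return the ICC_SRE_EL3 value with bit 0 and bit 3 set."""
    return current | ICC_SRE_SRE | ICC_SRE_ENABLE


def waker_request_wakeup(waker: int) -> int:
    """Clear ProcessorSleep; firmware would then poll ChildrenAsleep."""
    return waker & ~GICR_WAKER_PROCESSOR_SLEEP


def redistributor_awake(waker: int) -> bool:
    """True once neither ProcessorSleep nor ChildrenAsleep is set."""
    mask = GICR_WAKER_PROCESSOR_SLEEP | GICR_WAKER_CHILDREN_ASLEEP
    return (waker & mask) == 0


print(hex(icc_sre_el3_value()))
```

The kernel message "redistributor failed to wakeup" seen earlier in the thread is printed when ChildrenAsleep does not clear after ProcessorSleep has been cleared.
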
> 
> The earlycon issue was resolved after:
> 1. Adding "earlycon=uart8250,mmio32,0xd000307000,115200n8" to the boot
> args.
> 2. Adding "CONFIG_SERIAL_8250_CONSOLE=y" to the config (previously it had
> only CONFIG_SERIAL_8250=y).
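
To make the fields of that earlycon string explicit, here is a small sketch that splits it into driver name, I/O type, address and line options. The parser and its names (parse_earlycon, EarlyconArgs) are illustrative only and cover just the "uart8250,<iotype>,<addr>[,options]" form used above, not every variant the kernel accepts.

```python
import re
from typing import NamedTuple, Optional


class EarlyconArgs(NamedTuple):
    driver: str
    io_type: str
    address: int
    baud: Optional[int]
    parity: Optional[str]
    data_bits: Optional[int]


def parse_earlycon(value: str) -> EarlyconArgs:
    """Split 'driver,iotype,addr[,options]' into its parts."""
    parts = value.split(",")
    if len(parts) < 3:
        raise ValueError("expected at least driver,iotype,addr")
    driver, io_type = parts[0], parts[1]
    address = int(parts[2], 16)  # accepts an optional 0x prefix
    baud = parity = data_bits = None
    if len(parts) > 3:
        # Options look like "115200n8": baud, parity (n/o/e), data bits.
        match = re.fullmatch(r"(\d+)([noe])?(\d)?", parts[3])
        if match is None:
            raise ValueError("unrecognised options: " + parts[3])
        baud = int(match.group(1))
        parity = match.group(2)
        data_bits = int(match.group(3)) if match.group(3) else None
    return EarlyconArgs(driver, io_type, address, baud, parity, data_bits)


args = parse_earlycon("uart8250,mmio32,0xd000307000,115200n8")
print(args.driver, args.io_type, hex(args.address), args.baud)
```
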
> 
> Now I face a new issue:
> Linux boot hangs on "wait for interrupt" at cpu_do_idle.
> 
> The program counter is stuck at 0xffff8000805ae45c.
> ffff8000805ae454 <cpu_do_idle>:
> ffff8000805ae454:       d5033f9f        dsb     sy
> ffff8000805ae458:       d503207f        wfi
> ffff8000805ae45c:       d65f03c0        ret
> 
> I think that something is wrong with the timer or GIC settings and, as a
> result, the scheduler doesn't get its interrupts (timer ticks).
> 
> Additional info that might be relevant to this issue:
> The emulation platform runs at about 2.8MHz.
> CNTFRQ_EL0 is set to 2MHz (because the emulation platform's running
> frequency varies between 1.9 and 2.8MHz).
> The reason for those settings is to allow Linux to run as it would in the
> "real" world.
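
A quick arithmetic check on these figures, as a hedged sketch. The boot log quoted further down reports the architected timer at 62.50MHz and lpj=250000, while CNTFRQ_EL0 is described here as 2MHz. The sketch assumes that lpj equals the timer rate divided by HZ (which is how the value is derived when the delay loop calibration is skipped) and that the counter really increments at about 2MHz. Whether this mismatch has anything to do with the missing interrupts is not established.

```python
REPORTED_TIMER_HZ = 62_500_000  # "arch_timer: ... running at 62.50MHz" in the log
LPJ = 250_000                   # "lpj=250000" in the log
ASSUMED_COUNTER_HZ = 2_000_000  # assumption: real count rate equals the stated CNTFRQ_EL0


def kernel_hz(timer_hz: int, lpj: int) -> int:
    """Infer the kernel tick rate (HZ) from the timer rate and lpj."""
    return timer_hz // lpj


def tick_duration_s(timer_hz: int, hz: int, real_counter_hz: int) -> float:
    """Counter time one tick takes if the counter runs at real_counter_hz."""
    cycles_per_tick = timer_hz // hz
    return cycles_per_tick / real_counter_hz


hz = kernel_hz(REPORTED_TIMER_HZ, LPJ)
tick = tick_duration_s(REPORTED_TIMER_HZ, hz, ASSUMED_COUNTER_HZ)
print("HZ =", hz)
print("one tick =", tick, "s instead of", 1 / hz, "s")
```

Under these assumptions each scheduler tick would be programmed as 250000 counter cycles, which at 2MHz takes far longer than the 4 ms the kernel expects. It may be worth checking where the 62.50MHz figure comes from (CNTFRQ_EL0 as seen by the kernel, or a clock-frequency property in the device tree).
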
> 
> It is my understanding that there are 2 issues here:
> 1. Something is wrong with the timer/interrupt settings (note that the same
> configuration runs correctly on QEMU)
> 2. Something is wrong with the initramfs - according to the kernel source it
> seems to fail to open "/dev/console"
> 
> The full Linux boot log:
> Booting Linux on physical CPU 0x0000000000 [0x410fd034]
> Linux version 6.5.0 (pliops@dev-liorw) (aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot 2023.02.1-95-g8391404e23) 11.3.0, GNU ld (GNU Binutils) 2.38) #112 SMP Sun Dec 24 15:44:56 IST 2023
> Machine model: Pliops Spider MK-I EVK
> earlycon: uart8250 at MMIO32 0x000000d000307000 (options '115200n8')
> printk: bootconsole [uart8250] enabled
> efi: UEFI not found.
> Zone ranges:
>   DMA      [mem 0x0000000000000000-0x000000002fffffff]
>   DMA32    empty
>   Normal   empty
> Movable zone start for each node
> Early memory node ranges
>   node   0: [mem 0x0000000000000000-0x000000002fffffff]
> Initmem setup node 0 [mem 0x0000000000000000-0x000000002fffffff]
> percpu: Embedded 25 pages/cpu s64800 r8192 d29408 u102400
> Detected VIPT I-cache on CPU0
> CPU features: detected: GIC system register CPU interface
> CPU features: detected: ARM erratum 845719
> alternatives: applying boot alternatives
> Kernel command line: console=ttyS0,115200n8
> earlycon=uart8250,mmio32,0xd000307000,115200n8
> Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes, linear)
> Inode-cache hash table entries: 65536 (order: 7, 524288 bytes, linear)
> Built 1 zonelists, mobility grouping on.  Total pages: 193536
> mem auto-init: stack:off, heap alloc:off, heap free:off
> software IO TLB: area num 1.
> software IO TLB: mapped [mem 0x000000002b080000-
> 0x000000002f080000] (64MB)
> Memory: 689240K/786432K available (5824K kernel code, 1186K rwdata,
> 1612K rodata, 1600K init, 400K bss, 97192K reserved, 0K cma-reserved)
> SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
> trace event string verifier disabled
> rcu: Hierarchical RCU implementation.
> rcu:    RCU event tracing is enabled.
> rcu:    RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=1.
> rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
> rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1
> NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
> GICv3: 96 SPIs implemented
> GICv3: 0 Extended SPIs implemented
> Root IRQ handler: gic_handle_irq
> GICv3: GICv3 features: 16 PPIs
> GICv3: CPU0: found redistributor 0 region 0:0x000000e000060000
> ITS [mem 0xe000040000-0xe00005ffff]
> ITS@0x000000e000040000: allocated 8192 Devices @a0000 (indirect, esz 8,
> psz 64K, shr 1)
> ITS@0x000000e000040000: allocated 32768 Interrupt Collections @b0000
> (flat, esz 2, psz 64K, shr 1)
> GICv3: Expected reserved range
> [0x00000000000c0000:0x00000000000cffff], not found
> GICv3: using LPI property table @0x00000000000c0000
> GICv3: CPU0: Booted with LPIs enabled, memory probably corrupted
> CPU0: Failed to disable LPIs
> rcu: srcu_init: Setting srcu_struct sizes based on contention.
> arch_timer: cp15 timer(s) running at 62.50MHz (virt).
> clocksource: arch_sys_counter: mask: 0x1ffffffffffffff max_cycles:
> 0x1cd42e208c, max_idle_ns: 881590405314 ns
> sched_clock: 57 bits at 63MHz, resolution 16ns, wraps every
> 4398046511096ns
> Console: colour dummy device 80x25
> Calibrating delay loop (skipped), value calculated using timer frequency..
> 125.00 BogoMIPS (lpj=250000)
> pid_max: default: 32768 minimum: 301
> Mount-cache hash table entries: 2048 (order: 2, 16384 bytes, linear)
> Mountpoint-cache hash table entries: 2048 (order: 2, 16384 bytes, linear)
> cacheinfo: Unable to detect cache hierarchy for CPU 0
> rcu: Hierarchical SRCU implementation.
> rcu:    Max phase no-delay instances is 1000.
> Platform MSI: gic-its@E000040000 domain created
> PCI/MSI: /soc/interrupt-controller@E000000000/gic-its@E000040000
> domain created
> EFI services will not be available.
> smp: Bringing up secondary CPUs ...
> smp: Brought up 1 node, 1 CPU
> SMP: Total of 1 processors activated.
> CPU features: detected: 32-bit EL0 Support
> CPU features: detected: CRC32 instructions
> CPU: All CPU(s) started at EL1
> alternatives: applying system-wide alternatives
> devtmpfs: initialized
> clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns:
> 7645041785100000 ns
> futex hash table entries: 256 (order: 2, 16384 bytes, linear)
> DMI not present or invalid.
> DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations
> DMA: preallocated 128 KiB GFP_KERNEL|GFP_DMA pool for atomic
> allocations
> DMA: preallocated 128 KiB GFP_KERNEL|GFP_DMA32 pool for atomic
> allocations
> hw-breakpoint: found 6 breakpoint and 4 watchpoint registers.
> ASID allocator initialised with 65536 entries
> Serial: AMBA PL011 UART driver
> Modules: 30080 pages in range for non-PLT usage
> Modules: 521600 pages in range for PLT usage
> iommu: Default domain type: Translated
> iommu: DMA domain TLB invalidation policy: strict mode
> SCSI subsystem initialized
> vgaarb: loaded
> clocksource: Switched to clocksource arch_sys_counter
> PCI: CLS 0 bytes, default 64
> workingset: timestamp_bits=46 max_order=18 bucket_order=0
> fuse: init (API version 7.38)
> Block layer SCSI generic (bsg) driver version 0.4 loaded (major 251)
> io scheduler mq-deadline registered
> io scheduler kyber registered
> Unpacking initramfs...
> Freeing initrd memory: 4596K
> Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
> hw perfevents: enabled with armv8_cortex_a53 PMU driver, 7 counters
> available
> clk: Disabling unused clocks
> Warning: unable to open an initial console.
> Freeing unused kernel memory: 1600K
> 
> Thanks in advance for your great advice and support,
> Cheers,
> Lior.
> 
> > -----Original Message-----
> > From: Heiko Schocher <hs@xxxxxxx>
> > Sent: Friday, December 22, 2023 10:04 AM
> > To: Dirk Behme <dirk.behme@xxxxxxxxx>; Lior Weintraub
> > <liorw@xxxxxxxxxx>
> > Cc: linux-embedded@xxxxxxxxxxxxxxx
> > Subject: Re: Debugging early SError exception
> >
> > Hello Dirk, Lior,
> >
> > On 22.12.23 08:48, Dirk Behme wrote:
> > > On 22.12.23 at 08:03, Lior Weintraub wrote:
> > >> Hi,
> > >>
> > >> I managed to dump the __log_buf but for some reason the UART is still not working.
> > >> Please note that the UART printed all the U-BOOT traces so AFAIU, the device tree is set correctly.
> > >> (Barebox is passing its DTB to the kernel).
> > >>
> > >> To enable the earlyprintk I have:
> > >> 1. Compiled the kernel with CONFIG_EARLY_PRINTK=y and CONFIG_DEBUG_LL=y
> > >> 2. Modified the boot args to include: "console=ttyS0,115200n8 earlycon=dw-apb-uart,0xd000307000"
> > >> 3. Verified that the dw-apb-uart driver (8250_early.c) supports earlycon:
> > >> OF_EARLYCON_DECLARE(uart, "snps,dw-apb-uart", early_serial8250_setup);
> > >>
> > >>  From __log_buf dump:
> > >> Booting Linux on physical CPU 0x0000000000 [0x410fd034]
> > >> Linux version 6.5.0 (pliops@dev-liorw) (aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot 2023.02.1-95-g8391404e23) 11.3.0, GNU ld (GNU Binutils) 2.38) #107 SMP Thu Dec 21 17:33:12 IST 2023
> > >> Machine model: Pliops Spider MK-I EVK
> > >> efi: UEFI not found.
> > >> Zone ranges:
> > >>    DMA      [mem 0x0000000000000000-0x000000002fffffff]
> > >>    DMA32    empty
> > >>    Normal   empty
> > >> Movable zone start for each node
> > >> Early memory node ranges
> > >>    node   0: [mem 0x0000000000000000-0x000000002fffffff]
> > >> Initmem setup node 0 [mem 0x0000000000000000-0x000000002fffffff]
> > >> percpu: Embedded 25 pages/cpu s64800 r8192 d29408 u102400
> > >> pcpu-alloc: s64800 r8192 d29408 u102400 alloc=25*4096
> > >> pcpu-alloc: [0] 0
> > >> Detected VIPT I-cache on CPU0
> > >> CPU features: GIC system register CPU interface present but disabled by higher exception level
> > >> CPU features: detected: ARM erratum 845719
> > >> alternatives: applying boot alternatives
> > >> Kernel command line: console=ttyS0,115200n8 earlycon=dw-apb-uart,0xd000307000
> > >> Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes, linear)
> > >> Inode-cache hash table entries: 65536 (order: 7, 524288 bytes, linear)
> > >> Built 1 zonelists, mobility grouping on.  Total pages: 193536
> > >> mem auto-init: stack:off, heap alloc:off, heap free:off
> > >> software IO TLB: area num 1.
> > >> software IO TLB: mapped [mem 0x000000002b080000-0x000000002f080000] (64MB)
> > >> Memory: 689240K/786432K available (5824K kernel code, 1186K rwdata, 1612K rodata, 1600K init, 400K bss, 97192K reserved, 0K cma-reserved)
> > >> SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
> > >> trace event string verifier disabled
> > >> rcu: Hierarchical RCU implementation.
> > >> rcu:     RCU event tracing is enabled.
> > >> rcu:     RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=1.
> > >> rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
> > >> rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1
> > >> NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
> > >> GICv3: 96 SPIs implemented
> > >> GICv3: 0 Extended SPIs implemented
> > >> Root IRQ handler: gic_handle_irq
> > >> GICv3: GICv3 features: 16 PPIs
> > >> GICv3: CPU0: found redistributor 0 region 0:0x000000e000060000
> > >> GICv3: redistributor failed to wakeup...
> > >> GICv3: GIC: unable to set SRE (disabled at EL2), panic ahead
> > >
> > > I think the two messages above are the essential ones.
> >
> > +1
> >
> > > Maybe it helps to check
> > >
> > > https://www.kernel.org/doc/html/v5.3/arm64/booting.html
> > >
> > > In the middle of that page, the "Call the kernel image" section has something about the GIC:
> > >
> > > -- cut --
> > > If the kernel is entered at EL1:
> > >
> > >         ICC_SRE_EL2.Enable (bit 3) must be initialised to 0b1
> > >         ICC_SRE_EL2.SRE (bit 0) must be initialised to 0b1.
> > > -- cut --
> >
> > It may also make sense to check your firmware (bootloader, ATF?) ... maybe
> > some setting is missing for your SoC/board?
> >
> > bye,
> > Heiko
> >
> > >
> > >> Internal error: Oops - Undefined instruction: 0000000062383019 [#1] SMP
> > >> Modules linked in:
> > >> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.5.0 #107
> > >> Hardware name: Pliops Spider MK-I EVK (DT)
> > >> pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > >> pc : gic_cpu_sys_reg_init+0x58/0x2e4
> > >> lr : gic_cpu_sys_reg_init+0x2a4/0x2e4
> > >> sp : ffff8000808f3b40
> > >> x29: ffff8000808f3b40 x28: 0000000000000000 x27: 0000000000000001
> > >> x26: ffff000000016040 x25: 0000000000000000 x24: ffff800080a6b000
> > >> x23: ffff8000808fc320 x22: ffff8000809cc000 x21: ffff00002fe74670
> > >> x20: ffff800080a90000 x19: 0000000000000000 x18: fffffffffffe0b10
> > >> x17: ffff8000809f9480 x16: fffffc0000002248 x15: ffff80008090af28
> > >> x14: fffffffffffc0b0f x13: 6461656861206369 x12: 6e6170202c29324c
> > >> x11: 452074612064656c x10: 6261736964282045 x9 : 6428204552532074
> > >> x8 : ffff80008090af28 x7 : ffff8000808f3970 x6 : 000000000000000c
> > >> x5 : 000000000000002a x4 : 0000000000000000 x3 : 0000000000000000
> > >> x2 : 0000000000000000 x1 : ffff8000808fd0c0 x0 : 000000000000003c
> > >> Call trace:
> > >>   gic_cpu_sys_reg_init+0x58/0x2e4
> > >>   gic_cpu_init.part.0+0xa8/0x114
> > >>   gic_init_bases+0x408/0x684
> > >>   gic_of_init+0x298/0x300
> > >>   of_irq_init+0x1c8/0x368
> > >>   irqchip_init+0x14/0x1c
> > >>   init_IRQ+0x98/0xac
> > >>   start_kernel+0x250/0x5b8
> > >>   __primary_switched+0xb4/0xbc
> > >> Code: 9260df39 d3441f33 d538cca0 36001180 (d538cc80)
> > >> ---[ end trace 0000000000000000 ]---
> > >> Kernel panic - not syncing: Attempted to kill the idle task!
> > >> ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---
> > >>
> > >>
> > >> The kernel panic is related to the GIC distributor (currently under debug) but AFAIU,
> > >> this has nothing to do with the UART not working at early stages.
> > >
> > >
> > > Yes, I agree. The GIC issue and the UART (at least in polling mode) should be independent.
> > >
> > > Best regards
> > >
> > > Dirk
> > >
> > >
> > >> Thanks in advance for your advice,
> > >> Cheers,
> > >> Lior.
> > >>
> > >>
> > >>> -----Original Message-----
> > >>> From: Heiko Schocher <hs@xxxxxxx>
> > >>> Sent: Thursday, December 21, 2023 1:37 PM
> > >>> To: Lior Weintraub <liorw@xxxxxxxxxx>
> > >>> Cc: Dirk Behme <dirk.behme@xxxxxxxxx>; linux-
> > embedded@xxxxxxxxxxxxxxx
> > >>> Subject: Re: Debugging early SError exception
> > >>>
> > >>> Hi Lior,
> > >>>
> > >>> On 21.12.23 12:19, Dirk Behme wrote:
> > >>>> On 21.12.23 at 11:04, Lior Weintraub wrote:
> > >>>>> Thanks Dirk,
> > >>>>>
> > >>>>> Regarding the earlyprintk, I'm not sure I know how to make it work.
> > >>>>> I have defined CONFIG_EARLY_PRINTK=y and CONFIG_DEBUG_LL=y in my config but it doesn't seem to work.
> > >>>>> Do I need to pass something in the bootargs from U-BOOT?
> > >>>>> Do I need to add that into my device tree?
> > >>>>> (Tried to set bootargs = "console=ttyS0,115200 earlyprintk"; under "chosen" in my DT but it didn't work)
> > >>>>
> > >>>> Yes, what has to be enabled, what not, and what has to be set how is often
> > >>>> confusing. I think this is not the same for all systems, so to be on the
> > >>>> safe side you have to look into the code for your system.
> > >>>> Or in short: the code is the documentation ;)
> > >>>>
> > >>>>
> > >>>>> The UART I am using is "snps,dw-apb-uart".
> > >>>>>
> > >>>>> Last week, to output the early logs I implemented this hack:
> > >>>>> 1. Modified the printk macro to run my print_func
> > >>>>> 2. This print_func wrote the characters into a single global variable (u32 simul_uart;)
> > >>>>> 3. Got the address of this global variable and extracted all writes to it from the Tarmac logs.
> > >>>>>
> > >>>>> This is a very slow and tedious process but it helped me identify the initial SError.
> > >>>>> Initially I thought I could write directly into the UART FIFO register (whose address I know)
> > >>>>> but this didn't work because Linux had already set up the MMU, so I guess I need to know the
> > >>>>> virtual address of this FIFO.
> > >>>>> Do I need to use __phys_to_virt of some sort?
> > >>>>
> > >>>> Yes, I think so. Have a look at the existing serial driver, too. It should do
> > >>>> what's needed, and you can borrow that, then.
> > >>>
> > >>> If you have access to the RAM after the crash (through a debugger or in
> > >>> your bootloader) and your memory is stable, find the address of __log_buf
> > >>> in System.map. That's the buffer printk writes into, so dumping its
> > >>> content shows what you would see if the UART worked...
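
The System.map lookup can be scripted. A minimal sketch, assuming the usual three-column "address type name" layout of System.map; the function name and the sample addresses below are invented for illustration and are not taken from this build.

```python
from typing import Optional


def find_symbol_address(system_map_text: str, symbol: str) -> Optional[int]:
    """Return the address of `symbol` from System.map text, or None."""
    for line in system_map_text.splitlines():
        fields = line.split()
        # Expected layout: <hex address> <type letter> <symbol name>
        if len(fields) == 3 and fields[2] == symbol:
            return int(fields[0], 16)
    return None


# Hypothetical excerpt; real addresses depend on the kernel build.
SAMPLE_SYSTEM_MAP = """\
ffff800080a00000 B __bss_start
ffff800080a3c2a8 b __log_buf
ffff800080a5c2a8 b printk_rb_dynamic
"""

addr = find_symbol_address(SAMPLE_SYSTEM_MAP, "__log_buf")
print(hex(addr))
```

Note that the value is a kernel virtual address. Reading the buffer from a bootloader or through a physical memory dump needs the matching physical address, which depends on where the image was loaded and is not covered by this sketch.
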
> > >>>
> > >>> Hope it helps!
> > >>>
> > >>> bye,
> > >>> Heiko
> > >>>>
> > >>>> Best regards
> > >>>>
> > >>>> Dirk
> > >>>>
> > >>>>
> > >>>>> Cheers,
> > >>>>> Lior.
> > >>>>>
> > >>>>>> -----Original Message-----
> > >>>>>> From: Dirk Behme <dirk.behme@xxxxxxxxx>
> > >>>>>> Sent: Thursday, December 21, 2023 10:30 AM
> > >>>>>> To: Lior Weintraub <liorw@xxxxxxxxxx>; linux-
> > embedded@xxxxxxxxxxxxxxx
> > >>>>>> Subject: Re: Debugging early SError exception
> > >>>>>>
> > >>>>>> On 21.12.23 at 08:43, Lior Weintraub wrote:
> > >>>>>>> Hi Dirk,
> > >>>>>>>
> > >>>>>>> We found that the issue was at the early stages of Barebox (a.k.a U-BOOT v2).
> > >>>>>>
> > >>>>>> Glad to hear that! :)
> > >>>>>>
> > >>>>>>> Our implementation of putc_ll (on debug_ll) was writing into the UART Tx FIFO without checking if the FIFO is full.
> > >>>>>>> Once the FIFO got full it caused this SError, probably because the UART IP generated an apberror signal.
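
The difference between the unchecked and the polled putc_ll can be illustrated with a toy model. This is a sketch only: FakeUart and both putc_ll_* functions are hypothetical names, the model raises a Python exception where the real hardware reportedly signalled a bus error, and the only hardware fact used is the 16550 LSR THRE bit (0x20). It is not the Barebox implementation.

```python
LSR_THRE = 0x20  # 16550 Line Status Register: transmit holding register empty


class FakeUart:
    """Toy UART: writing to a full TX FIFO raises, standing in for the bus error."""

    def __init__(self, fifo_depth: int = 16):
        self.fifo_depth = fifo_depth
        self.fifo = []
        self.sent = []

    def read_lsr(self) -> int:
        # Model time passing: each status read shifts out one character.
        if self.fifo:
            self.sent.append(self.fifo.pop(0))
        return LSR_THRE if not self.fifo else 0

    def write_thr(self, ch: str) -> None:
        if len(self.fifo) >= self.fifo_depth:
            raise RuntimeError("write to full TX FIFO")
        self.fifo.append(ch)


def putc_ll_unchecked(uart: FakeUart, ch: str) -> None:
    """Writes without looking at the status register (the failing variant)."""
    uart.write_thr(ch)


def putc_ll_polled(uart: FakeUart, ch: str) -> None:
    """Waits until the transmitter reports empty before writing."""
    while not (uart.read_lsr() & LSR_THRE):
        pass
    uart.write_thr(ch)


uart = FakeUart(fifo_depth=4)
for ch in "hello world":
    putc_ll_polled(uart, ch)
print("".join(uart.sent + uart.fifo))
```

With the unchecked variant the same loop overflows the four-entry model FIFO on the fifth character.
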
> > >>>>>>
> > >>>>>> Thanks for the report!
> > >>>>>>
> > >>>>>>> Now Linux is running and doesn't report the SError again, but we face another issue.
> > >>>>>>> We see that the PC is getting into a "report_bug" function.
> > >>>>>>> Linux doesn't print anything to the UART (probably since it hasn't got to the point where the console is configured?).
> > >>>>>>
> > >>>>>> For cases like this using earlyprintk is usually a good option. Check
> > >>>>>> whether the Linux kernel serial console (UART) driver of your SoC
> > >>>>>> supports it. In the end it should be "just" a function in the serial
> > >>>>>> console driver which outputs the console data via polling before
> > >>>>>> (later) the interrupt driven console part takes over.
> > >>>>>>
> > >>>>>> Best regards
> > >>>>>>
> > >>>>>> Dirk
> > >>>>>>
> > >>>>>>
> > >>>>>>> Since our debug means are limited it can take some time to find the root cause.
> > >>>>>>>
> > >>>>>>> I will keep you posted and update our findings.
> > >>>>>>> Love to hear your thoughts,
> > >>>>>>>
> > >>>>>>> Cheers,
> > >>>>>>> Lior.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>> -----Original Message-----
> > >>>>>>>> From: Dirk Behme <dirk.behme@xxxxxxxxx>
> > >>>>>>>> Sent: Tuesday, December 19, 2023 3:37 PM
> > >>>>>>>> To: Lior Weintraub <liorw@xxxxxxxxxx>; linux-
> > embedded@xxxxxxxxxxxxxxx
> > >>>>>>>> Subject: Re: Debugging early SError exception
> > >>>>>>>>
> > >>>>>>>> On 19.12.23 at 14:23, Lior Weintraub wrote:
> > >>>>>>>>> Thanks Dirk,
> > >>>>>>>>
> > >>>>>>>> Welcome :)
> > >>>>>>>>
> > >>>>>>>> In case you find the root cause it would be nice to get some generic
> > >>>>>>>> description of it so that we can learn something :)
> > >>>>>>>>
> > >>>>>>>> Best regards
> > >>>>>>>>
> > >>>>>>>> Dirk
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>>> -----Original Message-----
> > >>>>>>>>>> From: Dirk Behme <dirk.behme@xxxxxxxxx>
> > >>>>>>>>>> Sent: Tuesday, December 19, 2023 9:09 AM
> > >>>>>>>>>> To: Lior Weintraub <liorw@xxxxxxxxxx>; linux-
> > >>>>>> embedded@xxxxxxxxxxxxxxx
> > >>>>>>>>>> Subject: Re: Debugging early SError exception
> > >>>>>>>>>>
> > >>>>>>>>>> On 17.12.23 at 22:32, Lior Weintraub wrote:
> > >>>>>>>>>>> Hi,
> > >>>>>>>>>>>
> > >>>>>>>>>>> We have a new SoC with an eLinux port (kernel v6.5).
> > >>>>>>>>>>> This SoC is an ARM64 (A53) single-core device.
> > >>>>>>>>>>> It runs correctly on QEMU but fails with an SError on the emulation platform (Synopsys Zebu running our SoC model).
> > >>>>>>>>>>> There is no debugger connected to this emulation but there are several debug capabilities we can use:
> > >>>>>>>>>>> 1. Generating a wave dump of CPU signals
> > >>>>>>>>>>> 2. Generating a Tarmac log
> > >>>>>>>>>>> 3. UART
> > >>>>>>>>>>>
> > >>>>>>>>>>> Since the SError happens at early stages of Linux boot the UART is not enabled yet.
> > >>>>>>>>>>>      From the Tarmac log we can see:
> > >>>>>>>>>>>       3824884521 ps  ES  (ffff800080760888:d65f03c0) O el1h_ns:   ret     (parse_early_param)
> > >>>>>>>>>>>       3824884522 ps  ES  (ffff800080763a60:d2801800) O el1h_ns:   mov x0, #0xc0   //      #192    (setup_arch)
> > >>>>>>>>>>>                          R X0 (AARCH64) 00000000 000000c0
> > >>>>>>>>>>>       3824884523 ps  ES  (ffff800080763a64:d51b4220) O el1h_ns:   msr daif, x0      (setup_arch)
> > >>>>>>>>>>>                          R CPSR 600000c5
> > >>>>>>>>>>>       3824884529 ps  ES  System Error (Abort)
> > >>>>>>>>>>>                          EXC [0x380] SError/vSError Current EL with SP_ELx
> > >>>>>>>>>>>                          R ESR_EL1 (AARCH64) bf000002
> > >>>>>>>>>>>                          R CPSR 600003c5
> > >>>>>>>>>>>                          R SPSR_EL1 (AARCH64) 600000c5
> > >>>>>>>>>>>                          R ELR_EL1 (AARCH64) ffff8000 80763a68
> > >>>>>>>>>>>       3824884925 ps  ES  (ffff800080010b80:d10543ff) O el1h_ns:   sub sp, sp, #0x150  (vectors)
> > >>>>>>>>>>>                          R SP_EL1 (AARCH64) ffff8000 808f3c50
> > >>>>>>>>>>>       3824884925 ps  ES  (ffff800080010b84:8b2063ff) O el1h_ns:   add sp, sp, x0      (vectors)
> > >>>>>>>>>>>                          R SP_EL1 (AARCH64) ffff8000 808f3d10
> > >>>>>>>>>>>       3824884926 ps  ES  (ffff800080010b88:cb2063e0) O el1h_ns:   sub x0, sp, x0      (vectors)
> > >>>>>>>>>>>                          R X0 (AARCH64) ffff8000 808f3c50
> > >>>>>>>>>>>       3824884927 ps  ES  (ffff800080010b8c:37700080) O el1h_ns:   tbnz w0, #14, ffff800080010b9c        <vectors+0x39c>         (vectors)
> > >>>>>>>>>>>       3824884935 ps  ES  (ffff800080010b90:cb2063e0) O el1h_ns:   sub x0, sp, x0      (vectors)
> > >>>>>>>>>>>                          R X0 (AARCH64) 00000000 000000c0
> > >>>>>>>>>>>       3824884937 ps  ES  (ffff800080010b94:cb2063ff) O el1h_ns:   sub sp, sp, x0      (vectors)
> > >>>>>>>>>>>                          R SP_EL1 (AARCH64) ffff8000 808f3c50
> > >>>>>>>>>>>       3824884938 ps  ES  (ffff800080010b98:140001ef) O el1h_ns:   b ffff800080011354        <el1h_64_error>         (vectors)
> > >>>>>>>>>>>
> > >>>>>>>>>>> If I understand correctly, the exception happened sometime earlier and only
> > >>>>>>>>>>> now has the Linux boot code (setup_arch) unmasked exception handling, so we
> > >>>>>>>>>>> immediately jump to the SError exception handler.
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> Yes, that sounds reasonable. If I understood correctly, you are running
> > >>>>>>>>>> something "quite new" on software (QEMU) and hardware (Synopsys) simulators.
> > >>>>>>>>>>
> > >>>>>>>>>> That would mean that you have new hardware with e.g. a new memory map not
> > >>>>>>>>>> used before. What you describe sounds like something in the code that runs
> > >>>>>>>>>> before Linux (the boot loader) is causing the SError. This might be an access
> > >>>>>>>>>> to non-existing or non-enabled hardware, i.e. you might be trying to access
> > >>>>>>>>>> (read/write) an address that is not available yet (or just invalid). That is
> > >>>>>>>>>> hard to debug. In case you are able to modify the code before Linux (the boot
> > >>>>>>>>>> loader?) you might try to enable SError exceptions there, too, to get the
> > >>>>>>>>>> exception earlier and so make the search window smaller. I'm not that familiar
> > >>>>>>>>>> with QEMU, but could you try to trace which (all?) hardware accesses your code
> > >>>>>>>>>> does, and then check whether all these accesses are valid on the hardware
> > >>>>>>>>>> (Synopsys) emulation system as well? That should be checked both for valid
> > >>>>>>>>>> addresses and for hardware subsystem enablement.
> > >>>>>>>>>>
> > >>>>>>>>>> Hth,
> > >>>>>>>>>>
> > >>>>>>>>>> Dirk
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>>      From the Linux source:
> > >>>>>>>>>>>           parse_early_param();
> > >>>>>>>>>>>
> > >>>>>>>>>>>           dynamic_scs_init();
> > >>>>>>>>>>>
> > >>>>>>>>>>>           /*
> > >>>>>>>>>>>            * Unmask asynchronous aborts and fiq after bringing up possible
> > >>>>>>>>>>>            * earlycon. (Report possible System Errors once we can report this
> > >>>>>>>>>>>            * occurred).
> > >>>>>>>>>>>            */
> > >>>>>>>>>>>           local_daif_restore(DAIF_PROCCTX_NOIRQ); <---- This is when we get the exception.
> > >>>>>>>>>>>
> > >>>>>>>>>>> After some kernel hacking (replacing printk) we could extract the logs:
> > >>>>>>>>>>> 6Booting Linux on physical CPU 0x0000000000 [0x410fd034]
> > >>>>>>>>>>> 5Linux version 6.5.0 (pliops@dev-liorw) (aarch64-buildroot-
> > linux-gnu-
> > >>>>>>>>>> gcc.br_real (Buildroot 2023.02.1-95-g8391404e23) 11.3.0,
> GNU
> > ld
> > >>>>>> (GNU
> > >>>>>>>>>> Binutils) 2.38) #101 SMP Sun Dec 17 20:09:06 IST 2023
> > >>>>>>>>>>> 6Machine model: Pliops Spider MK-I EVK
> > >>>>>>>>>>> 2SError Interrupt on CPU0, code 0x00000000bf000002 --
> SError
> > >>>>>>>>>>> CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0 #101
> > >>>>>>>>>>> Hardware name: Pliops Spider MK-I EVK (DT)
> > >>>>>>>>>>> pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS
> > BTYPE=--)
> > >>>>>>>>>>> pc : setup_arch+0x13c/0x5ac
> > >>>>>>>>>>> lr : setup_arch+0x134/0x5ac
> > >>>>>>>>>>> sp : ffff8000808f3da0
> > >>>>>>>>>>> x29: ffff8000808f3da0c x28: 0000000008758074c x27:
> > >>>>>>>>>> 0000000005e31b58c
> > >>>>>>>>>>> x26: 0000000000000001c x25: 0000000007e5f728c x24:
> > >>>>>>>>>> ffff8000808f8000c
> > >>>>>>>>>>> x23: ffff8000808f8600c x22: ffff8000807b6000c x21:
> > >>>>>>>> ffff800080010000c
> > >>>>>>>>>>> x20: ffff800080a1e000c x19: fffffbfffddfe190c x18:
> > >>>>>> 000000002266684ac
> > >>>>>>>>>>> x17: 00000000fcad60bbc x16: 0000000000001800c x15:
> > >>>>>>>>>> 0000000000000008c
> > >>>>>>>>>>> x14: ffffffffffffffffc x13: 0000000000000000c x12:
> > >>>>>> 0000000000000003c
> > >>>>>>>>>>> x11: 0101010101010101c x10: ffffffffffee87dfc x9 :
> > >>>>>>>> 0000000000000038c
> > >>>>>>>>>>> x8 : 0101010101010101c x7 : 7f7f7f7f7f7f7f7fc x6 :
> > >>>>>>>> 0000000000000001c
> > >>>>>>>>>>> x5 : 0000000000000000c x4 : 8000000000000000c x3 :
> > >>>>>>>>>> 0000000000000065c
> > >>>>>>>>>>> x2 : 0000000000000000c x1 : 0000000000000000c x0 :
> > >>>>>>>>>> 00000000000000c0c
> > >>>>>>>>>>> 0Kernel panic - not syncing: Asynchronous SError Interrupt
> > >>>>>>>>>>> CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0 #101
> > >>>>>>>>>>> Hardware name: Pliops Spider MK-I EVK (DT)
> > >>>>>>>>>>> Call trace:
> > >>>>>>>>>>>       dump_backtrace+0x9c/0xd0
> > >>>>>>>>>>>       show_stack+0x14/0x1c
> > >>>>>>>>>>>       dump_stack_lvl+0x44/0x58
> > >>>>>>>>>>>       dump_stack+0x14/0x1c
> > >>>>>>>>>>>       panic+0x2e0/0x33c
> > >>>>>>>>>>>       nmi_panic+0x68/0x6c
> > >>>>>>>>>>>       arm64_serror_panic+0x68/0x78
> > >>>>>>>>>>>       do_serror+0x24/0x54
> > >>>>>>>>>>>       el1h_64_error_handler+0x2c/0x40
> > >>>>>>>>>>>       el1h_64_error+0x64/0x68
> > >>>>>>>>>>>       setup_arch+0x13c/0x5ac
> > >>>>>>>>>>>       start_kernel+0x5c/0x5b8
> > >>>>>>>>>>>       __primary_switched+0xb4/0xbc
> > >>>>>>>>>>> 0---[ end Kernel panic - not syncing: Asynchronous SError
> > Interrupt ]---
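
The syndrome value in the panic above (0xbf000002) can be split into its architectural fields. A minimal sketch, assuming only the generic ESR_ELx layout (EC in bits 31:26, IL in bit 25, ISS in bits 24:0) and the IDS bit (ISS bit 24) that applies to SError syndromes; the function name is illustrative. When IDS is set the remaining ISS bits are implementation defined, so they are not interpreted here and would need the core's reference manual.

```python
EC_SERROR = 0x2F  # exception class for an SError interrupt


def decode_esr(esr: int) -> dict:
    """Split an ESR_ELx value into EC, IL and ISS (plus IDS for SError)."""
    ec = (esr >> 26) & 0x3F
    il = (esr >> 25) & 0x1
    iss = esr & 0x1FFFFFF
    fields = {"ec": ec, "il": il, "iss": iss}
    if ec == EC_SERROR:
        # IDS = 1 means the rest of the ISS is implementation defined.
        fields["ids"] = (iss >> 24) & 0x1
    return fields


decoded = decode_esr(0xBF000002)
print({name: hex(value) for name, value in decoded.items()})
```
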
> > >>>>>>>>>>>
> > >>>>>>>>>>> Can you please advise how to proceed with debugging?
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks in advance,
> > >>>>>>>>>>> Cheers,
> > >>>>>>>>>>> Lior.
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>> --
> > >>> DENX Software Engineering GmbH,      Managing Director: Erika Unter
> > >>> HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
> > >>> Phone: +49-8142-66989-52   Fax: +49-8142-66989-80   Email: hs@xxxxxxx
> > >
> >
> > --
> > DENX Software Engineering GmbH,      Managing Director: Erika Unter
> > HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
> > Phone: +49-8142-66989-52   Fax: +49-8142-66989-80   Email: hs@xxxxxxx




