RE: Debugging early SError exception

Update:
The CPU idle issue has been found.
It was caused by our SoC's changes to the timer interrupt connectivity (which makes sense :-)).
Merry Xmas, all.

> -----Original Message-----
> From: Lior Weintraub
> Sent: Sunday, December 24, 2023 9:12 PM
> To: hs@xxxxxxx; Dirk Behme <dirk.behme@xxxxxxxxx>
> Cc: linux-embedded@xxxxxxxxxxxxxxx
> Subject: RE: Debugging early SError exception
> 
> Update:
> UART issue ("unable to open an initial console") was resolved.
> I was missing CONFIG_SERIAL_8250_DW=y in my config.
>
> Now the only issue left is the CPU idle ("wfi"): no interrupts are arriving.
> 
> > -----Original Message-----
> > From: Lior Weintraub
> > Sent: Sunday, December 24, 2023 5:42 PM
> > To: hs@xxxxxxx; Dirk Behme <dirk.behme@xxxxxxxxx>
> > Cc: linux-embedded@xxxxxxxxxxxxxxx
> > Subject: RE: Debugging early SError exception
> >
> > Hi,
> >
> > The GICv3 issue was resolved after:
> > 1. Setting bit 0 and bit 3 in ICC_SRE_EL3 (we don't have virtualization support
> > and hence ICC_SRE_EL2 is not supported).
> > 2. Powering up the GICR at EL3
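For reference, a small sketch of the value that step 1 amounts to. The field names (SRE at bit 0, Enable at bit 3) follow the Arm GICv3 description of ICC_SRE_EL3; the helper name is made up for illustration, and the real write has to be done by EL3 firmware with an MSR instruction, not from Python.

```python
# Bit positions in ICC_SRE_EL3 (per the Arm GICv3 architecture specification).
ICC_SRE_SRE = 1 << 0      # SRE: use the system register interface
ICC_SRE_ENABLE = 1 << 3   # Enable: lower ELs may access their ICC_SRE_ELx

def icc_sre_el3_with_bits(current: int) -> int:
    """Return `current` with bit 0 and bit 3 set (hypothetical helper)."""
    return current | ICC_SRE_SRE | ICC_SRE_ENABLE

print(hex(icc_sre_el3_with_bits(0)))  # bits 0 and 3 only
```

Any other bits already set in the register are preserved by the OR.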
> >
> > The earlycon issue was resolved after:
> > 1. Adding "earlycon=uart8250,mmio32,0xd000307000,115200n8" to the boot
> > args.
> > 2. Adding "CONFIG_SERIAL_8250_CONSOLE=y" to the config (previously I had only
> > CONFIG_SERIAL_8250=y)
> >
> > Now I face a new issue:
> > Linux boot hangs on "wait for interrupt" at cpu_do_idle.
> >
> > The program counter is stuck at 0xffff8000805ae45c.
> > ffff8000805ae454 <cpu_do_idle>:
> > ffff8000805ae454:       d5033f9f        dsb     sy
> > ffff8000805ae458:       d503207f        wfi
> > ffff8000805ae45c:       d65f03c0        ret
> >
> > I think that something is wrong with the timer or GIC settings and as a result
> > the scheduler doesn't get the interrupts (timer ticks).
> >
> > Additional info that might be relevant to this issue:
> > The emulation platform runs at about 2.8MHz.
> > CNTFRQ_EL0 is set to 2M (because the emulation platform's running frequency
> > varies between 1.9 and 2.8MHz).
> > The reason for those settings is to let Linux run as it would in the "real"
> > world.
> >
> > It is my understanding that there are 2 issues here:
> > 1. Something is wrong with the timer/interrupt settings (note that the same
> > configuration runs correctly on QEMU)
> > 2. Something is wrong with the initramfs: according to the kernel source, it seems to
> > fail to open "/dev/console"
> >
> > The full Linux boot log:
> > Booting Linux on physical CPU 0x0000000000 [0x410fd034]
> > Linux version 6.5.0 (pliops@dev-liorw) (aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot 2023.02.1-95-g8391404e23) 11.3.0, GNU ld (GNU Binutils) 2.38) #112 SMP Sun Dec 24 15:44:56 IST 2023
> > Machine model: Pliops Spider MK-I EVK
> > earlycon: uart8250 at MMIO32 0x000000d000307000 (options '115200n8')
> > printk: bootconsole [uart8250] enabled
> > efi: UEFI not found.
> > Zone ranges:
> >   DMA      [mem 0x0000000000000000-0x000000002fffffff]
> >   DMA32    empty
> >   Normal   empty
> > Movable zone start for each node
> > Early memory node ranges
> >   node   0: [mem 0x0000000000000000-0x000000002fffffff]
> > Initmem setup node 0 [mem 0x0000000000000000-0x000000002fffffff]
> > percpu: Embedded 25 pages/cpu s64800 r8192 d29408 u102400
> > Detected VIPT I-cache on CPU0
> > CPU features: detected: GIC system register CPU interface
> > CPU features: detected: ARM erratum 845719
> > alternatives: applying boot alternatives
> > Kernel command line: console=ttyS0,115200n8
> > earlycon=uart8250,mmio32,0xd000307000,115200n8
> > Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes, linear)
> > Inode-cache hash table entries: 65536 (order: 7, 524288 bytes, linear)
> > Built 1 zonelists, mobility grouping on.  Total pages: 193536
> > mem auto-init: stack:off, heap alloc:off, heap free:off
> > software IO TLB: area num 1.
> > software IO TLB: mapped [mem 0x000000002b080000-
> > 0x000000002f080000] (64MB)
> > Memory: 689240K/786432K available (5824K kernel code, 1186K rwdata,
> > 1612K rodata, 1600K init, 400K bss, 97192K reserved, 0K cma-reserved)
> > SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
> > trace event string verifier disabled
> > rcu: Hierarchical RCU implementation.
> > rcu:    RCU event tracing is enabled.
> > rcu:    RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=1.
> > rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
> > rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1
> > NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
> > GICv3: 96 SPIs implemented
> > GICv3: 0 Extended SPIs implemented
> > Root IRQ handler: gic_handle_irq
> > GICv3: GICv3 features: 16 PPIs
> > GICv3: CPU0: found redistributor 0 region 0:0x000000e000060000
> > ITS [mem 0xe000040000-0xe00005ffff]
> > ITS@0x000000e000040000: allocated 8192 Devices @a0000 (indirect, esz 8,
> > psz 64K, shr 1)
> > ITS@0x000000e000040000: allocated 32768 Interrupt Collections @b0000
> > (flat, esz 2, psz 64K, shr 1)
> > GICv3: Expected reserved range
> > [0x00000000000c0000:0x00000000000cffff], not found
> > GICv3: using LPI property table @0x00000000000c0000
> > GICv3: CPU0: Booted with LPIs enabled, memory probably corrupted
> > CPU0: Failed to disable LPIs
> > rcu: srcu_init: Setting srcu_struct sizes based on contention.
> > arch_timer: cp15 timer(s) running at 62.50MHz (virt).
> > clocksource: arch_sys_counter: mask: 0x1ffffffffffffff max_cycles:
> > 0x1cd42e208c, max_idle_ns: 881590405314 ns
> > sched_clock: 57 bits at 63MHz, resolution 16ns, wraps every
> > 4398046511096ns
> > Console: colour dummy device 80x25
> > Calibrating delay loop (skipped), value calculated using timer frequency..
> > 125.00 BogoMIPS (lpj=250000)
> > pid_max: default: 32768 minimum: 301
> > Mount-cache hash table entries: 2048 (order: 2, 16384 bytes, linear)
> > Mountpoint-cache hash table entries: 2048 (order: 2, 16384 bytes, linear)
> > cacheinfo: Unable to detect cache hierarchy for CPU 0
> > rcu: Hierarchical SRCU implementation.
> > rcu:    Max phase no-delay instances is 1000.
> > Platform MSI: gic-its@E000040000 domain created
> > PCI/MSI: /soc/interrupt-controller@E000000000/gic-its@E000040000
> > domain created
> > EFI services will not be available.
> > smp: Bringing up secondary CPUs ...
> > smp: Brought up 1 node, 1 CPU
> > SMP: Total of 1 processors activated.
> > CPU features: detected: 32-bit EL0 Support
> > CPU features: detected: CRC32 instructions
> > CPU: All CPU(s) started at EL1
> > alternatives: applying system-wide alternatives
> > devtmpfs: initialized
> > clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns:
> > 7645041785100000 ns
> > futex hash table entries: 256 (order: 2, 16384 bytes, linear)
> > DMI not present or invalid.
> > DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations
> > DMA: preallocated 128 KiB GFP_KERNEL|GFP_DMA pool for atomic
> > allocations
> > DMA: preallocated 128 KiB GFP_KERNEL|GFP_DMA32 pool for atomic
> > allocations
> > hw-breakpoint: found 6 breakpoint and 4 watchpoint registers.
> > ASID allocator initialised with 65536 entries
> > Serial: AMBA PL011 UART driver
> > Modules: 30080 pages in range for non-PLT usage
> > Modules: 521600 pages in range for PLT usage
> > iommu: Default domain type: Translated
> > iommu: DMA domain TLB invalidation policy: strict mode
> > SCSI subsystem initialized
> > vgaarb: loaded
> > clocksource: Switched to clocksource arch_sys_counter
> > PCI: CLS 0 bytes, default 64
> > workingset: timestamp_bits=46 max_order=18 bucket_order=0
> > fuse: init (API version 7.38)
> > Block layer SCSI generic (bsg) driver version 0.4 loaded (major 251)
> > io scheduler mq-deadline registered
> > io scheduler kyber registered
> > Unpacking initramfs...
> > Freeing initrd memory: 4596K
> > Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
> > hw perfevents: enabled with armv8_cortex_a53 PMU driver, 7 counters
> > available
> > clk: Disabling unused clocks
> > Warning: unable to open an initial console.
> > Freeing unused kernel memory: 1600K
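A quick cross-check of the timer numbers in the log above. The formulas are my reading of how the kernel derives lpj and BogoMIPS when the delay loop is calibrated from the timer frequency; HZ is not printed in the log, so it is inferred here rather than known.

```python
def infer_hz(timer_hz: int, lpj: int) -> int:
    """Assumes lpj = timer_hz // HZ, so HZ = timer_hz // lpj."""
    return timer_hz // lpj

def bogomips(lpj: int, hz: int) -> str:
    """BogoMIPS as lpj / (500000 / HZ), two decimals, integer arithmetic only."""
    whole = lpj // (500000 // hz)
    frac = (lpj // (5000 // hz)) % 100
    return f"{whole}.{frac:02d}"

timer_hz = 62_500_000   # "arch_timer: cp15 timer(s) running at 62.50MHz"
lpj = 250_000           # "(lpj=250000)"

hz = infer_hz(timer_hz, lpj)
print(hz, bogomips(lpj, hz))
```

Under these assumptions the log is self-consistent for a 62.5MHz counter, which is not the 2M mentioned for CNTFRQ_EL0 earlier in this message, so it may be worth checking which frequency the kernel actually picks up.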
> >
> > Thanks in advance for your great advice and support,
> > Cheers,
> > Lior.
> >
> > > -----Original Message-----
> > > From: Heiko Schocher <hs@xxxxxxx>
> > > Sent: Friday, December 22, 2023 10:04 AM
> > > To: Dirk Behme <dirk.behme@xxxxxxxxx>; Lior Weintraub
> > > <liorw@xxxxxxxxxx>
> > > Cc: linux-embedded@xxxxxxxxxxxxxxx
> > > Subject: Re: Debugging early SError exception
> > >
> > > [You don't often get email from hs@xxxxxxx. Learn why this is important
> at
> > > https://aka.ms/LearnAboutSenderIdentification ]
> > >
> > > CAUTION: External Sender
> > >
> > > Hello Dirk, Lior,
> > >
> > > On 22.12.23 08:48, Dirk Behme wrote:
> > > > On 22.12.23 at 08:03, Lior Weintraub wrote:
> > > >> Hi,
> > > >>
> > > >> I managed to dump the __log_buf, but for some reason the UART is still not working.
> > > >> Please note that the UART printed all the U-BOOT traces, so AFAIU the device tree is set correctly.
> > > >> (Barebox is passing its DTB to the kernel.)
> > > >>
> > > >> To enable earlyprintk I have:
> > > >> 1. Compiled the kernel with CONFIG_EARLY_PRINTK=y and CONFIG_DEBUG_LL=y
> > > >> 2. Modified the boot args to include: "console=ttyS0,115200n8 earlycon=dw-apb-uart,0xd000307000"
> > > >> 3. Verified that the dw-apb-uart driver (8250_early.c) supports earlycon:
> > > >> OF_EARLYCON_DECLARE(uart, "snps,dw-apb-uart", early_serial8250_setup);
> > > >>
> > > >>  From __log_buf dump:
> > > >> Booting Linux on physical CPU 0x0000000000 [0x410fd034]
> > > >> Linux version 6.5.0 (pliops@dev-liorw) (aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot 2023.02.1-95-g8391404e23) 11.3.0, GNU ld (GNU Binutils) 2.38) #107 SMP Thu Dec 21 17:33:12 IST 2023
> > > >> Machine model: Pliops Spider MK-I EVK
> > > >> efi: UEFI not found.
> > > >> Zone ranges:
> > > >>    DMA      [mem 0x0000000000000000-0x000000002fffffff]
> > > >>    DMA32    empty
> > > >>    Normal   empty
> > > >> Movable zone start for each node
> > > >> Early memory node ranges
> > > >>    node   0: [mem 0x0000000000000000-0x000000002fffffff]
> > > >> Initmem setup node 0 [mem 0x0000000000000000-0x000000002fffffff]
> > > >> percpu: Embedded 25 pages/cpu s64800 r8192 d29408 u102400
> > > >> pcpu-alloc: s64800 r8192 d29408 u102400 alloc=25*4096
> > > >> pcpu-alloc: [0] 0
> > > >> Detected VIPT I-cache on CPU0
> > > >> CPU features: GIC system register CPU interface present but disabled by higher exception level
> > > >> CPU features: detected: ARM erratum 845719
> > > >> alternatives: applying boot alternatives
> > > >> Kernel command line: console=ttyS0,115200n8 earlycon=dw-apb-uart,0xd000307000
> > > >> Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes, linear)
> > > >> Inode-cache hash table entries: 65536 (order: 7, 524288 bytes, linear)
> > > >> Built 1 zonelists, mobility grouping on.  Total pages: 193536
> > > >> mem auto-init: stack:off, heap alloc:off, heap free:off
> > > >> software IO TLB: area num 1.
> > > >> software IO TLB: mapped [mem 0x000000002b080000-0x000000002f080000] (64MB)
> > > >> Memory: 689240K/786432K available (5824K kernel code, 1186K rwdata, 1612K rodata, 1600K init, 400K bss, 97192K reserved, 0K cma-reserved)
> > > >> SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
> > > >> trace event string verifier disabled
> > > >> rcu: Hierarchical RCU implementation.
> > > >> rcu:     RCU event tracing is enabled.
> > > >> rcu:     RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=1.
> > > >> rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
> > > >> rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1
> > > >> NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
> > > >> GICv3: 96 SPIs implemented
> > > >> GICv3: 0 Extended SPIs implemented
> > > >> Root IRQ handler: gic_handle_irq
> > > >> GICv3: GICv3 features: 16 PPIs
> > > >> GICv3: CPU0: found redistributor 0 region 0:0x000000e000060000
> > > >> GICv3: redistributor failed to wakeup...
> > > >> GICv3: GIC: unable to set SRE (disabled at EL2), panic ahead
> > > >
> > > > I think the two messages above are the essential ones.
> > >
> > > +1
> > >
> > > > Maybe it helps to check
> > > >
> > > > https://www.kernel.org/doc/html/v5.3/arm64/booting.html
> > > >
> > > > In the middle of that page, the "Call the kernel image" section has something
> > > > about the GIC:
> > > >
> > > > -- cut --
> > > > If the kernel is entered at EL1:
> > > >
> > > >         ICC_SRE_EL2.Enable (bit 3) must be initialised to 0b1
> > > >         ICC_SRE_EL2.SRE (bit 0) must be initialised to 0b1.
> > > > -- cut --
> > >
> > > Also, maybe it makes sense to check your firmware (bootloader, ATF?) ... maybe
> > > there is some setting missing for your SoC/board?
> > >
> > > bye,
> > > Heiko
> > >
> > > >
> > > >> Internal error: Oops - Undefined instruction: 0000000062383019 [#1] SMP
> > > >> Modules linked in:
> > > >> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.5.0 #107
> > > >> Hardware name: Pliops Spider MK-I EVK (DT)
> > > >> pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > >> pc : gic_cpu_sys_reg_init+0x58/0x2e4
> > > >> lr : gic_cpu_sys_reg_init+0x2a4/0x2e4
> > > >> sp : ffff8000808f3b40
> > > >> x29: ffff8000808f3b40 x28: 0000000000000000 x27: 0000000000000001
> > > >> x26: ffff000000016040 x25: 0000000000000000 x24: ffff800080a6b000
> > > >> x23: ffff8000808fc320 x22: ffff8000809cc000 x21: ffff00002fe74670
> > > >> x20: ffff800080a90000 x19: 0000000000000000 x18: fffffffffffe0b10
> > > >> x17: ffff8000809f9480 x16: fffffc0000002248 x15: ffff80008090af28
> > > >> x14: fffffffffffc0b0f x13: 6461656861206369 x12: 6e6170202c29324c
> > > >> x11: 452074612064656c x10: 6261736964282045 x9 : 6428204552532074
> > > >> x8 : ffff80008090af28 x7 : ffff8000808f3970 x6 : 000000000000000c
> > > >> x5 : 000000000000002a x4 : 0000000000000000 x3 : 0000000000000000
> > > >> x2 : 0000000000000000 x1 : ffff8000808fd0c0 x0 : 000000000000003c
> > > >> Call trace:
> > > >>   gic_cpu_sys_reg_init+0x58/0x2e4
> > > >>   gic_cpu_init.part.0+0xa8/0x114
> > > >>   gic_init_bases+0x408/0x684
> > > >>   gic_of_init+0x298/0x300
> > > >>   of_irq_init+0x1c8/0x368
> > > >>   irqchip_init+0x14/0x1c
> > > >>   init_IRQ+0x98/0xac
> > > >>   start_kernel+0x250/0x5b8
> > > >>   __primary_switched+0xb4/0xbc
> > > >> Code: 9260df39 d3441f33 d538cca0 36001180 (d538cc80)
> > > >> ---[ end trace 0000000000000000 ]---
> > > >> Kernel panic - not syncing: Attempted to kill the idle task!
> > > >> ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---
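The "Code:" line of the oops can be decoded by hand to see which system register access trapped. A minimal sketch, assuming the standard A64 MRS encoding; the function names are made up for illustration and the lookup table only covers the two GIC CPU interface registers that show up in this dump.

```python
def decode_mrs(insn: int):
    """Decode an A64 MRS instruction word into (op0, op1, CRn, CRm, op2, Rt)."""
    # MRS <Xt>, <sysreg> has bits [31:20] equal to 0xD53.
    if (insn >> 20) & 0xFFF != 0xD53:
        raise ValueError("not an MRS encoding")
    op0 = 2 + ((insn >> 19) & 0x1)
    op1 = (insn >> 16) & 0x7
    crn = (insn >> 12) & 0xF
    crm = (insn >> 8) & 0xF
    op2 = (insn >> 5) & 0x7
    rt = insn & 0x1F
    return op0, op1, crn, crm, op2, rt

# Only the encodings relevant to this oops; not a complete table.
KNOWN_SYSREGS = {
    (3, 0, 12, 12, 4): "ICC_CTLR_EL1",
    (3, 0, 12, 12, 5): "ICC_SRE_EL1",
}

def describe_mrs(insn: int) -> str:
    *key, rt = decode_mrs(insn)
    name = KNOWN_SYSREGS.get(tuple(key), "S{}_{}_C{}_C{}_{}".format(*key))
    return f"mrs x{rt}, {name}"

print(describe_mrs(0xD538CCA0))  # two words before the faulting one
print(describe_mrs(0xD538CC80))  # the word in parentheses in the Code: line
```

Read this way, the undefined instruction is a read of a GIC CPU interface system register, which fits the "unable to set SRE" message just before the oops.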
> > > >>
> > > >>
> > > >> The kernel panic is related to the GIC distributor (currently under debug), but AFAIU
> > > >> this has nothing to do with the UART not working in the early stages.
> > > >
> > > >
> > > > Yes, I agree. The GIC issue and the UART (at least in polling mode) should be
> > > > independent.
> > > >
> > > > Best regards
> > > >
> > > > Dirk
> > > >
> > > >
> > > >> Thanks in advance for your advice,
> > > >> Cheers,
> > > >> Lior.
> > > >>
> > > >>
> > > >>> -----Original Message-----
> > > >>> From: Heiko Schocher <hs@xxxxxxx>
> > > >>> Sent: Thursday, December 21, 2023 1:37 PM
> > > >>> To: Lior Weintraub <liorw@xxxxxxxxxx>
> > > >>> Cc: Dirk Behme <dirk.behme@xxxxxxxxx>; linux-
> > > embedded@xxxxxxxxxxxxxxx
> > > >>> Subject: Re: Debugging early SError exception
> > > >>>
> > > >>> Hi Lior,
> > > >>>
> > > >>> On 21.12.23 12:19, Dirk Behme wrote:
> > > >>>> On 21.12.23 at 11:04, Lior Weintraub wrote:
> > > >>>>> Thanks Dirk,
> > > >>>>>
> > > >>>>> Regarding earlyprintk, I'm not sure how to make it work.
> > > >>>>> I have defined CONFIG_EARLY_PRINTK=y and CONFIG_DEBUG_LL=y in my config, but it doesn't seem to work.
> > > >>>>> Do I need to pass something in the bootargs from U-BOOT?
> > > >>>>> Do I need to add that to my device tree?
> > > >>>>> (I tried setting bootargs = "console=ttyS0,115200 earlyprintk"; under "chosen" in my DT, but it didn't work.)
> > > >>>>
> > > >>>> Yes, what has to be enabled, what not, and what has to be set how is often confusing.
> > > >>>> I think this is not common to all systems, so to be on the safe side you have to look
> > > >>>> into the code for your system. Or in short: the code is the documentation ;)
> > > >>>>
> > > >>>>
> > > >>>>> The UART I am using is "snps,dw-apb-uart".
> > > >>>>>
> > > >>>>> Last week, to output the early logs, I implemented this hack:
> > > >>>>> 1. Modified the printk macro to run my print_func
> > > >>>>> 2. This print_func wrote the characters into a single global variable (u32 simul_uart;)
> > > >>>>> 3. Got the address of this global variable and extracted all writes to it from the Tarmac logs.
> > > >>>>>
> > > >>>>> This is a very slow and tedious process, but it helped me identify the initial SError.
> > > >>>>> Initially I thought I could write directly into the UART FIFO register (whose address I know),
> > > >>>>> but this didn't work because Linux had already set up the MMU, so I guess I need to know the
> > > >>>>> virtual address of this FIFO.
> > > >>>>> Do I need to use __phys_to_virt of some sort?
> > > >>>>
> > > >>>> Yes, I think so. Have a look at the existing serial driver, too. It should do what's
> > > >>>> needed, and you can borrow that, then.
> > > >>>
> > > >>> If you have access to the RAM after the crash (through a debugger or in
> > > >>> your bootloader) and your memory is stable, find the address of __log_buf
> > > >>> in System.map. That's the buffer printk writes into, so dumping its
> > > >>> content shows what you would see if the UART worked...
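The System.map lookup can be scripted. A minimal sketch, assuming the usual three-column "address type name" layout of System.map; the addresses in the sample text are invented placeholders, not values from this kernel build, and the helper name is hypothetical.

```python
def find_symbol(system_map_text: str, name: str) -> int:
    """Return the address of `name` from System.map-style text."""
    for line in system_map_text.splitlines():
        parts = line.split()
        # Expected columns: <hex address> <type letter> <symbol name>
        if len(parts) == 3 and parts[2] == name:
            return int(parts[0], 16)
    raise KeyError(name)

# Placeholder content for illustration only; read the real System.map from the build tree.
SAMPLE_MAP = """\
ffff800080000000 T _text
ffff800080a00000 b __log_buf
ffff800080a20000 B _end
"""

print(hex(find_symbol(SAMPLE_MAP, "__log_buf")))
```

The value found this way is a kernel virtual address, so it still has to be translated to the physical address that a debugger or bootloader memory dump uses.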
> > > >>>
> > > >>> Hope it helps!
> > > >>>
> > > >>> bye,
> > > >>> Heiko
> > > >>>>
> > > >>>> Best regards
> > > >>>>
> > > >>>> Dirk
> > > >>>>
> > > >>>>
> > > >>>>> Cheers,
> > > >>>>> Lior.
> > > >>>>>
> > > >>>>>> -----Original Message-----
> > > >>>>>> From: Dirk Behme <dirk.behme@xxxxxxxxx>
> > > >>>>>> Sent: Thursday, December 21, 2023 10:30 AM
> > > >>>>>> To: Lior Weintraub <liorw@xxxxxxxxxx>; linux-
> > > embedded@xxxxxxxxxxxxxxx
> > > >>>>>> Subject: Re: Debugging early SError exception
> > > >>>>>>
> > > >>>>>> On 21.12.23 at 08:43, Lior Weintraub wrote:
> > > >>>>>>> Hi Dirk,
> > > >>>>>>>
> > > >>>>>>> We found that the issue was in the early stages of Barebox (a.k.a. U-BOOT v2).
> > > >>>>>>
> > > >>>>>> Glad to hear that! :)
> > > >>>>>>
> > > >>>>>>> Our implementation of putc_ll (in debug_ll) was writing into the UART Tx FIFO
> > > >>>>>>> without checking whether the FIFO was full.
> > > >>>>>>> Once the FIFO got full it caused this SError, probably because the UART IP
> > > >>>>>>> generated an apberror signal.
> > > >>>>>>
> > > >>>>>> Thanks for the report!
> > > >>>>>>
> > > >>>>>>> Now Linux is running and doesn't report the SError again, but we
> > > >>>>>>> face another issue.
> > > >>>>>>> We see that the PC is getting into a "report_bug" function.
> > > >>>>>>> Linux doesn't print anything to the UART (probably since it hasn't got to
> > > >>>>>>> the point where the console is configured?).
> > > >>>>>>
> > > >>>>>> For cases like this, using earlyprintk is usually a good option. Check
> > > >>>>>> whether the Linux kernel serial console (UART) driver of your SoC
> > > >>>>>> supports it. In the end it should be "just" a function in the serial
> > > >>>>>> console driver which outputs the console data via polling before
> > > >>>>>> (later) the interrupt-driven console part takes over.
> > > >>>>>>
> > > >>>>>> Best regards
> > > >>>>>>
> > > >>>>>> Dirk
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>> Since our debug means are limited it can take some time to find the
> > > >>>>>>> root cause.
> > > >>>>>>>
> > > >>>>>>> I will keep you posted and update our findings.
> > > >>>>>>> Love to hear your thoughts,
> > > >>>>>>>
> > > >>>>>>> Cheers,
> > > >>>>>>> Lior.
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>> -----Original Message-----
> > > >>>>>>>> From: Dirk Behme <dirk.behme@xxxxxxxxx>
> > > >>>>>>>> Sent: Tuesday, December 19, 2023 3:37 PM
> > > >>>>>>>> To: Lior Weintraub <liorw@xxxxxxxxxx>; linux-
> > > embedded@xxxxxxxxxxxxxxx
> > > >>>>>>>> Subject: Re: Debugging early SError exception
> > > >>>>>>>>
> > > >>>>>>>> On 19.12.23 at 14:23, Lior Weintraub wrote:
> > > >>>>>>>>> Thanks Dirk,
> > > >>>>>>>>
> > > >>>>>>>> Welcome :)
> > > >>>>>>>>
> > > >>>>>>>> In case you find the root cause it would be nice to get some generic
> > > >>>>>>>> description of it so that we can learn something :)
> > > >>>>>>>>
> > > >>>>>>>> Best regards
> > > >>>>>>>>
> > > >>>>>>>> Dirk
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>>>> -----Original Message-----
> > > >>>>>>>>>> From: Dirk Behme <dirk.behme@xxxxxxxxx>
> > > >>>>>>>>>> Sent: Tuesday, December 19, 2023 9:09 AM
> > > >>>>>>>>>> To: Lior Weintraub <liorw@xxxxxxxxxx>; linux-
> > > >>>>>> embedded@xxxxxxxxxxxxxxx
> > > >>>>>>>>>> Subject: Re: Debugging early SError exception
> > > >>>>>>>>>>
> > > >>>>>>>>>> On 17.12.23 at 22:32, Lior Weintraub wrote:
> > > >>>>>>>>>>> Hi,
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> We have a new SoC with an eLinux port (kernel v6.5).
> > > >>>>>>>>>>> This SoC is an ARM64 (A53) single-core device.
> > > >>>>>>>>>>> It runs correctly on QEMU but fails with an SError on the emulation platform
> > > >>>>>>>>>>> (Synopsys Zebu running our SoC model).
> > > >>>>>>>>>>> There is no debugger connected to this emulation, but there are several
> > > >>>>>>>>>>> debug capabilities we can use:
> > > >>>>>>>>>>> 1. Generating a wave dump of CPU signals
> > > >>>>>>>>>>> 2. Generating a Tarmac log
> > > >>>>>>>>>>> 3. UART
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Since the SError happens at an early stage of the Linux boot, the UART is not
> > > >>>>>>>>>>> enabled yet.
> > > >>>>>>>>>>>      From the Tarmac log we can see:
> > > >>>>>>>>>>>       3824884521 ps  ES  (ffff800080760888:d65f03c0) O el1h_ns:   ret     (parse_early_param)
> > > >>>>>>>>>>>       3824884522 ps  ES  (ffff800080763a60:d2801800) O el1h_ns:   mov x0, #0xc0   //      #192    (setup_arch)
> > > >>>>>>>>>>>                          R X0 (AARCH64) 00000000 000000c0
> > > >>>>>>>>>>>       3824884523 ps  ES  (ffff800080763a64:d51b4220) O el1h_ns:   msr daif, x0    (setup_arch)
> > > >>>>>>>>>>>                          R CPSR 600000c5
> > > >>>>>>>>>>>       3824884529 ps  ES  System Error (Abort)
> > > >>>>>>>>>>>                          EXC [0x380] SError/vSError Current EL with SP_ELx
> > > >>>>>>>>>>>                          R ESR_EL1 (AARCH64) bf000002
> > > >>>>>>>>>>>                          R CPSR 600003c5
> > > >>>>>>>>>>>                          R SPSR_EL1 (AARCH64) 600000c5
> > > >>>>>>>>>>>                          R ELR_EL1 (AARCH64) ffff8000 80763a68
> > > >>>>>>>>>>>       3824884925 ps  ES  (ffff800080010b80:d10543ff) O el1h_ns:   sub sp, sp, #0x150      (vectors)
> > > >>>>>>>>>>>                          R SP_EL1 (AARCH64) ffff8000 808f3c50
> > > >>>>>>>>>>>       3824884925 ps  ES  (ffff800080010b84:8b2063ff) O el1h_ns:   add sp, sp, x0  (vectors)
> > > >>>>>>>>>>>                          R SP_EL1 (AARCH64) ffff8000 808f3d10
> > > >>>>>>>>>>>       3824884926 ps  ES  (ffff800080010b88:cb2063e0) O el1h_ns:   sub x0, sp, x0  (vectors)
> > > >>>>>>>>>>>                          R X0 (AARCH64) ffff8000 808f3c50
> > > >>>>>>>>>>>       3824884927 ps  ES  (ffff800080010b8c:37700080) O el1h_ns:   tbnz w0, #14, ffff800080010b9c  <vectors+0x39c> (vectors)
> > > >>>>>>>>>>>       3824884935 ps  ES  (ffff800080010b90:cb2063e0) O el1h_ns:   sub x0, sp, x0  (vectors)
> > > >>>>>>>>>>>                          R X0 (AARCH64) 00000000 000000c0
> > > >>>>>>>>>>>       3824884937 ps  ES  (ffff800080010b94:cb2063ff) O el1h_ns:   sub sp, sp, x0  (vectors)
> > > >>>>>>>>>>>                          R SP_EL1 (AARCH64) ffff8000 808f3c50
> > > >>>>>>>>>>>       3824884938 ps  ES  (ffff800080010b98:140001ef) O el1h_ns:   b ffff800080011354      <el1h_64_error> (vectors)
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> If I understand correctly, the exception happened sometime earlier, and only
> > > >>>>>>>>>>> now did the Linux boot code (setup_arch) unmask exception handling; as
> > > >>>>>>>>>>> a result we immediately jump to the SError exception handler.
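The masking described in the Tarmac excerpt can be read straight off the DAIF bits. A minimal sketch, assuming the usual AArch64 PSTATE layout (D = bit 9, A = bit 8, I = bit 7, F = bit 6); the helper name is made up for illustration.

```python
DAIF_BITS = (("D", 9), ("A", 8), ("I", 7), ("F", 6))

def daif_string(pstate: int) -> str:
    """Upper case = masked, lower case = unmasked (same style as the kernel's pstate line)."""
    return "".join(
        name if pstate & (1 << bit) else name.lower()
        for name, bit in DAIF_BITS
    )

print(daif_string(0xC0))        # value written by "msr daif, x0"
print(daif_string(0x600000C5))  # SPSR_EL1 saved when the SError was taken
print(daif_string(0x600003C5))  # CPSR after exception entry
```

Writing 0xc0 keeps IRQ and FIQ masked but clears the A bit, so an SError that was already pending is taken right after the msr, and exception entry then masks all four again.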
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> Yes, that sounds reasonable. If I understood correctly, you are
> > > >>>>>>>>>> running something "quite new" on software (QEMU) and
> > > >>>>>>>>>> hardware (Synopsys) simulators.
> > > >>>>>>>>>>
> > > >>>>>>>>>> That would mean that you have new hardware with e.g. a new memory map
> > > >>>>>>>>>> not used before. What you describe sounds like something in the code before
> > > >>>>>>>>>> Linux (boot loader) is resulting in the SError. This
> > > >>>>>>>>>> might be an access to non-existing or non-enabled hardware, i.e. it
> > > >>>>>>>>>> might be that you try to access (read/write) an address that is not
> > > >>>>>>>>>> available yet (or just invalid). It's hard to debug that. In case you
> > > >>>>>>>>>> are able to modify the code before Linux (the boot loader?) you might
> > > >>>>>>>>>> try to enable SError exceptions there, too, to get it earlier and
> > > >>>>>>>>>> with that make the search window smaller. I'm not that familiar with
> > > >>>>>>>>>> QEMU, but could you try to trace which (all?) hardware accesses your
> > > >>>>>>>>>> code does, and with that check whether
> > > >>>>>>>>>> all these accesses are valid even on the hardware (Synopsys) emulation
> > > >>>>>>>>>> system? That should be checked from both the valid-address and the hardware
> > > >>>>>>>>>> subsystem enablement point of view.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Hth,
> > > >>>>>>>>>>
> > > >>>>>>>>>> Dirk
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>>      From the Linux source:
> > > >>>>>>>>>>>           parse_early_param();
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>           dynamic_scs_init();
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>           /*
> > > >>>>>>>>>>>            * Unmask asynchronous aborts and fiq after bringing up possible
> > > >>>>>>>>>>>            * earlycon. (Report possible System Errors once we can report this
> > > >>>>>>>>>>>            * occurred).
> > > >>>>>>>>>>>            */
> > > >>>>>>>>>>>           local_daif_restore(DAIF_PROCCTX_NOIRQ); /* <---- This is when we get the exception. */
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> After some kernel hacking (replacing printk) we could extract the logs:
> > > >>>>>>>>>>> 6Booting Linux on physical CPU 0x0000000000 [0x410fd034]
> > > >>>>>>>>>>> 5Linux version 6.5.0 (pliops@dev-liorw) (aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot 2023.02.1-95-g8391404e23) 11.3.0, GNU ld (GNU Binutils) 2.38) #101 SMP Sun Dec 17 20:09:06 IST 2023
> > > >>>>>>>>>>> 6Machine model: Pliops Spider MK-I EVK
> > > >>>>>>>>>>> 2SError Interrupt on CPU0, code 0x00000000bf000002 -- SError
> > > >>>>>>>>>>> CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0 #101
> > > >>>>>>>>>>> Hardware name: Pliops Spider MK-I EVK (DT)
> > > >>>>>>>>>>> pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > >>>>>>>>>>> pc : setup_arch+0x13c/0x5ac
> > > >>>>>>>>>>> lr : setup_arch+0x134/0x5ac
> > > >>>>>>>>>>> sp : ffff8000808f3da0
> > > >>>>>>>>>>> x29: ffff8000808f3da0c x28: 0000000008758074c x27: 0000000005e31b58c
> > > >>>>>>>>>>> x26: 0000000000000001c x25: 0000000007e5f728c x24: ffff8000808f8000c
> > > >>>>>>>>>>> x23: ffff8000808f8600c x22: ffff8000807b6000c x21: ffff800080010000c
> > > >>>>>>>>>>> x20: ffff800080a1e000c x19: fffffbfffddfe190c x18: 000000002266684ac
> > > >>>>>>>>>>> x17: 00000000fcad60bbc x16: 0000000000001800c x15: 0000000000000008c
> > > >>>>>>>>>>> x14: ffffffffffffffffc x13: 0000000000000000c x12: 0000000000000003c
> > > >>>>>>>>>>> x11: 0101010101010101c x10: ffffffffffee87dfc x9 : 0000000000000038c
> > > >>>>>>>>>>> x8 : 0101010101010101c x7 : 7f7f7f7f7f7f7f7fc x6 : 0000000000000001c
> > > >>>>>>>>>>> x5 : 0000000000000000c x4 : 8000000000000000c x3 : 0000000000000065c
> > > >>>>>>>>>>> x2 : 0000000000000000c x1 : 0000000000000000c x0 : 00000000000000c0c
> > > >>>>>>>>>>> 0Kernel panic - not syncing: Asynchronous SError Interrupt
> > > >>>>>>>>>>> CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0 #101
> > > >>>>>>>>>>> Hardware name: Pliops Spider MK-I EVK (DT)
> > > >>>>>>>>>>> Call trace:
> > > >>>>>>>>>>>       dump_backtrace+0x9c/0xd0
> > > >>>>>>>>>>>       show_stack+0x14/0x1c
> > > >>>>>>>>>>>       dump_stack_lvl+0x44/0x58
> > > >>>>>>>>>>>       dump_stack+0x14/0x1c
> > > >>>>>>>>>>>       panic+0x2e0/0x33c
> > > >>>>>>>>>>>       nmi_panic+0x68/0x6c
> > > >>>>>>>>>>>       arm64_serror_panic+0x68/0x78
> > > >>>>>>>>>>>       do_serror+0x24/0x54
> > > >>>>>>>>>>>       el1h_64_error_handler+0x2c/0x40
> > > >>>>>>>>>>>       el1h_64_error+0x64/0x68
> > > >>>>>>>>>>>       setup_arch+0x13c/0x5ac
> > > >>>>>>>>>>>       start_kernel+0x5c/0x5b8
> > > >>>>>>>>>>>       __primary_switched+0xb4/0xbc
> > > >>>>>>>>>>> 0---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---
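The syndrome code in the panic line can be split into its architectural fields. A minimal sketch, assuming the standard ESR_ELx layout (EC in bits [31:26], IL in bit 25, ISS in bits [24:0]) and, for SError, IDS in ISS bit 24; the function names are made up for illustration.

```python
EC_SERROR = 0x2F  # exception class for an SError interrupt

def decode_esr(esr: int) -> dict:
    """Split an ESR_ELx value into its top-level fields."""
    return {
        "EC": (esr >> 26) & 0x3F,
        "IL": (esr >> 25) & 0x1,
        "ISS": esr & 0x1FFFFFF,
    }

def decode_serror_iss(iss: int) -> dict:
    """For EC == 0x2F: IDS = 1 means the low 24 bits are implementation defined."""
    return {"IDS": (iss >> 24) & 0x1, "low24": iss & 0xFFFFFF}

fields = decode_esr(0xBF000002)
print({k: hex(v) for k, v in fields.items()})
print(decode_serror_iss(fields["ISS"]))
```

With IDS set, the remaining syndrome bits (0x2 here) are not defined by the architecture, so their meaning has to be looked up in the CPU's own technical reference manual.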
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Can you please advise how to proceed with debugging?
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Thanks in advance,
> > > >>>>>>>>>>> Cheers,
> > > >>>>>>>>>>> Lior.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>
> > > >>>>>
> > > >>>>
> > > >>>
> > > >>> --
> > > >>> DENX Software Engineering GmbH,      Managing Director: Erika Unter
> > > >>> HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell,
> > Germany
> > > >>> Phone: +49-8142-66989-52   Fax: +49-8142-66989-80   Email:
> > > hs@xxxxxxx
> > > >
> > >
> > > --
> > > DENX Software Engineering GmbH,      Managing Director: Erika Unter
> > > HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
> > > Phone: +49-8142-66989-52   Fax: +49-8142-66989-80   Email:
> hs@xxxxxxx
