Thanks Heiko, Will do that. Cheers, Lior. > -----Original Message----- > From: Heiko Schocher <hs@xxxxxxx> > Sent: Thursday, December 21, 2023 1:37 PM > To: Lior Weintraub <liorw@xxxxxxxxxx> > Cc: Dirk Behme <dirk.behme@xxxxxxxxx>; linux-embedded@xxxxxxxxxxxxxxx > Subject: Re: Debugging early SError exception > > [You don't often get email from hs@xxxxxxx. Learn why this is important at > https://aka.ms/LearnAboutSenderIdentification ] > > CAUTION: External Sender > > Hi Lior, > > On 21.12.23 12:19, Dirk Behme wrote: > > Am 21.12.23 um 11:04 schrieb Lior Weintraub: > >> Thanks Dirk, > >> > >> Regarding the earlyprintk, not sure I know how to make it work. > >> I have defined CONFIG_EARLY_PRINTK=y and CONFIG_DEBUG_LL=y on my > config but it doesn't seem to work. > >> Do I need to pass something in the bootargs from the U-BOOT? > >> Do I need to add that into my device tree? > >> (Tried to set bootargs = "console=ttyS0,115200 earlyprintk"; under > "chosen" on my DT but it didn't > >> work) > > > > Yes, what has to be enabled and what not and what has to be set how is > often confusing. I think this > > is not common for all systems, so I think to be on the safe side you have to > look into the code for > > you system. Or short; The code is the documentation ;) > > > > > >> The UART I am using is "snps,dw-apb-uart". > >> > >> Last week, to output the early logs I have implemented this hack: > >> 1. Modify printk macro to run my print_func > >> 2. This print_func wrote the characters into a single global variable (u32 > simul_uart;) > >> 3. Get the address location of this global variable and extract all writes to it > from the Tarmac > >> logs. > >> > >> This is a very slow and tedious process but it helped me identify the initial > SError. > >> Initially I thought I can write directly into the UART FIFO register (which I > know the address) > >> but this didn't work because Linux already setup the MMU so I guess I need > to know the virtual > >> address of this FIFO. > >> Do I need to use __phys_to_virt of some sort? > > > > Yes, I think so. Have a look to the existing serial driver, too. It should do > whats needed, and you > > can borrow that, then. > > If you have access to the RAM after the crash (through a debugger or in > your bootloader) and your mem is stable, find out the address of __log_buf > in System.map. Thats the buffer where printk writes into it, and so dumping > the content is what you would see in case uart works... > > Hope it helps! > > bye, > Heiko > > > > Best regards > > > > Dirk > > > > > >> Cheers, > >> Lior. > >> > >>> -----Original Message----- > >>> From: Dirk Behme <dirk.behme@xxxxxxxxx> > >>> Sent: Thursday, December 21, 2023 10:30 AM > >>> To: Lior Weintraub <liorw@xxxxxxxxxx>; linux-embedded@xxxxxxxxxxxxxxx > >>> Subject: Re: Debugging early SError exception > >>> > >>> [You don't often get email from dirk.behme@xxxxxxxxx. Learn why this is > >>> important at https://aka.ms/LearnAboutSenderIdentification ] > >>> > >>> CAUTION: External Sender > >>> > >>> Am 21.12.23 um 08:43 schrieb Lior Weintraub: > >>>> Hi Dirk, > >>>> > >>>> We found that the issue was at the early stages of Barebox (a.k.a U- > BOOT > >>> v2). > >>> > >>> Glad to hear that! :) > >>> > >>>> Our implementation of putc_ll (on debug_ll) was writing into the UART Tx > >>> FIFO without checking if the FIFO is full. > >>>> Once the fifo got full it caused this SError probably because the UART IP > >>> generated an apberror signal. > >>> > >>> Thanks for the report! > >>> > >>>> Now the Linux is running and doesn't report the SError again but now we > >>> face another issue. > >>>> We see that the PC is getting into a "report_bug" function. > >>>> The Linux doesn't print anything to the UART (probably since it hasn't got > to > >>> the point where the console is configured?). > >>> > >>> For cases like this using earlyprintk is usually a good option. Check > >>> the Linux kernel serial console (UART) dirver of you SoC if it > >>> supports it. In the end it should be "just" a function in the serial > >>> console driver which outputs the console data via polling before > >>> (later) the interrupt driven console part takes over. > >>> > >>> Best regards > >>> > >>> Dirk > >>> > >>> > >>>> Since our debug means are limited it can take some time to find the root > >>> cause. > >>>> > >>>> I will keep you posted and update our findings. > >>>> Love to hear your thoughts, > >>>> > >>>> Cheers, > >>>> Lior. > >>>> > >>>> > >>>>> -----Original Message----- > >>>>> From: Dirk Behme <dirk.behme@xxxxxxxxx> > >>>>> Sent: Tuesday, December 19, 2023 3:37 PM > >>>>> To: Lior Weintraub <liorw@xxxxxxxxxx>; linux- > embedded@xxxxxxxxxxxxxxx > >>>>> Subject: Re: Debugging early SError exception > >>>>> > >>>>> [You don't often get email from dirk.behme@xxxxxxxxx. Learn why this > is > >>>>> important at https://aka.ms/LearnAboutSenderIdentification ] > >>>>> > >>>>> CAUTION: External Sender > >>>>> > >>>>> Am 19.12.23 um 14:23 schrieb Lior Weintraub: > >>>>>> Thanks Dirk, > >>>>> > >>>>> Welcome :) > >>>>> > >>>>> In case you find the root cause it would be nice to get some generic > >>>>> description of it so that we can learn something :) > >>>>> > >>>>> Best regards > >>>>> > >>>>> Dirk > >>>>> > >>>>> > >>>>>>> -----Original Message----- > >>>>>>> From: Dirk Behme <dirk.behme@xxxxxxxxx> > >>>>>>> Sent: Tuesday, December 19, 2023 9:09 AM > >>>>>>> To: Lior Weintraub <liorw@xxxxxxxxxx>; linux- > >>> embedded@xxxxxxxxxxxxxxx > >>>>>>> Subject: Re: Debugging early SError exception > >>>>>>> > >>>>>>> [You don't often get email from dirk.behme@xxxxxxxxx. Learn why > this > >>> is > >>>>>>> important at https://aka.ms/LearnAboutSenderIdentification ] > >>>>>>> > >>>>>>> CAUTION: External Sender > >>>>>>> > >>>>>>> Am 17.12.23 um 22:32 schrieb Lior Weintraub: > >>>>>>>> Hi, > >>>>>>>> > >>>>>>>> We have a new SoC with eLinux porting (kernel v6.5). > >>>>>>>> This SoC is ARM64 (A53) single core based device. > >>>>>>>> It runs correctly on QEMU but fails with SError on emulation > platform > >>>>>>> (Synopsys Zebu running our SoC model). > >>>>>>>> There is no debugger connected to this emulation but there are > several > >>>>>>> debug capabilities we can use: > >>>>>>>> 1. Generating wave dump of CPU signals > >>>>>>>> 2. Generate a Tarmac log > >>>>>>>> 3. UART > >>>>>>>> > >>>>>>>> Since the SError happens at early stages of Linux boot the UART is > not > >>>>>>> enabled yet. > >>>>>>>> From the Tarmac log we can see: > >>>>>>>> 3824884521 ps ES (ffff800080760888:d65f03c0) O el1h_ns: > ret > >>>>>>> (parse_early_param) > >>>>>>>> 3824884522 ps ES (ffff800080763a60:d2801800) O el1h_ns: > mov > >>>>> x0, > >>>>>>> #0xc0 // #192 (setup_arch) > >>>>>>>> R X0 (AARCH64) 00000000 000000c0 > >>>>>>>> 3824884523 ps ES (ffff800080763a64:d51b4220) O el1h_ns: > msr > >>>>>>> daif, x0 (setup_arch) > >>>>>>>> R CPSR 600000c5 > >>>>>>>> 3824884529 ps ES System Error (Abort) > >>>>>>>> EXC [0x380] SError/vSError Current EL with SP_ELx > >>>>>>>> R ESR_EL1 (AARCH64) bf000002 > >>>>>>>> R CPSR 600003c5 > >>>>>>>> R SPSR_EL1 (AARCH64) 600000c5 > >>>>>>>> R ELR_EL1 (AARCH64) ffff8000 80763a68 > >>>>>>>> 3824884925 ps ES (ffff800080010b80:d10543ff) O el1h_ns: > sub > >>>>> sp, > >>>>>>> sp, #0x150 (vectors) > >>>>>>>> R SP_EL1 (AARCH64) ffff8000 808f3c50 > >>>>>>>> 3824884925 ps ES (ffff800080010b84:8b2063ff) O el1h_ns: > add > >>>>> sp, > >>>>>>> sp, x0 (vectors) > >>>>>>>> R SP_EL1 (AARCH64) ffff8000 808f3d10 > >>>>>>>> 3824884926 ps ES (ffff800080010b88:cb2063e0) O el1h_ns: > sub > >>>>> x0, > >>>>>>> sp, x0 (vectors) > >>>>>>>> R X0 (AARCH64) ffff8000 808f3c50 > >>>>>>>> 3824884927 ps ES (ffff800080010b8c:37700080) O el1h_ns: > tbnz > >>>>> w0, > >>>>>>> #14, ffff800080010b9c <vectors+0x39c> (vectors) > >>>>>>>> 3824884935 ps ES (ffff800080010b90:cb2063e0) O el1h_ns: > sub > >>>>> x0, > >>>>>>> sp, x0 (vectors) > >>>>>>>> R X0 (AARCH64) 00000000 000000c0 > >>>>>>>> 3824884937 ps ES (ffff800080010b94:cb2063ff) O el1h_ns: > sub > >>> sp, > >>>>>>> sp, x0 (vectors) > >>>>>>>> R SP_EL1 (AARCH64) ffff8000 808f3c50 > >>>>>>>> 3824884938 ps ES (ffff800080010b98:140001ef) O el1h_ns: b > >>>>>>> ffff800080011354 <el1h_64_error> (vectors) > >>>>>>>> > >>>>>>>> If I understand correctly, the exception happened sometime earlier > and > >>>>> only > >>>>>>> now Linux boot code (setup_arch) opened the exception handling > and as > >>> a > >>>>>>> result we immediately jump to the SError exception handler. > >>>>>>> > >>>>>>> > >>>>>>> Yes, that sounds reasonable. If I understood correctly, you are > >>>>>>> running something "quite new" on some software (QEMU) and > >>> hardware > >>>>>>> (Synopsis) simulators. > >>>>>>> > >>>>>>> That would mean that you have new hardware with e.g. new > memory > >>> map > >>>>>>> not used before. What you describe might sound like in the code > before > >>>>>>> Linux (boot loader) there is anything resulting in the SError. This > >>>>>>> might be an access to non-existing or non-enabled hardware. I.e. it > >>>>>>> might be that you try to access (read/write) an address what is not > >>>>>>> available, yet (or just invalid). It's hard to debug that. In case you > >>>>>>> are able to modify the code before Linux (the boot loader?) you might > >>>>>>> try to enable SError exceptions, there, too. To get it earlier and > >>>>>>> with that make the search window smaller. I'm not that familiar with > >>>>>>> QEMU, but could you try to trace which (all?) hardware accesses your > >>>>>>> code does. And with that analyse all accesses and with that check if > >>>>>>> all these accesses are valid even on the hardware (Synopsis) > emulation > >>>>>>> system? That should be checked from valid address and from > hardware > >>>>>>> subsystem enablement point of view. > >>>>>>> > >>>>>>> Hth, > >>>>>>> > >>>>>>> Dirk > >>>>>>> > >>>>>>> > >>>>>>>> From the Linux source: > >>>>>>>> parse_early_param(); > >>>>>>>> > >>>>>>>> dynamic_scs_init(); > >>>>>>>> > >>>>>>>> /* > >>>>>>>> * Unmask asynchronous aborts and fiq after bringing up > possible > >>>>>>>> * earlycon. (Report possible System Errors once we can report > this > >>>>>>>> * occurred). > >>>>>>>> */ > >>>>>>>> local_daif_restore(DAIF_PROCCTX_NOIRQ); <---- This is when > we > >>> get > >>>>> the > >>>>>>> exception. > >>>>>>>> > >>>>>>>> After some kernel hacking (replacing printk) we could extract the > logs: > >>>>>>>> 6Booting Linux on physical CPU 0x0000000000 [0x410fd034] > >>>>>>>> 5Linux version 6.5.0 (pliops@dev-liorw) (aarch64-buildroot-linux- > gnu- > >>>>>>> gcc.br_real (Buildroot 2023.02.1-95-g8391404e23) 11.3.0, GNU ld > >>> (GNU > >>>>>>> Binutils) 2.38) #101 SMP Sun Dec 17 20:09:06 IST 2023 > >>>>>>>> 6Machine model: Pliops Spider MK-I EVK > >>>>>>>> 2SError Interrupt on CPU0, code 0x00000000bf000002 -- SError > >>>>>>>> CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0 #101 > >>>>>>>> Hardware name: Pliops Spider MK-I EVK (DT) > >>>>>>>> pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--) > >>>>>>>> pc : setup_arch+0x13c/0x5ac > >>>>>>>> lr : setup_arch+0x134/0x5ac > >>>>>>>> sp : ffff8000808f3da0 > >>>>>>>> x29: ffff8000808f3da0c x28: 0000000008758074c x27: > >>>>>>> 0000000005e31b58c > >>>>>>>> x26: 0000000000000001c x25: 0000000007e5f728c x24: > >>>>>>> ffff8000808f8000c > >>>>>>>> x23: ffff8000808f8600c x22: ffff8000807b6000c x21: > >>>>> ffff800080010000c > >>>>>>>> x20: ffff800080a1e000c x19: fffffbfffddfe190c x18: > >>> 000000002266684ac > >>>>>>>> x17: 00000000fcad60bbc x16: 0000000000001800c x15: > >>>>>>> 0000000000000008c > >>>>>>>> x14: ffffffffffffffffc x13: 0000000000000000c x12: > >>> 0000000000000003c > >>>>>>>> x11: 0101010101010101c x10: ffffffffffee87dfc x9 : > >>>>> 0000000000000038c > >>>>>>>> x8 : 0101010101010101c x7 : 7f7f7f7f7f7f7f7fc x6 : > >>>>> 0000000000000001c > >>>>>>>> x5 : 0000000000000000c x4 : 8000000000000000c x3 : > >>>>>>> 0000000000000065c > >>>>>>>> x2 : 0000000000000000c x1 : 0000000000000000c x0 : > >>>>>>> 00000000000000c0c > >>>>>>>> 0Kernel panic - not syncing: Asynchronous SError Interrupt > >>>>>>>> CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0 #101 > >>>>>>>> Hardware name: Pliops Spider MK-I EVK (DT) > >>>>>>>> Call trace: > >>>>>>>> dump_backtrace+0x9c/0xd0 > >>>>>>>> show_stack+0x14/0x1c > >>>>>>>> dump_stack_lvl+0x44/0x58 > >>>>>>>> dump_stack+0x14/0x1c > >>>>>>>> panic+0x2e0/0x33c > >>>>>>>> nmi_panic+0x68/0x6c > >>>>>>>> arm64_serror_panic+0x68/0x78 > >>>>>>>> do_serror+0x24/0x54 > >>>>>>>> el1h_64_error_handler+0x2c/0x40 > >>>>>>>> el1h_64_error+0x64/0x68 > >>>>>>>> setup_arch+0x13c/0x5ac > >>>>>>>> start_kernel+0x5c/0x5b8 > >>>>>>>> __primary_switched+0xb4/0xbc > >>>>>>>> 0---[ end Kernel panic - not syncing: Asynchronous SError Interrupt > ]--- > >>>>>>>> > >>>>>>>> Can you please advice how to proceed with debugging? > >>>>>>>> > >>>>>>>> Thanks in advanced, > >>>>>>>> Cheers, > >>>>>>>> Lior. > >>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>> > >>>> > >> > > > > -- > DENX Software Engineering GmbH, Managing Director: Erika Unter > HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany > Phone: +49-8142-66989-52 Fax: +49-8142-66989-80 Email: hs@xxxxxxx