RE: [PATCH v2] Porting barebox to a new SoC

Lior Weintraub <liorw@xxxxxxxxxx> · Thu, 3 Aug 2023 11:17:08 +0000

Hi Ahmad,

Hope you had a great time at EOSS 2023 :-)
Quick recap and additional info on the current issue:

1. 
The spider-soc QEMU with the additional GICv3 and timers was tested with bare-metal code and proved to work.
This bare-metal code sets up the A53 timers and the GICv3 to handle interrupts at various exception levels as well as in various security states:
EL1_NS_PHYSICAL_TIMER set as GROUP1_NON_SECURE
EL1_SCR_PHYSICAL_TIMER set as GROUP1_SECURE
EL2_PHYSICAL_TIMER set as GROUP1_SECURE
VIRTUAL_TIMER set as GROUP1_NON_SECURE

2.
The kernel we build with Buildroot runs OK on virt QEMU but gets stuck partway through boot when we use our spider-soc QEMU.
There are a few differences between those runs:
a.
The virt QEMU is run with the -kernel switch, so QEMU itself acts as the "bootloader" and prepares the DT passed to the kernel.
On this platform the kernel starts at EL1.
b.
The spider-soc QEMU is executed with -device loader,file=spider-soc-bl1.elf
For easy execution and testing, this executable embeds all the needed binaries (as const data blobs) and copies them to the correct locations before jumping to barebox.
The list of binaries includes the barebox, kernel, dt, and rootfs.
As you recall, BL31 is compiled via Trusted-Firmware-A and has all its functions as empty stubs because we currently don't care about CPU power states.
The proof that BL31 is executed correctly is that barebox now runs at EL2.
At that point the Linux kernel starts and, as I mentioned, gets stuck partway through boot (in the cpu_do_idle function; more details below).

Debugging the kernel with GDB revealed a few differences:
1. When running with barebox, the kernel starts at EL2 and at some point moves to EL1.
I am not sure whether that affects the issue below, but I thought it worth mentioning.
(We get a "CPU: All CPU(s) started at EL2" trace)
Another difference that might be related to the exception level is that the timer settings show the kernel uses the physical timer (as opposed to the virt QEMU run, which uses the virtual timer):
The spider-soc QEMU Timers dump:
CNTFRQ_EL0 = 0x3b9aca0
CNTP_CTL_EL0 = 0x5
CNTV_CTL_EL0 = 0x0
CNTP_TVAL_EL0 = 0xff1f2ad5
CNTP_CVAL_EL0 = 0xac5c3240
CNTV_TVAL_EL0 = 0x52c2d916
CNTV_CVAL_EL0 = 0x0

The virt QEMU Timers dump:
CNTFRQ_EL0 = 0x3b9aca0
CNTP_CTL_EL0 = 0x0
CNTV_CTL_EL0 = 0x5
CNTP_TVAL_EL0 = 0xb8394fbc
CNTP_CVAL_EL0 = 0x0
CNTV_TVAL_EL0 = 0xffd18e39
CNTV_CVAL_EL0 = 0x479858aa
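
To make the two dumps easier to compare, here is a minimal Python sketch that decodes the control and TVAL registers. It assumes the standard Arm generic timer layout of CNT{P,V}_CTL_EL0 (bit 0 ENABLE, bit 1 IMASK, bit 2 ISTATUS) and treats TVAL as a signed 32-bit value. The function names are made up for this sketch and are not part of any kernel or QEMU API.

```python
# Bit layout of CNTP_CTL_EL0 / CNTV_CTL_EL0 (Arm generic timer).
ENABLE = 1 << 0   # timer enabled
IMASK = 1 << 1    # interrupt masked at the timer
ISTATUS = 1 << 2  # timer condition met


def decode_ctl(value):
    """Split a CNT{P,V}_CTL_EL0 value into its three flag bits."""
    return {
        "enable": bool(value & ENABLE),
        "imask": bool(value & IMASK),
        "istatus": bool(value & ISTATUS),
    }


def tval_signed(value):
    """Interpret a 32-bit TVAL readout as a signed down-counter value."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & (1 << 31) else value


def irq_asserted(ctl):
    """Timer output is asserted when enabled, unmasked and the condition is met."""
    flags = decode_ctl(ctl)
    return flags["enable"] and flags["istatus"] and not flags["imask"]


if __name__ == "__main__":
    cntfrq = 0x3B9ACA0  # CNTFRQ_EL0 from both dumps (62.5 MHz)
    for name, ctl, tval in (
        ("spider-soc CNTP", 0x5, 0xFF1F2AD5),
        ("virt CNTV", 0x5, 0xFFD18E39),
    ):
        ticks = tval_signed(tval)
        print(name, decode_ctl(ctl), "asserted:", irq_asserted(ctl),
              "tval: %d ticks (%.3f s)" % (ticks, ticks / cntfrq))
```

Read this way, both dumps show an enabled, unmasked timer whose condition is already met (negative TVAL, ISTATUS set). That suggests the physical timer itself fires on spider-soc and the interrupt is lost further down the path, but this is an interpretation of the values and not something verified separately.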

2. When running with barebox, the kernel fails to set the GICv3 registers correctly.
As a result, there are no timer events and hence the scheduler is not running.
The code gets stuck in cpu_do_idle, but we also found that the RCU cb_list is not empty (which probably explains why the scheduler hasn't started, but that is just a guess).
We placed a breakpoint just before the call to wait_for_completion (in function rcu_barrier in kernel/rcu/tree.c) and found:
bt
#0  rcu_barrier () at kernel/rcu/tree.c:4064
#1  0xffffffc08059e1b4 in mark_readonly () at init/main.c:1789
#2  kernel_init (unused=<optimized out>) at init/main.c:1838
#3  0xffffffc080015e48 in ret_from_fork () at arch/arm64/kernel/entry.S:853

At that point rcu_state.barrier_cpu_count.counter is 1 (as opposed to virt QEMU, where it is 0 at that point).
If we place the breakpoint a bit earlier in rcu_barrier (just before the for_each_possible_cpu loop) and run a few more steps (to get the rdp), we see that rdp->cblist.len is 0x268 (616):
p/x rdp->cblist
$1 = {head = 0xffffffc0808f06d0, tails = {0xffffff802fe55a78, 0xffffff802fe55a78, 0xffffff802fe55a78, 0xffffff80001c22c8}, gp_seq = {0x0, 0x0, 0x0, 0x0}, len = 0x268, seglen = {0x0, 0x0, 0x0, 0x268}, flags = 0x1}

When we compare that with virt QEMU we see that the rdp->cblist.len is 0 there.
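
The seglen array in that dump can be read per segment. The sketch below is only an illustration: it assumes the four-segment layout of struct rcu_segcblist (RCU_DONE_TAIL, RCU_WAIT_TAIL, RCU_NEXT_READY_TAIL, RCU_NEXT_TAIL, in that index order), and the helper names are made up for this example.

```python
# Segment names in index order, as in include/linux/rcu_segcblist.h.
SEGMENTS = ("RCU_DONE_TAIL", "RCU_WAIT_TAIL", "RCU_NEXT_READY_TAIL", "RCU_NEXT_TAIL")


def summarize_cblist(length, seglen):
    """Map each segment name to its callback count and sanity-check the total."""
    if len(seglen) != len(SEGMENTS):
        raise ValueError("expected %d segment lengths" % len(SEGMENTS))
    if sum(seglen) != length:
        raise ValueError("segment lengths do not add up to len")
    return dict(zip(SEGMENTS, seglen))


def waiting_for_grace_period(summary):
    """Callbacks not yet ready to invoke (everything outside RCU_DONE_TAIL)."""
    return sum(n for name, n in summary.items() if name != "RCU_DONE_TAIL")


if __name__ == "__main__":
    # Values taken from the spider-soc gdb dump above.
    spider = summarize_cblist(0x268, [0x0, 0x0, 0x0, 0x268])
    print(spider)
    print("not yet invocable:", waiting_for_grace_period(spider))
```

With the spider-soc values, all 616 callbacks sit in RCU_NEXT_TAIL, meaning none has been assigned to a grace period yet. That would fit a system where grace periods do not advance because no timer ticks arrive, though that link is a guess.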

IMHO, all of this is a result of the GICv3 settings not being applied properly.
As a result there are no timer interrupts.

Further debugging of the GICv3 settings showed that the code (function gic_cpu_init in drivers/irqchip/irq-gic-v3.c) tries to write 0xffffffff to GICR_IGROUPR0 (configure SGIs/PPIs as Non-secure Group 1), but when we read it back we get all zeros.
Dumping GICv3 settings after the call to init_IRQ:
Showing only the differences:
                      Spider-SoC QEMU    virt QEMU
GICD_CTLR           = 0x00000012         0x00000053
GICD_TYPER          = 0x037a0402         0x037a0007
GICR0_IGROUPR0      = 0x00000000         0xffffffff
GICR0_ISENABLER0    = 0x00000000         0x0000007f
GICR0_ICENABLER0    = 0x00000000         0x0000007f
GICR0_ICFGR0        = 0x00000000         0xaaaaaaaa
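
The GICD_CTLR and GICD_TYPER differences can be decoded with a short script. This is a minimal sketch that assumes the GICv3 bit positions as I understand them: GICD_CTLR bit 6 is DS, bit 4 is ARE_NS (ARE when DS is 1), and bits 1 and 0 are group enables whose exact meaning depends on the access view; GICD_TYPER bit 10 is SecurityExtn and bits [4:0] are ITLinesNumber. The constant and function names are chosen for this sketch only.

```python
# Bit positions assumed from the GICv3 architecture specification.
GICD_CTLR_DS = 1 << 6         # 1 = security disabled (single security state)
GICD_CTLR_ARE = 1 << 4        # affinity routing enable (ARE_NS in the NS view)
GICD_CTLR_EN_BIT1 = 1 << 1    # group enable, meaning depends on the access view
GICD_CTLR_EN_BIT0 = 1 << 0    # group enable, meaning depends on the access view
GICD_TYPER_SECURITY_EXTN = 1 << 10
GICD_TYPER_ITLINES_MASK = 0x1F


def decode_gicd_ctlr(value):
    """Return the GICD_CTLR flags relevant to the comparison."""
    return {
        "ds": bool(value & GICD_CTLR_DS),
        "are": bool(value & GICD_CTLR_ARE),
        "enable_bit1": bool(value & GICD_CTLR_EN_BIT1),
        "enable_bit0": bool(value & GICD_CTLR_EN_BIT0),
    }


def decode_gicd_typer(value):
    """Return SecurityExtn and the number of INTIDs implied by ITLinesNumber."""
    it_lines = value & GICD_TYPER_ITLINES_MASK
    return {
        "security_extn": bool(value & GICD_TYPER_SECURITY_EXTN),
        "max_intids": 32 * (it_lines + 1),
    }


if __name__ == "__main__":
    for name, ctlr, typer in (
        ("spider-soc", 0x00000012, 0x037A0402),
        ("virt", 0x00000053, 0x037A0007),
    ):
        print(name, decode_gicd_ctlr(ctlr), decode_gicd_typer(typer))
```

If this decoding is right, the spider-soc GIC reports two security states (DS = 0, SecurityExtn = 1), while virt reports a single security state (DS = 1). As far as I understand the architecture, GICR_IGROUPR0 is RAZ/WI for Non-secure accesses when DS is 0, which would match the all-zero read-back, but that is an assumption that still needs to be checked against the QEMU model and the BL31 setup.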

Any thoughts?
As always, your support is much appreciated!

Cheers,
Lior. 

> -----Original Message-----
> From: Ahmad Fatoum <a.fatoum@xxxxxxxxxxxxxx>
> Sent: Friday, June 30, 2023 8:53 AM
> To: Lior Weintraub <liorw@xxxxxxxxxx>; Ahmad Fatoum <ahmad@xxxxxx>;
> barebox@xxxxxxxxxxxxxxxxxxx
> Subject: Re: [PATCH v2] Porting barebox to a new SoC
> 
> Hi Lior,
> 
> On 25.06.23 22:33, Lior Weintraub wrote:
> > Hello Ahmad,
> 
> [Sorry for the delay, we're at EOSS 2023 currently]
> 
> > I failed to reproduce this issue on virt because the addresses and
> > peripherals on virt machine are different and it is difficult to change
> > our code to match that.
> > If you think this is critical I will make extra effort to make it work.
> > AFAIU, this suggestion was made to debug the "conflict" issue.
> 
> It's not critical, but I'd have liked to understand this, so I can check
> if it's perhaps a barebox bug.
> 
> > Currently the workaround I am using is just to set the size of the
> > kernel partition to match the exact size of the "Image" file.
> >
> > The other issue I am facing is that Kernel seems stuck on cpu_do_idle
> > and there is no login prompt from the kernel.
> 
> Does it call into PSCI during idle?
> 
> > As you recall, I am running on a custom QEMU that tries to emulate our
> > platform.
> > I suspect that I did something wrong with the GICv3 and Timers connectivity.
> > The code I used was based on examples I saw on sbsa-ref.c and virt.c.
> > In addition, I declared the GICv3 and timers on our device tree.
> >
> > I running QEMU with "-d int" so I am also getting trace of exceptions
> > and interrupts.
> 
> Nice. Didn't know about this option.
> 
> [snip]
> 
> > Exception return from AArch64 EL3 to AArch64 EL1 PC 0xffffffc00802112c
> > Taking exception 13 [Secure Monitor Call] on CPU 0
> > ...from EL1 to EL3
> > ...with ESR 0x17/0x5e000000
> > ...with ELR 0xffffffc008021640
> > ...to EL3 PC 0x10005400 PSTATE 0x3cd
> > Exception return from AArch64 EL3 to AArch64 EL1 PC 0xffffffc008021640
> 
> Looks fine so far? Doesn't look like it's hanging in EL1.
> 
> [snip]
> 
> > Segment Routing with IPv6
> > In-situ OAM (IOAM) with IPv6
> > sit: IPv6, IPv4 and MPLS over IPv4 tunneling driver
> > NET: Registered PF_PACKET protocol family
> > NET: Registered PF_KEY protocol family
> > NET: Registered PF_VSOCK protocol family
> > registered taskstats version 1
> > clk: Disabling unused clocks
> > Freeing unused kernel memory: 1664K
> 
> Not sure. Normally, I'd try again with pd_ignore_unused clk_ignore_unused
> in the kernel arguments, but I think you define no clocks or power domains
> yet in the DT?
> 
> You can try again with kernel command line option initcall_debug and see
> what the initcall is that is getting stuck. If nothing helps, maybe attach
> a hardware debugger?
> 
> Cheers,
> Ahmad
> 
> --
> Pengutronix e.K.                           |                             |
> Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
> 31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
> Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |