Re: [PATCH] ARM: mach-qcom: fix support for ipq806x

Christian Marangi <ansuelsmth@xxxxxxxxx> · Wed, 17 Jan 2024 23:46:24 +0100

On Wed, Jan 17, 2024 at 02:17:03PM +0100, Christian Marangi wrote:
> On Wed, Oct 26, 2022 at 10:19:21AM +0200, Linus Walleij wrote:
> > On Tue, Oct 25, 2022 at 1:47 AM Christian Marangi <ansuelsmth@xxxxxxxxx> wrote:
> > 
> > > bad news... yesterday I tested this binding and it's problematic. It
> > > does work and the router correctly boot...
> > 
> > That's actually partly good news :D
> 
> Hi,
> sorry for the necroposting but I got some time and wanted to fix and
> bisect this for good since IPQ806x is finally in a better shape and is
> actually modern enough.
> 
> > 
> > > problem is that SMEM is
> > > broken with such configuration... I assume with this binding, by the
> > > system view ram starts from 0x42000000 instead of 0x40000000 and this
> > > cause SMEM to fail probe with the error "SBL didn't init SMEM".
> > 
> > We need to fix this.
> > 
> 
> Totally but I think the problem is more deep...
> 
> > > This is the location of SMEM entry in ram
> > >
> > >                 smem: smem@41000000 {
> > >                         compatible = "qcom,smem";
> > >                         reg = <0x41000000 0x200000>;
> > >                         no-map;
> > >
> > >                         hwlocks = <&sfpb_mutex 3>;
> > >                 };
> > (...)
> > > Wonder if you have other ideas about this.
> > 
> > So the problem is that the resource is outside of the system RAM?
> > 
> > I don't understand why that triggers it since this is per definition not
> > system RAM, it is SMEM after all. And it is no different in esssence
> > from any memory mapped IO or other things that are outside of
> > the system RAM.
> > 
> > The SMEM node is special since it is created without children thanks
> > to the hack in drivers/of/platform.c.
> > 
> > Then the driver in drivers/soc/qcom/smem.c
> > contains things like this:
> > 
> >         rmem = of_reserved_mem_lookup(pdev->dev.of_node);
> >         if (rmem) {
> >                 smem->regions[0].aux_base = rmem->base;
> >                 smem->regions[0].size = rmem->size;
> >         } else {
> >                 /*
> >                  * Fall back to the memory-region reference, if we're not a
> >                  * reserved-memory node.
> >                  */
> >                 ret = qcom_smem_resolve_mem(smem, "memory-region",
> > &smem->regions[0]);
> >                 if (ret)
> >                         return ret;
> >         }
> > 
> > However it is treated as memory-mapped IO later:
> > 
> >         for (i = 1; i < num_regions; i++) {
> >                 smem->regions[i].virt_base = devm_ioremap_wc(&pdev->dev,
> > 
> > smem->regions[i].aux_base,
> > 
> > smem->regions[i].size);
> >                 if (!smem->regions[i].virt_base) {
> >                         dev_err(&pdev->dev, "failed to remap %pa\n",
> > &smem->regions[i].aux_base);
> >                         return -ENOMEM;
> >                 }
> >         }
> > 
> > As a first hack I would check:
> > 
> > 1. Is it the of_reserved_mem_lookup() or qcom_smem_resolve_smem() stuff
> >    in drivers/soc/qcom/smem.c that is failing?
> > 
> > If yes then:
> > 
> > 2. Add a fallback path just using of_iomap(node) for aux_base and size
> >   with some comment like /* smem is outside of the main memory map */
> >   and see if that works.
> >
> 
> I think we got confused and we didn't read the code correctly. The
> error is "SMEM is not initialized by SBL" that is triggered by...
> 
> 	header = smem->regions[0].virt_base;
> 	if (le32_to_cpu(header->initialized) != 1 ||
> 	    le32_to_cpu(header->reserved)) {
> 		dev_err(&pdev->dev, "SMEM is not initialized by SBL\n",);
> 		return -EINVAL;
> 	}
> 
> I verified correctly that aux_base and size are the correct values
> 0x41000000 and 0x200000. And from what I can see they get correctly
> iomapped.
> 
> Problem is that initialized and reserved have garbage in it. (not random
> data tho but everytime the same data)
> 
> My theory is that somehow the loader is still writing data there but I'm
> a bit lost on how to verify that. (the fact that the data in those
> values is always the same with the same compiled image makes me think
> it's actually just loaded data)
> 
> I also tested with disabling the CONFIG_ARM_ATAG_DTB_COMPAT flag but I
> have the same result.
> 
> What I'm using is this memory node
> 
> 	memory@0 {
> 		reg = <0x42000000 0x1e000000>;
> 		device_type = "memory";
> 	};
> 
> And in chosed I have
> 
> 	chosen {
> 		bootargs = "earlycon";
>                 linux,usable-memory-range = <0x42000000 0x10000000>;
> 	};
> 
> (the size is different just for the sake of it but it should not cause
> problem right?)
> 
> Maybe there is a way to make the SMEM reclaim those RAM space and reinit
> it? (it's a workaround tho)
> 
> Also with the current situation the kernel panics with... But I assume
> this is caused by SMEM malfunctioning (the panic happen right after rpm
> init when the RPM regulators are getting init. Looking at the affected
> codes maybe it's failing at the "Free unused pages" stage?
> 
> [    1.912392] 8<--- cut here ---
> [    1.912431] Unable to handle kernel NULL pointer dereference at virtual address 00000000
> [    1.914356] [00000000] *pgd=00000000
> [    1.922676] Internal error: Oops: 80000007 [#1] SMP ARM
> [    1.926158] Modules linked in:
> [    1.931103] CPU: 1 PID: 84 Comm: modprobe Not tainted 6.1.65 #0
> [    1.934229] Hardware name: Generic DT based system
> [    1.940045] PC is at 0x0
> [    1.944902] LR is at release_pages+0x114/0x36c
> [    1.947595] pc : [<00000000>]    lr : [<c04298dc>]    psr: 40000013
> [    1.951851] sp : c27abe18  ip : c13cd5c1  fp : c27abe38
> [    1.958012] r10: 0000009c  r9 : c4018268  r8 : 00000005
> [    1.963220] r7 : c243f400  r6 : c243f400  r5 : 00000098  r4 : df992b54
> [    1.968431] r3 : 00000000  r2 : 00000000  r1 : 60000013  r0 : df992b54
> [    1.975029] Flags: nZcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
> [    1.981543] Control: 10c5787d  Table: 4367806a  DAC: 00000051
> [    1.988744] Register r0 information: non-slab/vmalloc memory
> [    1.994472] Register r1 information: non-paged memory
> [    2.000200] Register r2 information: NULL pointer
> [    2.005148] Register r3 information: NULL pointer
> [    2.009834] Register r4 information: non-slab/vmalloc memory
> [    2.014525] Register r5 information: non-paged memory
> [    2.020252] Register r6 information: slab kmalloc-1k start c243f400 pointer offset 0 size 1024
> [    2.025206] Register r7 information: slab kmalloc-1k start c243f400 pointer offset 0 size 1024
> [    2.033714] Register r8 information: non-paged memory
> [    2.042301] Register r9 information: non-slab/vmalloc memory
> [    2.047424] Register r10 information: non-paged memory
> [    2.053152] Register r11 information: non-slab/vmalloc memory
> [    2.058100] Register r12 information: non-paged memory
> [    2.063915] Process modprobe (pid: 84, stack limit = 0x(ptrval))
> [    2.068953] Stack: (0xc27abe18 to 0xc27ac000)
> [    2.075115] be00:                                                       00000000 00000000
> [    2.079378] be20: c147514c ffefffcf 00000000 00000000 0000009c 60000013 dfa12928 dfa12b44
> [    2.087537] be40: c27abf24 0000009c c4018000 c401800c c27abf0c c27abf24 00000000 000000f8
> [    2.095697] be60: 00000000 c045b248 ffffffff c27abf0c c35d1400 00000000 c35d1438 c045b4f8
> [    2.103858] be80: c27abf0c 00002000 00000000 c044fb14 00000000 c0b6c2bc c35d1400 ffffffff
> [    2.112016] bea0: ffffffff c35a4c0c 00000000 ffffffff 00000000 00001c01 00000000 c3591510
> [    2.120176] bec0: 00000000 c35d1400 ffffffff c3591510 00000000 c35d1400 00000000 c0458f30
> [    2.128336] bee0: 00000000 c08f35c8 c36ebf00 c35d1400 00010000 00013fff c35a4c0c 00000000
> [    2.136496] bf00: ffffffff 00000000 00000101 c35d1400 ffffffff ffffffff c2420501 00000001
> [    2.144656] bf20: c4018000 c4018000 00000000 00000008 dfde733c dfde7360 dfde7384 dfde73a8
> [    2.152815] bf40: dfa12a44 dfa12948 dfa129d8 dfa12ad4 c35d1400 00000000 c35d1438 00000698
> [    2.160976] bf60: c27abf78 c0318a34 c35d1400 c2731000 c35d1438 c0320604 0000ff00 c258ea00
> [    2.169136] bf80: c2731000 c2456f40 c03002c4 c2456f40 00000000 c0320e0c 000000f8 c0320e6c
> [    2.177294] bfa0: ffffffff c0300060 ffffffff bed38eb4 ffffffff bed38dcc 00000000 ffffffff
> [    2.185455] bfc0: ffffffff bed38eb4 00010f60 000000f8 6474e552 00000020 00000000 00000000
> [    2.193614] bfe0: 6ffffff9 bed38e78 b6f91f1c b6fa4a44 60000010 ffffffff 00000000 00000000
> [    2.201777]  release_pages from tlb_batch_pages_flush+0x3c/0x70
> [    2.209927]  tlb_batch_pages_flush from tlb_finish_mmu+0x4c/0x130
> [    2.215656]  tlb_finish_mmu from exit_mmap+0xec/0x1e0
> [    2.221903]  exit_mmap from mmput+0x40/0x120
> [    2.226939]  mmput from do_exit+0x238/0x890
> [    2.231279]  do_exit from do_group_exit+0x34/0x84
> [    2.235184]  do_group_exit from __wake_up_parent+0x0/0x18
> [    2.240053] Code: bad PC value
> [    2.245556] ---[ end trace 0000000000000000 ]---
> [    2.248448] Kernel panic - not syncing: Fatal exception
> [    2.253158] CPU0: stopping
> [    2.253169] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G      D            6.1.65 #0
> [    2.253180] Hardware name: Generic DT based system
> [    2.253189]  unwind_backtrace from show_stack+0x10/0x14
> [    2.253216]  show_stack from dump_stack_lvl+0x40/0x4c
> [    2.253249]  dump_stack_lvl from do_handle_IPI+0xf0/0x124
> [    2.253276]  do_handle_IPI from ipi_handler+0x18/0x20
> [    2.253293]  ipi_handler from handle_percpu_devid_irq+0x78/0x134
> [    2.253313]  handle_percpu_devid_irq from generic_handle_domain_irq+0x28/0x38
> [    2.253338]  generic_handle_domain_irq from gic_handle_irq+0x74/0x88
> [    2.253361]  gic_handle_irq from generic_handle_arch_irq+0x34/0x44
> [    2.253391]  generic_handle_arch_irq from call_with_stack+0x18/0x20
> [    2.253419]  call_with_stack from __irq_svc+0x80/0x98
> [    2.253438] Exception stack(0xc1401f00 to 0xc1401f48)
> [    2.253451] 1f00: 00000005 00000000 00000a61 c03128a0 c1408640 00000000 c1404f68 c1404fa4
> [    2.253461] 1f20: 00000000 c13c9c38 00000000 00000000 c14c1f00 c1401f50 c0307148 c030714c
> [    2.253467] 1f40: 60000013 ffffffff
> [    2.253474]  __irq_svc from arch_cpu_idle+0x38/0x3c
> [    2.253500]  arch_cpu_idle from default_idle_call+0x24/0x34
> [    2.253526]  default_idle_call from do_idle+0x1ec/0x240
> [    2.253545]  do_idle from cpu_startup_entry+0x28/0x2c
> [    2.253559]  cpu_startup_entry from kernel_init+0x0/0x12c
> [    2.376160] Rebooting in 1 seconds..
>

Some followup on this... I manage to enable DEBUG_LL and can have debug
output from the decompressor...

>From what I can see fdt_check_mem_start is not called at all...

What I'm using with kernel config are:
CONFIG_ARM_APPENDED_DTB=y
CONFIG_ARM_ATAG_DTB_COMPAT=y
And a downstream patch that mangle all the atags and takes only the
cmdline one.

The load and entry point is:
0x42208000

With the current setup I have this (I also added some debug log that
print what is actually passed to do decompress

DTB:0x42AED270 (0x00008BA7)
Uncompressing Linux...
40208000 
4220F10C done, booting the kernel.

Where 40208000 is the value of output_start and 4220F10C is input_data.

And I think this confirm that it's getting loaded in the wrong position
actually in reserved memory... But how this is possible??? Hope can
someone help me in this since I wasted the entire day with this and
didn't manage to make any progress... aside from having fun with the
head.S assembly code.

-- 
	Ansuel