On 24.11.2023 13:45, Marek Szyprowski wrote:
> On 22.11.2023 10:29, Krzysztof Kozlowski wrote:
>> On 22/11/2023 10:06, AngeloGioacchino Del Regno wrote:
>>>>>> Hey Krzysztof,
>>>>>>
>>>>>> This is interesting. It might be about the cores that are missing
>>>>>> from the partial core_mask raising interrupts, but an external
>>>>>> abort on non-linefetch is strange to see here.
>>>>> I've seen such external aborts in the past, and the fault type has
>>>>> often been misleading. It's unlikely to have anything to do with a
>>>> Yeah, often accessing device with power or clocks gated.
>>>>
>>> Except my commit does *not* gate SoC power, nor SoC clocks 🙂
>> It could be that something (like clocks or power supplies) was missing
>> on this board/SoC, which was not critical till your patch came.
>>
>>> What the "Really power off ..." commit does is to ask the GPU to
>>> internally power off the shaders, tilers and L2, that's why I say
>>> that it is strange to see that kind of abort.
>>>
>>> The GPU_INT_CLEAR GPU_INT_STAT, GPU_FAULT_STATUS and
>>> GPU_FAULT_ADDRESS_{HI/LO} registers should still be accessible even
>>> with shaders, tilers and cache OFF.
>>>
>>> Anyway, yes, synchronizing IRQs before calling the poweroff sequence
>>> would also work, but that'd add up quite a bit of latency on the
>>> runtime_suspend() call, so in this case I'd be more for avoiding to
>>> execute any register r/w in the handler by either checking if the
>>> GPU is supposed to be OFF, or clearing interrupts, which may not
>>> work if those are generated after the execution of the poweroff
>>> function.
>>> Or we could simply disable the irq after power_off, but that'd be
>>> hacky (as well).
>>>
>>>
>>> Let's see if asking to poweroff *everything* works:
>> Worked.
>
> Yes, I also got into this issue some time ago, but I didn't report it
> because I also had some power supply related problems on my test farm
> and everything was a bit unstable.
> I wasn't 100% sure that the
> $subject patch is responsible for the observed issues. Now, after
> fixing power supply, I confirm that the issue was revealed by the
> $subject patch and above mentioned change fixes the problem. Feel free
> to add:
>
> Tested-by: Marek Szyprowski <m.szyprowski@xxxxxxxxxxx>

I must revoke my Tested-by tag for the above fix alone. Although it
fixed the boot issue and the system stability issue, it looks like
something is still missing: opening the panfrost DRI device causes a
system crash:

root@target:~# ./modetest -C
trying to open device 'i915'...failed
trying to open device 'amdgpu'...failed
trying to open device 'radeon'...failed
trying to open device 'nouveau'...failed
trying to open device 'vmwgfx'...failed
trying to open device 'omapdrm'...failed
trying to open device 'exynos'...done
root@target:~# 8<--- cut here ---
Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0c6803c
[f0c6803c] *pgd=42d87811, *pte=11800653, *ppte=11800453
Internal error: : 1008 [#1] PREEMPT SMP ARM
Modules linked in: exynos_gsc s5p_mfc s5p_jpeg v4l2_mem2mem videobuf2_dma_contig videobuf2_memops videobuf2_v4l2 videobuf2_common videodev mc s5p_cec
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.7.0-rc2-next-20231127-00055-ge14abcb527d6 #7649
Hardware name: Samsung Exynos (Flattened Device Tree)
PC is at panfrost_gpu_irq_handler+0x18/0xfc
LR is at __handle_irq_event_percpu+0xcc/0x31c
...
Process swapper/0 (pid: 0, stack limit = 0x0e2875ff)
Stack: (0xc1301e48 to 0xc1302000)
...
panfrost_gpu_irq_handler from __handle_irq_event_percpu+0xcc/0x31c
__handle_irq_event_percpu from handle_irq_event+0x38/0x80
handle_irq_event from handle_fasteoi_irq+0x9c/0x250
handle_fasteoi_irq from generic_handle_domain_irq+0x24/0x34
generic_handle_domain_irq from gic_handle_irq+0x88/0xa8
gic_handle_irq from generic_handle_arch_irq+0x34/0x44
generic_handle_arch_irq from __irq_svc+0x8c/0xd0
Exception stack(0xc1301f10 to 0xc1301f58)
...
__irq_svc from default_idle_call+0x20/0x2c4
default_idle_call from do_idle+0x244/0x2b4
do_idle from cpu_startup_entry+0x28/0x2c
cpu_startup_entry from rest_init+0xec/0x190
rest_init from arch_post_acpi_subsys_init+0x0/0x8
Code: e591300c e593402c f57ff04f e591300c (e593903c)
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Fatal exception in interrupt
CPU2: stopping

It looks like the panfrost interrupts must somehow be synchronized with
turning the power off, which has already been discussed. Let me know if
you want me to test any patch.

Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland