Re: [PATCH] drm/panfrost: Really power off GPU cores in panfrost_gpu_power_off()

AngeloGioacchino Del Regno <angelogioacchino.delregno@xxxxxxxxxxxxx> · Mon, 27 Nov 2023 12:26:52 +0100

On 27/11/23 12:24, Marek Szyprowski wrote:
On 24.11.2023 13:45, Marek Szyprowski wrote:
On 22.11.2023 10:29, Krzysztof Kozlowski wrote:
On 22/11/2023 10:06, AngeloGioacchino Del Regno wrote:
Hey Krzysztof,

This is interesting. It might be about the cores that are missing from
the partial core_mask raising interrupts, but an external abort on
non-linefetch is strange to see here.
I've seen such external aborts in the past, and the fault type has
often been misleading. It's unlikely to have anything to do with a
Yeah, often accessing device with power or clocks gated.

Except my commit does *not* gate SoC power, nor SoC clocks 🙂
It could be that something (like clocks or power supplies) was missing
on this board/SoC, which was not critical till your patch came.

What the "Really power off ..." commit does is to ask the GPU to
internally power off the shaders, tilers and L2, that's why I say that
it is strange to see that kind of abort.

The GPU_INT_CLEAR, GPU_INT_STAT, GPU_FAULT_STATUS and
GPU_FAULT_ADDRESS_{HI/LO} registers should still be accessible even
with shaders, tilers and cache OFF.

Anyway, yes, synchronizing IRQs before calling the poweroff sequence
would also work, but that'd add quite a bit of latency to the
runtime_suspend() call, so in this case I'd rather avoid executing any
register r/w in the handler, either by checking if the GPU is supposed
to be OFF or by clearing interrupts, which may not work if those are
generated after the execution of the poweroff function.
Or we could simply disable the irq after power_off, but that'd be
hacky (as well).
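
To illustrate the "check if the GPU is supposed to be OFF" option, here
is a minimal user-space sketch in C. It is not the panfrost code: every
name in it (fake_gpu, fake_gpu_irq_handler, fake_gpu_suspend, and so on)
is hypothetical, and the int_stat field only models GPU_INT_STAT. The
assumption is that a flag is set before the power-off sequence starts
and that the handler tests it before any register access.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the device state; not the panfrost structs. */
struct fake_gpu {
	atomic_bool suspended;     /* set before the power-off sequence starts */
	bool powered;              /* models whether register access is safe */
	uint32_t int_stat;         /* models GPU_INT_STAT */
	unsigned int unsafe_reads; /* register reads made while unpowered */
};

enum irq_result { IRQ_RESULT_NONE, IRQ_RESULT_HANDLED };

/* Models a register read; counts reads that happen with power off. */
uint32_t fake_read_int_stat(struct fake_gpu *gpu)
{
	if (!gpu->powered)
		gpu->unsafe_reads++; /* real hardware might raise an abort here */
	return gpu->int_stat;
}

enum irq_result fake_gpu_irq_handler(struct fake_gpu *gpu)
{
	uint32_t stat;

	/* Bail out before touching any register if we are suspended. */
	if (atomic_load(&gpu->suspended))
		return IRQ_RESULT_NONE;

	stat = fake_read_int_stat(gpu);
	if (!stat)
		return IRQ_RESULT_NONE;

	gpu->int_stat = 0; /* models a write to GPU_INT_CLEAR */
	return IRQ_RESULT_HANDLED;
}

void fake_gpu_suspend(struct fake_gpu *gpu)
{
	/* Publish the flag first, then cut power. */
	atomic_store(&gpu->suspended, true);
	gpu->powered = false;
}

void fake_gpu_resume(struct fake_gpu *gpu)
{
	/* Restore power first, then let the handler run again. */
	gpu->powered = true;
	atomic_store(&gpu->suspended, false);
}
```

Note that the flag alone does not cover a handler that is already
running when the suspend path starts. A real driver would still have to
wait for in-flight handlers before cutting power, which is where the
latency concern above comes from.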

Let's see if asking to poweroff *everything* works:
Worked.

Yes, I also ran into this issue some time ago, but I didn't report it
because I also had some power supply related problems on my test farm
and everything was a bit unstable. I wasn't 100% sure that the
$subject patch was responsible for the observed issues. Now, after
fixing the power supply, I confirm that the issue was revealed by the
$subject patch and that the above-mentioned change fixes the problem.
Feel free to add:

Tested-by: Marek Szyprowski <m.szyprowski@xxxxxxxxxxx>

I must revoke my tested-by tag for the above fix alone. Although it
fixed the boot issue and the system stability issue, it looks like
something is still missing: opening the panfrost dri device causes a
system crash:

root@target:~# ./modetest -C
trying to open device 'i915'...failed
trying to open device 'amdgpu'...failed
trying to open device 'radeon'...failed
trying to open device 'nouveau'...failed
trying to open device 'vmwgfx'...failed
trying to open device 'omapdrm'...failed
trying to open device 'exynos'...done
root@target:~#

8<--- cut here ---
Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0c6803c
[f0c6803c] *pgd=42d87811, *pte=11800653, *ppte=11800453
Internal error: : 1008 [#1] PREEMPT SMP ARM
Modules linked in: exynos_gsc s5p_mfc s5p_jpeg v4l2_mem2mem videobuf2_dma_contig videobuf2_memops videobuf2_v4l2 videobuf2_common videodev mc s5p_cec
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.7.0-rc2-next-20231127-00055-ge14abcb527d6 #7649
Hardware name: Samsung Exynos (Flattened Device Tree)
PC is at panfrost_gpu_irq_handler+0x18/0xfc
LR is at __handle_irq_event_percpu+0xcc/0x31c
...
Process swapper/0 (pid: 0, stack limit = 0x0e2875ff)
Stack: (0xc1301e48 to 0xc1302000)
...
   panfrost_gpu_irq_handler from __handle_irq_event_percpu+0xcc/0x31c
   __handle_irq_event_percpu from handle_irq_event+0x38/0x80
   handle_irq_event from handle_fasteoi_irq+0x9c/0x250
   handle_fasteoi_irq from generic_handle_domain_irq+0x24/0x34
   generic_handle_domain_irq from gic_handle_irq+0x88/0xa8
   gic_handle_irq from generic_handle_arch_irq+0x34/0x44
   generic_handle_arch_irq from __irq_svc+0x8c/0xd0
Exception stack(0xc1301f10 to 0xc1301f58)
...
   __irq_svc from default_idle_call+0x20/0x2c4
   default_idle_call from do_idle+0x244/0x2b4
   do_idle from cpu_startup_entry+0x28/0x2c
   cpu_startup_entry from rest_init+0xec/0x190
   rest_init from arch_post_acpi_subsys_init+0x0/0x8
Code: e591300c e593402c f57ff04f e591300c (e593903c)
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Fatal exception in interrupt
CPU2: stopping

It looks like the panfrost interrupts must somehow be synchronized with
turning the power off, which has already been discussed. Let me know if
you want me to test any patch.

The new series containing the whole interrupt sync code is almost ready;
I'm currently testing it on my machines here.

I should be able to send it today or tomorrow.

Cheers,
Angelo