Re: amdgpu display corruption and hang on AMD A10-9620P

Carlo Caione <carlo@xxxxxxxxxxxx> · Thu, 15 Jun 2017 08:46:16 +0200

On Mon, Jun 12, 2017 at 12:24 PM, Carlo Caione <carlo@xxxxxxxxxxxx> wrote:
> On Tue, May 9, 2017 at 7:03 PM, Deucher, Alexander
> <Alexander.Deucher@xxxxxxx> wrote:
>>> -----Original Message-----
>>> From: Daniel Drake [mailto:drake@xxxxxxxxxxxx]
>>> Sent: Tuesday, May 09, 2017 12:55 PM
>>> To: dri-devel; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander
>>> Cc: Chris Chiu; Linux Upstreaming Team
>>> Subject: amdgpu display corruption and hang on AMD A10-9620P
>>>
>>> Hi,
>>>
>>> We are working with new laptops that have the AMD Bristol Ridge
>>> chipset with this SoC:
>>>
>>> AMD A10-9620P RADEON R5, 10 COMPUTE CORES 4C+6G
>>>
>>> I think this is the Bristol Ridge chipset.
>>>
>>> During boot, the display becomes unusable at the point where the
>>> amdgpu driver loads. You can see at least two horizontal lines of
>>> garbage at this point. We have reproduced on 4.8, 4.10 and linus
>>> master (early 4.12).
>>>
>>> Photo: http://pasteboard.co/qrC9mh4p.jpg
>>>
>>> Getting logs is tricky because the system appears to freeze at that point.
>>>
>>> Is this a known issue? Anything we can do to help diagnosis?
>>
>> I'm not aware of any specific issues.  Please file a bug and attach your logs (https://bugs.freedesktop.org) along with information about the system.
>
> Opened https://bugs.freedesktop.org/show_bug.cgi?id=101387 to trace
> this bug. I also have attached there the full log we get when
> modprobing amdgpu.
> Reporting here only the trace for the sake of documentation (full log
> attached to the bug opened on freedesktop)
>
> [   80.766937] ---[ end Kernel panic - not syncing: stack-protector:
> Kernel stack is corrupted in: ffffffffc0c88942
> [   80.766937]
> [   80.766408] Kernel panic - not syncing: stack-protector: Kernel
> stack is corrupted in: ffffffffc0c88942
> [   80.766408]
> [   80.766428] CPU: 1 PID: 1594 Comm: modprobe Not tainted 4.11.3+ #2
> [   80.766431] Hardware name: Acer Aspire A515-41G/Wartortle_BS, BIOS
> V0.09 04/19/2017
> [   80.766434] Call Trace:
> [   80.766445]  dump_stack+0x63/0x90
> [   80.766451]  panic+0xe8/0x236
> [   80.766526]  ? amdgpu_atombios_crtc_powergate_init+0x52/0x60 [amdgpu]
> [   80.766537]  __stack_chk_fail+0x1b/0x20
> [   80.766571]  amdgpu_atombios_crtc_powergate_init+0x52/0x60 [amdgpu]
> [   80.766610]  dce_v11_0_hw_init+0x3e/0x2d0 [amdgpu]
> [   80.766643]  amdgpu_device_init+0xe23/0x13c0 [amdgpu]
> [   80.766647]  ? kmalloc_order+0x18/0x40
> [   80.766650]  ? kmalloc_order_trace+0x24/0xa0
> [   80.766683]  amdgpu_driver_load_kms+0x5d/0x240 [amdgpu]
> [   80.766708]  drm_dev_register+0x148/0x1e0 [drm]
> [   80.766721]  drm_get_pci_dev+0xa0/0x160 [drm]
> [   80.766754]  amdgpu_pci_probe+0xb9/0xf0 [amdgpu]
> [   80.766759]  local_pci_probe+0x45/0xa0
> [   80.766762]  pci_device_probe+0xf4/0x150
> [   80.766768]  driver_probe_device+0x2c5/0x470
> [   80.766772]  __driver_attach+0xdf/0xf0
> [   80.766776]  ? driver_probe_device+0x470/0x470
> [   80.766780]  bus_for_each_dev+0x6c/0xc0
> [   80.766784]  driver_attach+0x1e/0x20
> [   80.766787]  bus_add_driver+0x45/0x270
> [   80.766790]  ? 0xffffffffc09a8000
> [   80.766794]  driver_register+0x60/0xe0
> [   80.766796]  ? 0xffffffffc09a8000
> [   80.766799]  __pci_register_driver+0x4c/0x50
> [   80.766811]  drm_pci_init+0xed/0x100 [drm]
> [   80.766816]  ? vga_switcheroo_register_handler+0x6c/0x90
> [   80.766819]  ? 0xffffffffc09a8000
> [   80.766850]  amdgpu_init+0x9b/0xac [amdgpu]
> [   80.766855]  do_one_initcall+0x53/0x1c0
> [   80.766860]  ? __vunmap+0x81/0xd0
> [   80.766865]  ? kmem_cache_alloc_trace+0xdb/0x1b0
> [   80.766868]  ? kfree+0x161/0x170
> [   80.766876]  do_init_module+0x60/0x202
> [   80.766881]  load_module+0x2612/0x29f0
> [   80.766885]  SYSC_finit_module+0xa6/0xf0
> [   80.766888]  ? SYSC_finit_module+0xa6/0xf0
> [   80.766892]  SyS_finit_module+0xe/0x10
> [   80.766896]  entry_SYSCALL_64_fastpath+0x1e/0xad
> [   80.766899] RIP: 0033:0x7fa525e60709
> [   80.766902] RSP: 002b:00007fff2f5bbbf8 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000139
> [   80.766905] RAX: ffffffffffffffda RBX: 00007fa526129760 RCX: 00007fa525e60709
> [   80.766908] RDX: 0000000000000000 RSI: 000055f51f1c9439 RDI: 000000000000000b
> [   80.766910] RBP: 0000000000000070 R08: 0000000000000000 R09: 000055f51fcd83f0
> [   80.766913] R10: 000000000000000b R11: 0000000000000246 R12: 000055f51fcd9ff0
> [   80.766915] R13: 0000000000000007 R14: 00007fa5261297b8 R15: 0000000000002710
> [   80.766931] Kernel Offset: 0x22800000 from 0xffffffff81000000
> (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [   80.766937] ---[ end Kernel panic - not syncing: stack-protector:
> Kernel stack is corrupted in: ffffffffc0c88942

I'm moving this discussion here for more visibility. Here is what is
happening.

In amdgpu_atombios_crtc_powergate_init() we declare
ENABLE_DISP_POWER_GATING_PARAMETERS_V2_1 args as the parameter space.
This struct is 32 bytes wide and is passed down to the atombios
interpreter as ctx->ps.

Calling amdgpu_atombios_crtc_powergate_init() triggers the parsing of
the command table with index == 13 [>> execute C5C0 (len 589, WS 0,
PS 0)]. While this table runs, several CALL_TABLE opcodes (op == 82)
are executed. In more detail, we first jump to the table with
index == 78 [>> execute F166 (len 588, WS 0, PS 8)], then to the table
with index == 51 [>> execute F446 (len 465, WS 4, PS 4)] and to the
table with index == 75 [>> execute F6CC (len 1330, WS 4, PS 0)] before
finally reaching the EOT for table 13. By the time we return to
amdgpu_atombios_crtc_powergate_init() the stack is already corrupted.

The corruption happens while executing the code in table 75
[>> execute F6CC (len 1330, WS 4, PS 0)]. This table executes a
MOVE_PS with a destination index == 1, which writes to ctx->ps[idx]
and corrupts the stack.
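
To make the failure mode concrete, here is a minimal model of a
bounds-checked store into the parameter space. It is only a sketch and
not the kernel's atom.c: struct ps_window and ps_store_checked() are
names I made up for illustration, and it assumes the interpreter knows
how many dwords the current table is allowed to touch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model, not kernel code: a parameter space (PS) window
 * described by a base pointer and a length in 32-bit words.
 */
struct ps_window {
	uint32_t *base;
	size_t ndwords;
};

/*
 * Store val at ps[idx]. Returns 0 on success, or -1 without writing
 * anything when idx falls outside the window.
 */
static int ps_store_checked(struct ps_window *ps, size_t idx, uint32_t val)
{
	if (idx >= ps->ndwords)
		return -1;
	ps->base[idx] = val;
	return 0;
}
```

With a 32-byte buffer (8 dwords) a store to index 1 is accepted, while
the same store through a window of length 0 (a table that declares
PS == 0) is rejected instead of writing past the buffer.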

My first guess is that something is wrong in the atombios code.
Table 75 declares WS == 4 and PS == 0, and its opcodes are almost all
*_WS operations (MOVE_WS, TEST_WS, ADD_WS, etc.). There are only two
*_PS instructions (MOVE_PS and OR_PS), and those are exactly the ones
causing the stack corruption. So I suspect the *_PS opcodes in the
atombios are wrong and should actually be *_WS opcodes.

Another possibility is that the atombios interpreter is doing
something wrong. Shouldn't we allocate the parameter space (ctx->ps)
for the command table we are about to execute after a CALL_TABLE so
that it matches the PS size in that table's header? If I understand
the kernel code correctly, ctx->ps is not reallocated when we jump to
a different table.
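
Roughly what I have in mind for the second option, as a standalone
sketch rather than a patch. Everything here is hypothetical
(table_body_fn, example_body() and call_table_private_ps() do not
exist in the kernel). It assumes the callee's PS size in bytes is
available from its table header and that parameters are copied in and
back out around the call.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical callee: gets its own PS buffer and its length in dwords. */
typedef int (*table_body_fn)(uint32_t *ps, size_t ps_dwords);

/* Example callee: bumps ps[0] if it has any PS, returns its PS length. */
static int example_body(uint32_t *ps, size_t ps_dwords)
{
	if (ps_dwords > 0)
		ps[0] += 1;
	return (int)ps_dwords;
}

/*
 * Run body with a private PS buffer sized from the callee's header
 * (callee_ps_bytes), seeded from the caller's parameters and copied
 * back afterwards. Returns -1 on allocation failure, otherwise the
 * callee's return value.
 */
static int call_table_private_ps(table_body_fn body, size_t callee_ps_bytes,
				 uint32_t *caller_ps, size_t caller_ps_dwords)
{
	size_t n = callee_ps_bytes / sizeof(uint32_t);
	size_t ncopy = n < caller_ps_dwords ? n : caller_ps_dwords;
	/* calloc(0, ...) may return NULL, so allocate at least one dword. */
	uint32_t *ps = calloc(n ? n : 1, sizeof(*ps));
	int ret;

	if (!ps)
		return -1;
	if (ncopy)
		memcpy(ps, caller_ps, ncopy * sizeof(*ps));
	ret = body(ps, n);
	if (ncopy)
		memcpy(caller_ps, ps, ncopy * sizeof(*ps));
	free(ps);
	return ret;
}
```

Note that this alone would not make a store to ps[1] valid in a table
with PS == 0. The callee still has to respect the length it is given,
so this only helps together with some form of bounds checking.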

Thanks,

-- 
Carlo Caione  |  +39.340.80.30.096  |  Endless