On Mon, Jun 12, 2017 at 12:24 PM, Carlo Caione <carlo at endlessm.com> wrote:
> On Tue, May 9, 2017 at 7:03 PM, Deucher, Alexander
> <Alexander.Deucher at amd.com> wrote:
>>> -----Original Message-----
>>> From: Daniel Drake [mailto:drake at endlessm.com]
>>> Sent: Tuesday, May 09, 2017 12:55 PM
>>> To: dri-devel; amd-gfx at lists.freedesktop.org; Deucher, Alexander
>>> Cc: Chris Chiu; Linux Upstreaming Team
>>> Subject: amdgpu display corruption and hang on AMD A10-9620P
>>>
>>> Hi,
>>>
>>> We are working with new laptops that have the AMD Bristol Ridge
>>> chipset with this SoC:
>>>
>>> AMD A10-9620P RADEON R5, 10 COMPUTE CORES 4C+6G
>>>
>>> I think this is the Bristol Ridge chipset.
>>>
>>> During boot, the display becomes unusable at the point where the
>>> amdgpu driver loads. You can see at least two horizontal lines of
>>> garbage at this point. We have reproduced on 4.8, 4.10 and linus
>>> master (early 4.12).
>>>
>>> Photo: http://pasteboard.co/qrC9mh4p.jpg
>>>
>>> Getting logs is tricky because the system appears to freeze at that point.
>>>
>>> Is this a known issue? Anything we can do to help diagnosis?
>>
>> I'm not aware of any specific issues. Please file a bug and attach your logs (https://bugs.freedesktop.org) along with information about the system.
>
> Opened https://bugs.freedesktop.org/show_bug.cgi?id=101387 to trace
> this bug. I also have attached there the full log we get when
> modprobing amdgpu.
> Reporting here only the trace for the sake of documentation (full log
> attached to the bug opened on freedesktop)
>
> [ 80.766937] ---[ end Kernel panic - not syncing: stack-protector:
> Kernel stack is corrupted in: ffffffffc0c88942
> [ 80.766937]
> [ 80.766408] Kernel panic - not syncing: stack-protector: Kernel
> stack is corrupted in: ffffffffc0c88942
> [ 80.766408]
> [ 80.766428] CPU: 1 PID: 1594 Comm: modprobe Not tainted 4.11.3+ #2
> [ 80.766431] Hardware name: Acer Aspire A515-41G/Wartortle_BS, BIOS
> V0.09 04/19/2017
> [ 80.766434] Call Trace:
> [ 80.766445] dump_stack+0x63/0x90
> [ 80.766451] panic+0xe8/0x236
> [ 80.766526] ? amdgpu_atombios_crtc_powergate_init+0x52/0x60 [amdgpu]
> [ 80.766537] __stack_chk_fail+0x1b/0x20
> [ 80.766571] amdgpu_atombios_crtc_powergate_init+0x52/0x60 [amdgpu]
> [ 80.766610] dce_v11_0_hw_init+0x3e/0x2d0 [amdgpu]
> [ 80.766643] amdgpu_device_init+0xe23/0x13c0 [amdgpu]
> [ 80.766647] ? kmalloc_order+0x18/0x40
> [ 80.766650] ? kmalloc_order_trace+0x24/0xa0
> [ 80.766683] amdgpu_driver_load_kms+0x5d/0x240 [amdgpu]
> [ 80.766708] drm_dev_register+0x148/0x1e0 [drm]
> [ 80.766721] drm_get_pci_dev+0xa0/0x160 [drm]
> [ 80.766754] amdgpu_pci_probe+0xb9/0xf0 [amdgpu]
> [ 80.766759] local_pci_probe+0x45/0xa0
> [ 80.766762] pci_device_probe+0xf4/0x150
> [ 80.766768] driver_probe_device+0x2c5/0x470
> [ 80.766772] __driver_attach+0xdf/0xf0
> [ 80.766776] ? driver_probe_device+0x470/0x470
> [ 80.766780] bus_for_each_dev+0x6c/0xc0
> [ 80.766784] driver_attach+0x1e/0x20
> [ 80.766787] bus_add_driver+0x45/0x270
> [ 80.766790] ? 0xffffffffc09a8000
> [ 80.766794] driver_register+0x60/0xe0
> [ 80.766796] ? 0xffffffffc09a8000
> [ 80.766799] __pci_register_driver+0x4c/0x50
> [ 80.766811] drm_pci_init+0xed/0x100 [drm]
> [ 80.766816] ? vga_switcheroo_register_handler+0x6c/0x90
> [ 80.766819] ? 0xffffffffc09a8000
> [ 80.766850] amdgpu_init+0x9b/0xac [amdgpu]
> [ 80.766855] do_one_initcall+0x53/0x1c0
> [ 80.766860] ? __vunmap+0x81/0xd0
> [ 80.766865] ? kmem_cache_alloc_trace+0xdb/0x1b0
> [ 80.766868] ? kfree+0x161/0x170
> [ 80.766876] do_init_module+0x60/0x202
> [ 80.766881] load_module+0x2612/0x29f0
> [ 80.766885] SYSC_finit_module+0xa6/0xf0
> [ 80.766888] ? SYSC_finit_module+0xa6/0xf0
> [ 80.766892] SyS_finit_module+0xe/0x10
> [ 80.766896] entry_SYSCALL_64_fastpath+0x1e/0xad
> [ 80.766899] RIP: 0033:0x7fa525e60709
> [ 80.766902] RSP: 002b:00007fff2f5bbbf8 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000139
> [ 80.766905] RAX: ffffffffffffffda RBX: 00007fa526129760 RCX: 00007fa525e60709
> [ 80.766908] RDX: 0000000000000000 RSI: 000055f51f1c9439 RDI: 000000000000000b
> [ 80.766910] RBP: 0000000000000070 R08: 0000000000000000 R09: 000055f51fcd83f0
> [ 80.766913] R10: 000000000000000b R11: 0000000000000246 R12: 000055f51fcd9ff0
> [ 80.766915] R13: 0000000000000007 R14: 00007fa5261297b8 R15: 0000000000002710
> [ 80.766931] Kernel Offset: 0x22800000 from 0xffffffff81000000
> (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [ 80.766937] ---[ end Kernel panic - not syncing: stack-protector:
> Kernel stack is corrupted in: ffffffffc0c88942

Trying to move this discussion here for more visibility.

This is what is happening.

In amdgpu_atombios_crtc_powergate_init() we declare
ENABLE_DISP_POWER_GATING_PARAMETERS_V2_1 args as the parameter space.
It is 32 bytes wide and is passed down to the atombios interpreter in
ctx->ps.

Calling amdgpu_atombios_crtc_powergate_init() triggers the parsing of
the command table with index == 13
[>> execute C5C0 (len 589, WS 0, PS 0)]. During the execution of this
table several CALL_TABLE opcodes (op == 82) are executed. In more
detail, we first jump to the table with index == 78
[>> execute F166 (len 588, WS 0, PS 8)], then to the table with
index == 51 [>> execute F446 (len 465, WS 4, PS 4)] and to the table
with index == 75 [>> execute F6CC (len 1330, WS 4, PS 0)] before
finally reaching the EOT for table 13.
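
For anyone following along with the interpreter debug output, the call
chain above can be tabulated with a few lines of Python. This is only a
convenience sketch: the regular expression is written against the four
">> execute" lines quoted in this mail, the table indices are copied in
by hand (they are not part of those lines), and the names TRACE,
parse_execute and call_chain are made up for the example.

```python
import re

# Debug lines copied from this mail; the table index is added by hand.
TRACE = [
    (13, ">> execute C5C0 (len 589, WS 0, PS 0)"),
    (78, ">> execute F166 (len 588, WS 0, PS 8)"),
    (51, ">> execute F446 (len 465, WS 4, PS 4)"),
    (75, ">> execute F6CC (len 1330, WS 4, PS 0)"),
]

EXECUTE_RE = re.compile(
    r">> execute ([0-9A-Fa-f]+) \(len (\d+), WS (\d+), PS (\d+)\)"
)


def parse_execute(line):
    """Return (offset, length, ws, ps) from one '>> execute' debug line."""
    m = EXECUTE_RE.search(line)
    if m is None:
        raise ValueError(f"not an execute line: {line!r}")
    offset, length, ws, ps = m.groups()
    # The offset is printed in hex, the other fields in decimal.
    return int(offset, 16), int(length), int(ws), int(ps)


def call_chain(trace=TRACE):
    """Tabulate the chain as a list of dicts, one per executed table."""
    rows = []
    for index, line in trace:
        offset, length, ws, ps = parse_execute(line)
        rows.append(
            {"table": index, "offset": offset, "len": length, "ws": ws, "ps": ps}
        )
    return rows
```

Printing call_chain() gives one row per table with its WS and PS
values side by side, which makes it easy to see which tables in the
chain declare PS == 0.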
By the time we return to amdgpu_atombios_crtc_powergate_init() the
stack is already corrupted.

The corruption happens while executing the code in table 75
[>> execute F6CC (len 1330, WS 4, PS 0)]. In this table a MOVE_PS is
executed with a destination index == 1, which accesses ctx->ps[idx]
and causes the stack corruption.

My first guess is that something is wrong in the atombios code.
Table 75 has WS == 4 and PS == 0, and looking at the opcodes in the
table I see basically only *_WS opcodes (MOVE_WS, TEST_WS, ADD_WS,
etc.) and just two *_PS instructions (MOVE_PS and OR_PS), which (guess
what) are the instructions causing the stack corruption. So the *_PS
opcodes in the atombios may be wrong, and they should actually be *_WS
opcodes.

Another possibility is that the atombios interpreter is doing
something wrong. After a CALL_TABLE, don't we need to allocate the
parameter space (ctx->ps) for the command table we are about to
execute, sized to match the ps size in that table's header? IIUC the
code in the kernel, ctx->ps is not reallocated when we jump to a
different table.

Thanks,

-- 
Carlo Caione | +39.340.80.30.096 | Endless
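
P.S. To make the second possibility a bit more concrete, here is a toy
model in Python of what a size-checked parameter space could look like.
It is not the kernel's interpreter: CheckedParamSpace, ParamSpaceError
and run_table_75 are hypothetical names, the model assumes ps is
indexed in 32-bit slots, and it assumes the PS value in the table
header is the number of slots the table is allowed to touch. Under
those assumptions a MOVE_PS to index 1 in a table with PS == 0 is
rejected instead of writing past the buffer.

```python
class ParamSpaceError(IndexError):
    """Raised when a *_PS access falls outside the declared window."""


class CheckedParamSpace:
    """Toy stand-in for ctx->ps with an explicit size (hypothetical)."""

    def __init__(self, declared_slots):
        # declared_slots: number of 32-bit slots the table header claims.
        # Treating the header value as a slot count is an assumption.
        self.slots = [0] * declared_slots

    def move_ps(self, idx, value):
        # Model of a MOVE_PS with destination index `idx`.
        if not 0 <= idx < len(self.slots):
            raise ParamSpaceError(
                f"MOVE_PS to ps[{idx}] but only {len(self.slots)} slot(s) declared"
            )
        self.slots[idx] = value & 0xFFFFFFFF


def run_table_75():
    # Table 75 is reported with PS == 0, so it gets no parameter slots here.
    ps = CheckedParamSpace(declared_slots=0)
    try:
        ps.move_ps(1, 0xDEADBEEF)  # destination index == 1, as described above
    except ParamSpaceError as err:
        return str(err)
    return "no error"
```

Whether the real fix belongs in the interpreter or in the caller's
choice of args struct is exactly the open question; the sketch only
shows why a per-table size check would flag the access.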