The attached patch should fix it.
Alex
From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> on behalf of Lin, Amber <Amber.Lin@xxxxxxx>
Sent: Wednesday, May 8, 2019 4:56 PM
To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Subject: Kernel crash at reloading amdgpu

[CAUTION: External Email]
Hi,

When I do "rmmod amdgpu; modprobe amdgpu", kernel crashed. This is vega20. What happens is in amdgpu_device_init():

	/* check if we need to reset the asic
	 *  E.g., driver was not cleanly unloaded previously, etc.
	 */
	if (!amdgpu_sriov_vf(adev) && amdgpu_asic_need_reset_on_init(adev)) {
		r = amdgpu_asic_reset(adev);
		if (r) {
			dev_err(adev->dev, "asic reset on init failed\n");
			goto failed;
		}
	}

amdgpu_asic_need_reset_on_init()/soc15_need_reset_on_init() returns true and it goes to amdgpu_asic_reset()/soc15_asic_mode1_reset(), where it calls psp_gpu_reset():

	int psp_gpu_reset(struct amdgpu_device *adev)
	{
		if (adev->firmware.load_type != AMDGPU_FW_LOAD_PSP)
			return 0;

		return psp_mode1_reset(&adev->psp);
	}

Here, however, psp_mode1_reset is NOT assigned as psp_v11_0_mode1_reset() until amdgpu_device_ip_init(), which is after amdgpu_asic_reset. This null function pointer causes the kernel crash and I have to reboot my system. Does anyone have an idea how to fix this properly?

BTW this is the log:

[ 157.686303] PGD 0 P4D 0
[ 157.688837] Oops: 0000 [#1] SMP PTI
[ 157.692331] CPU: 0 PID: 1902 Comm: kworker/0:2 Tainted: G W 5.0.0-rc1-kfd+ #6
[ 157.700760] Hardware name: ASUS All Series/X99-E WS, BIOS 1302 01/05/2016
[ 157.707543] Workqueue: events work_for_cpu_fn
[ 157.711976] RIP: 0010:psp_gpu_reset+0x18/0x30 [amdgpu]
[ 157.717106] Code: ff ff ff 5b c3 b8 ea ff ff ff c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 83 bf c8 22 01 00 02 74 03 31 c0 c3 48 8b 87 c0 23 01 00 <48> 8b 40 50 48 85 c0 74 ed 48 81 c7 88 23 01 00 e9 03 3b 8d d6 0f
[ 157.735852] RSP: 0018:ffffaa2544243ce0 EFLAGS: 00010246
[ 157.741077] RAX: 0000000000000000 RBX: ffff97e946f60000 RCX: 0000000000000000
[ 157.748202] RDX: 0000000000000027 RSI: ffffffff976655a0 RDI: ffff97e946f60000
[ 157.755326] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000002
[ 157.762459] R10: ffffaa2544243ba0 R11: 38a79ac3ec19edd5 R12: ffff97e946f75088
[ 157.769608] R13: 000000000000000a R14: ffff97e946f75128 R15: 0000000000000001
[ 157.776741] FS: 0000000000000000(0000) GS:ffff97e94f800000(0000) knlGS:0000000000000000
[ 157.784827] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 157.790564] CR2: 0000000000000050 CR3: 00000008083e6003 CR4: 00000000001606f0
[ 157.797696] Call Trace:
[ 157.800184] soc15_asic_reset+0x81/0x1f0 [amdgpu]
[ 157.804936] amdgpu_device_init+0xcf1/0x1800 [amdgpu]
[ 157.809993] ? rcu_read_lock_sched_held+0x74/0x80
[ 157.814734] amdgpu_driver_load_kms+0x65/0x270 [amdgpu]

Thanks.

Regards,
Amber
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
From e16ca183aa04fcfb46828d0824336bc51c4f44c8 Mon Sep 17 00:00:00 2001
From: Alex Deucher <alexander.deucher@xxxxxxx>
Date: Wed, 8 May 2019 21:45:06 -0500
Subject: [PATCH] drm/amdgpu/psp: move psp version specific function pointers to early_init

In case we need to use them for GPU reset prior initializing the
asic. Fixes a crash if the driver attempts to reset the GPU at driver
load time.

Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 905cce1814f3..05897b05766b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -38,18 +38,10 @@ static void psp_set_funcs(struct amdgpu_device *adev);
 static int psp_early_init(void *handle)
 {
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
+	struct psp_context *psp = &adev->psp;
 
 	psp_set_funcs(adev);
 
-	return 0;
-}
-
-static int psp_sw_init(void *handle)
-{
-	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
-	struct psp_context *psp = &adev->psp;
-	int ret;
-
 	switch (adev->asic_type) {
 	case CHIP_VEGA10:
 	case CHIP_VEGA12:
@@ -67,6 +59,15 @@ static int psp_sw_init(void *handle)
 
 	psp->adev = adev;
 
+	return 0;
+}
+
+static int psp_sw_init(void *handle)
+{
+	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
+	struct psp_context *psp = &adev->psp;
+	int ret;
+
 	ret = psp_init_microcode(psp);
 	if (ret) {
 		DRM_ERROR("Failed to load psp firmware!\n");
-- 
2.20.1
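
For readers who want to see the ordering problem in isolation, here is a small user-space sketch of it. It is only a model under stated assumptions: every name prefixed with model_ is hypothetical and does not exist in the amdgpu driver, and the NULL check in model_gpu_reset() is added purely so the sketch can run safely; it is not part of the patch above. The funcs_in_early_init flag selects between the old ordering (function table hooked up only in sw_init) and the ordering the patch introduces (hooked up in early_init, before the reset on init).

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver structures; not the real amdgpu types. */
struct model_psp;

struct model_psp_funcs {
	int (*mode1_reset)(struct model_psp *psp);
};

struct model_psp {
	/* NULL until the version-specific function table is hooked up. */
	const struct model_psp_funcs *funcs;
};

static int model_mode1_reset(struct model_psp *psp)
{
	(void)psp;
	return 0; /* pretend the reset succeeded */
}

static const struct model_psp_funcs model_v11_funcs = {
	.mode1_reset = model_mode1_reset,
};

/*
 * Guarded reset call (an assumption of this sketch): returns -EINVAL
 * instead of dereferencing a NULL function table.
 */
static int model_gpu_reset(struct model_psp *psp)
{
	if (!psp->funcs || !psp->funcs->mode1_reset)
		return -EINVAL;
	return psp->funcs->mode1_reset(psp);
}

/* With funcs_in_early_init set, the table is assigned here, as in the patch. */
static void model_early_init(struct model_psp *psp, int funcs_in_early_init)
{
	if (funcs_in_early_init)
		psp->funcs = &model_v11_funcs;
}

/* Old ordering: the table is only assigned at this later stage. */
static void model_sw_init(struct model_psp *psp)
{
	if (!psp->funcs)
		psp->funcs = &model_v11_funcs;
}

/*
 * Mirrors the order described in the report:
 * early init, then reset on init, then sw_init.
 * Returns the result of the reset attempt.
 */
static int model_device_init(struct model_psp *psp, int funcs_in_early_init)
{
	int r;

	psp->funcs = NULL;
	model_early_init(psp, funcs_in_early_init);
	r = model_gpu_reset(psp);
	model_sw_init(psp);
	return r;
}
```

In this model, model_device_init(&psp, 0) reports -EINVAL because the table is still NULL when the reset runs, while model_device_init(&psp, 1) returns 0. In the real driver there was no such guard at that point, which is consistent with the oops above faulting on a load through a NULL pointer (RAX is 0 and CR2 is 0x50).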