Re: Kernel crash at reloading amdgpu

The attached patch should fix it.

Alex


From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> on behalf of Lin, Amber <Amber.Lin@xxxxxxx>
Sent: Wednesday, May 8, 2019 4:56 PM
To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Subject: Kernel crash at reloading amdgpu
 
Hi,

When I run "rmmod amdgpu; modprobe amdgpu", the kernel crashes. This is
on Vega20. The problem is in amdgpu_device_init():


         /* check if we need to reset the asic
          *  E.g., driver was not cleanly unloaded previously, etc.
          */
         if (!amdgpu_sriov_vf(adev) && amdgpu_asic_need_reset_on_init(adev)) {
                 r = amdgpu_asic_reset(adev);
                 if (r) {
                         dev_err(adev->dev, "asic reset on init failed\n");
                         goto failed;
                 }
         }

amdgpu_asic_need_reset_on_init()/soc15_need_reset_on_init() returns true
and it goes to amdgpu_asic_reset()/soc15_asic_mode1_reset(), where it
calls psp_gpu_reset():

         int psp_gpu_reset(struct amdgpu_device *adev)
         {
             if (adev->firmware.load_type != AMDGPU_FW_LOAD_PSP)
                     return 0;

             return psp_mode1_reset(&adev->psp);
         }

Here, however, psp_mode1_reset is NOT set to psp_v11_0_mode1_reset()
until amdgpu_device_ip_init(), which runs after amdgpu_asic_reset().
The resulting NULL pointer dereference crashes the kernel, and I have
to reboot my system.
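
To make the ordering problem concrete, here is a minimal standalone sketch
in plain C. It is not driver code: struct psp_ctx, struct psp_funcs,
install_funcs() and guarded_reset() are hypothetical stand-ins invented for
illustration. It only shows that a reset routed through a function table
cannot work until that table has been installed, and how a guard turns the
crash into an error return.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the driver structures (not the amdgpu ones). */
struct psp_ctx;

struct psp_funcs {
	int (*mode1_reset)(struct psp_ctx *psp);
};

struct psp_ctx {
	/* NULL until the version-specific table has been installed. */
	const struct psp_funcs *funcs;
};

/* Placeholder for a version-specific reset routine; always succeeds. */
int fake_mode1_reset(struct psp_ctx *psp)
{
	(void)psp;
	return 0;
}

const struct psp_funcs fake_v11_funcs = {
	.mode1_reset = fake_mode1_reset,
};

/* Models the init step that installs the version-specific table. */
void install_funcs(struct psp_ctx *psp)
{
	psp->funcs = &fake_v11_funcs;
}

/*
 * Reset entry point with a guard: returns -1 when the table (or the
 * mode1_reset slot) is missing instead of dereferencing a NULL pointer.
 */
int guarded_reset(struct psp_ctx *psp)
{
	if (!psp->funcs || !psp->funcs->mode1_reset)
		return -1;
	return psp->funcs->mode1_reset(psp);
}
```

With a context whose funcs member is still NULL, guarded_reset() returns -1;
after install_funcs() it returns 0. A guard like this only avoids the oops.
Actually performing the reset still requires the table to be installed
before the reset path runs.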

Does anyone have an idea how to fix this properly?

BTW this is the log:

[  157.686303] PGD 0 P4D 0
[  157.688837] Oops: 0000 [#1] SMP PTI
[  157.692331] CPU: 0 PID: 1902 Comm: kworker/0:2 Tainted: G W 5.0.0-rc1-kfd+ #6
[  157.700760] Hardware name: ASUS All Series/X99-E WS, BIOS 1302 01/05/2016
[  157.707543] Workqueue: events work_for_cpu_fn
[  157.711976] RIP: 0010:psp_gpu_reset+0x18/0x30 [amdgpu]
[  157.717106] Code: ff ff ff 5b c3 b8 ea ff ff ff c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 83 bf c8 22 01 00 02 74 03 31 c0 c3 48 8b 87 c0 23 01 00 <48> 8b 40 50 48 85 c0 74 ed 48 81 c7 88 23 01 00 e9 03 3b 8d d6 0f
[  157.735852] RSP: 0018:ffffaa2544243ce0 EFLAGS: 00010246
[  157.741077] RAX: 0000000000000000 RBX: ffff97e946f60000 RCX: 0000000000000000
[  157.748202] RDX: 0000000000000027 RSI: ffffffff976655a0 RDI: ffff97e946f60000
[  157.755326] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000002
[  157.762459] R10: ffffaa2544243ba0 R11: 38a79ac3ec19edd5 R12: ffff97e946f75088
[  157.769608] R13: 000000000000000a R14: ffff97e946f75128 R15: 0000000000000001
[  157.776741] FS:  0000000000000000(0000) GS:ffff97e94f800000(0000) knlGS:0000000000000000
[  157.784827] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  157.790564] CR2: 0000000000000050 CR3: 00000008083e6003 CR4: 00000000001606f0
[  157.797696] Call Trace:
[  157.800184]  soc15_asic_reset+0x81/0x1f0 [amdgpu]
[  157.804936]  amdgpu_device_init+0xcf1/0x1800 [amdgpu]
[  157.809993]  ? rcu_read_lock_sched_held+0x74/0x80
[  157.814734]  amdgpu_driver_load_kms+0x65/0x270 [amdgpu]
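
A note on reading the dump above: RAX is 0 and CR2 (the faulting address) is
0x50, and the marked instruction bytes "48 8b 40 50" decode to a 64-bit load
from 0x50(%rax). That pattern is consistent with a NULL table pointer being
dereferenced at a member offset of 0x50. The sketch below shows the
arithmetic only. The struct layout is a hypothetical placeholder, chosen so
that the member lands at 0x50 on a 64-bit build; it is not taken from the
driver source.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical function table: ten placeholder slots followed by
 * mode1_reset. The real table layout is an assumption here.
 */
struct fake_psp_funcs {
	void (*slot[10])(void);
	int (*mode1_reset)(void *psp);
};

/*
 * Address touched when loading table->mode1_reset for a given table
 * base address. With a NULL base this is just the member offset.
 */
uintptr_t mode1_reset_slot_addr(uintptr_t table_base)
{
	return table_base + offsetof(struct fake_psp_funcs, mode1_reset);
}
```

With 8-byte function pointers, mode1_reset_slot_addr(0) is 10 * 8 = 0x50,
which matches the CR2 value in the log.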

Thanks.

Regards,
Amber
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
From e16ca183aa04fcfb46828d0824336bc51c4f44c8 Mon Sep 17 00:00:00 2001
From: Alex Deucher <alexander.deucher@xxxxxxx>
Date: Wed, 8 May 2019 21:45:06 -0500
Subject: [PATCH] drm/amdgpu/psp: move psp version specific function pointers
 to early_init

In case we need to use them for GPU reset prior to initializing the
asic.  Fixes a crash if the driver attempts to reset the GPU at driver
load time.

Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 905cce1814f3..05897b05766b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -38,18 +38,10 @@ static void psp_set_funcs(struct amdgpu_device *adev);
 static int psp_early_init(void *handle)
 {
 	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
+	struct psp_context *psp = &adev->psp;
 
 	psp_set_funcs(adev);
 
-	return 0;
-}
-
-static int psp_sw_init(void *handle)
-{
-	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
-	struct psp_context *psp = &adev->psp;
-	int ret;
-
 	switch (adev->asic_type) {
 	case CHIP_VEGA10:
 	case CHIP_VEGA12:
@@ -67,6 +59,15 @@ static int psp_sw_init(void *handle)
 
 	psp->adev = adev;
 
+	return 0;
+}
+
+static int psp_sw_init(void *handle)
+{
+	struct amdgpu_device *adev = (struct amdgpu_device *)handle;
+	struct psp_context *psp = &adev->psp;
+	int ret;
+
 	ret = psp_init_microcode(psp);
 	if (ret) {
 		DRM_ERROR("Failed to load psp firmware!\n");
-- 
2.20.1
