[AMD Official Use Only - Internal Distribution Only]
This patch shluld fix the panic.
but I would like you do NOT add adev xgmi head to the local device list. if ras ue occurs while the gpu is already in gpu recovery.
From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> on behalf of Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
Sent: Friday, April 17, 2020 5:17:22 PM
To: Chen, Guchun <Guchun.Chen@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Li, Dennis <Dennis.Li@xxxxxxx>; Clements, John <John.Clements@xxxxxxx>
Subject: Re: [PATCH] drm/amdgpu: fix kernel page fault issue by ras recovery on sGPU
Sent: Friday, April 17, 2020 5:17:22 PM
To: Chen, Guchun <Guchun.Chen@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Li, Dennis <Dennis.Li@xxxxxxx>; Clements, John <John.Clements@xxxxxxx>
Subject: Re: [PATCH] drm/amdgpu: fix kernel page fault issue by ras recovery on sGPU
Am 16.04.20 um 17:47 schrieb Guchun Chen:
> When running ras uncorrectable error injection and trigger GPU
> reset on sGPU, below issue is observed. It's caused by the list
> uninitialized when accessing.
>
> [ 80.047227] BUG: unable to handle page fault for address: ffffffffc0f4f750
> [ 80.047300] #PF: supervisor write access in kernel mode
> [ 80.047351] #PF: error_code(0x0003) - permissions violation
> [ 80.047404] PGD 12c20e067 P4D 12c20e067 PUD 12c210067 PMD 41c4ee067 PTE 404316061
> [ 80.047477] Oops: 0003 [#1] SMP PTI
> [ 80.047516] CPU: 7 PID: 377 Comm: kworker/7:2 Tainted: G OE 5.4.0-rc7-guchchen #1
> [ 80.047594] Hardware name: System manufacturer System Product Name/TUF Z370-PLUS GAMING II, BIOS 0411 09/21/2018
> [ 80.047888] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
>
> Signed-off-by: Guchun Chen <guchun.chen@xxxxxxx>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index b27d9d62c9df..260b4a42e0ae 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -1448,9 +1448,10 @@ static void amdgpu_ras_do_recovery(struct work_struct *work)
> struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev, false);
>
> /* Build list of devices to query RAS related errors */
> - if (hive && adev->gmc.xgmi.num_physical_nodes > 1) {
> + if (hive && adev->gmc.xgmi.num_physical_nodes > 1)
> device_list_handle = &hive->device_list;
> - } else {
> + else {
The coding style here is incorrect. If one branch of an if/else uses {}
the other(s) should use it as well.
> + INIT_LIST_HEAD(&device_list);
That was suggested before but then reverted, but I'm not sure why.
Regards,
Christian.
> list_add_tail(&adev->gmc.xgmi.head, &device_list);
> device_list_handle = &device_list;
> }
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://nam11.safelinks.protection.outlook.com/?url="">
> When running ras uncorrectable error injection and trigger GPU
> reset on sGPU, below issue is observed. It's caused by the list
> uninitialized when accessing.
>
> [ 80.047227] BUG: unable to handle page fault for address: ffffffffc0f4f750
> [ 80.047300] #PF: supervisor write access in kernel mode
> [ 80.047351] #PF: error_code(0x0003) - permissions violation
> [ 80.047404] PGD 12c20e067 P4D 12c20e067 PUD 12c210067 PMD 41c4ee067 PTE 404316061
> [ 80.047477] Oops: 0003 [#1] SMP PTI
> [ 80.047516] CPU: 7 PID: 377 Comm: kworker/7:2 Tainted: G OE 5.4.0-rc7-guchchen #1
> [ 80.047594] Hardware name: System manufacturer System Product Name/TUF Z370-PLUS GAMING II, BIOS 0411 09/21/2018
> [ 80.047888] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
>
> Signed-off-by: Guchun Chen <guchun.chen@xxxxxxx>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index b27d9d62c9df..260b4a42e0ae 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -1448,9 +1448,10 @@ static void amdgpu_ras_do_recovery(struct work_struct *work)
> struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev, false);
>
> /* Build list of devices to query RAS related errors */
> - if (hive && adev->gmc.xgmi.num_physical_nodes > 1) {
> + if (hive && adev->gmc.xgmi.num_physical_nodes > 1)
> device_list_handle = &hive->device_list;
> - } else {
> + else {
The coding style here is incorrect. If one branch of an if/else uses {}
the other(s) should use it as well.
> + INIT_LIST_HEAD(&device_list);
That was suggested before but then reverted, but I'm not sure why.
Regards,
Christian.
> list_add_tail(&adev->gmc.xgmi.head, &device_list);
> device_list_handle = &device_list;
> }
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://nam11.safelinks.protection.outlook.com/?url="">
_______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx