Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx> · Tue, 4 Jan 2022 11:56:42 -0500

On 2022-01-04 6:36 a.m., Christian König wrote:
Am 04.01.22 um 11:49 schrieb Liu, Monk:
[AMD Official Use Only]

See the FLR request from the hypervisor is just another source of 
signaling the need for a reset, similar to each job timeout on each 
queue. Otherwise you have a race condition between the hypervisor 
and the scheduler.
No it's not, FLR from hypervisor is just to notify guest the hw VF 
FLR is about to start or was already executed, but host will do FLR 
anyway without waiting for guest too long


Then we have a major design issue in the SRIOV protocol and really 
need to question this.

How do you want to prevent a race between the hypervisor resetting the 
hardware and the client trying the same because of a timeout?

As far as I can see the procedure should be:
1. We detect that a reset is necessary, either because of a fault a 
timeout or signal from hypervisor.
2. For each of those potential reset sources a work item is send to 
the single workqueue.
3. One of those work items execute first and prepares the reset.
4. We either do the reset our self or notify the hypervisor that we 
are ready for the reset.
5. Cleanup after the reset, eventually resubmit jobs etc..
6. Cancel work items which might have been scheduled from other reset 
sources.

It does make sense that the hypervisor resets the hardware without 
waiting for the clients for too long, but if we don't follow this 
general steps we will always have a race between the different 
components.


Monk, just to add to this - if indeed as you say that 'FLR from 
hypervisor is just to notify guest the hw VF FLR is about to start or 
was already executed, but host will do FLR anyway without waiting for 
guest too long'
and there is no strict waiting from the hypervisor for 
IDH_READY_TO_RESET to be recived from guest before starting the reset 
then setting in_gpu_reset and locking reset_sem from guest side is not 
really full proof
protection from MMIO accesses by the guest - it only truly helps if 
hypervisor waits for that message before initiation of HW reset.

Andrey



Regards,
Christian.

Am 04.01.22 um 11:49 schrieb Liu, Monk:
[AMD Official Use Only]

See the FLR request from the hypervisor is just another source of 
signaling the need for a reset, similar to each job timeout on each 
queue. Otherwise you have a race condition between the hypervisor 
and the scheduler.
No it's not, FLR from hypervisor is just to notify guest the hw VF 
FLR is about to start or was already executed, but host will do FLR 
anyway without waiting for guest too long

In other words I strongly think that the current SRIOV reset 
implementation is severely broken and what Andrey is doing is 
actually fixing it.
It makes the code to crash ... how could it be a fix ?

I'm afraid the patch is NAK from me,  but it is welcome if the 
cleanup do not ruin the logic, Andry or jingwen can try it if needed.

Thanks
-------------------------------------------------------------------
Monk Liu | Cloud GPU & Virtualization Solution | AMD
-------------------------------------------------------------------
we are hiring software manager for CVS core team
-------------------------------------------------------------------

-----Original Message-----
From: Koenig, Christian <Christian.Koenig@xxxxxxx>
Sent: Tuesday, January 4, 2022 6:19 PM
To: Chen, JingWen <JingWen.Chen2@xxxxxxx>; Christian König 
<ckoenig.leichtzumerken@xxxxxxxxx>; Grodzovsky, Andrey 
<Andrey.Grodzovsky@xxxxxxx>; Deng, Emily <Emily.Deng@xxxxxxx>; Liu, 
Monk <Monk.Liu@xxxxxxx>; dri-devel@xxxxxxxxxxxxxxxxxxxxx; 
amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Chen, Horace <Horace.Chen@xxxxxxx>; 
Chen, JingWen <JingWen.Chen2@xxxxxxx>
Cc: daniel@xxxxxxxx
Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset 
protection for SRIOV

Hi Jingwen,

well what I mean is that we need to adjust the implementation in 
amdgpu to actually match the requirements.

Could be that the reset sequence is questionable in general, but I 
doubt so at least for now.

See the FLR request from the hypervisor is just another source of 
signaling the need for a reset, similar to each job timeout on each 
queue. Otherwise you have a race condition between the hypervisor and 
the scheduler.

Properly setting in_gpu_reset is indeed mandatory, but should happen 
at a central place and not in the SRIOV specific code.

In other words I strongly think that the current SRIOV reset 
implementation is severely broken and what Andrey is doing is 
actually fixing it.

Regards,
Christian.

Am 04.01.22 um 10:07 schrieb JingWen Chen:
Hi Christian,
I'm not sure what do you mean by "we need to change SRIOV not the 
driver".

Do you mean we should change the reset sequence in SRIOV? This will 
be a huge change for our SRIOV solution.

  From my point of view, we can directly use amdgpu_device_lock_adev
and amdgpu_device_unlock_adev in flr_work instead of try_lock since 
no one will conflict with this thread with reset_domain introduced.
But we do need the reset_sem and adev->in_gpu_reset to keep device 
untouched via user space.

Best Regards,
Jingwen Chen

On 2022/1/3 下午6:17, Christian König wrote:
Please don't. This patch is vital to the cleanup of the reset 
procedure.

If SRIOV doesn't work with that we need to change SRIOV and not the 
driver.

Christian.

Am 30.12.21 um 19:45 schrieb Andrey Grodzovsky:
Sure, I guess i can drop this patch then.

Andrey

On 2021-12-24 4:57 a.m., JingWen Chen wrote:
I do agree with shaoyun, if the host find the gpu engine hangs 
first, and do the flr, guest side thread may not know this and 
still try to access HW(e.g. kfd is using a lot of amdgpu_in_reset 
and reset_sem to identify the reset status). And this may lead to 
very bad result.

On 2021/12/24 下午4:58, Deng, Emily wrote:
These patches look good to me. JingWen will pull these patches 
and do some basic TDR test on sriov environment, and give feedback.

Best wishes
Emily Deng



-----Original Message-----
From: Liu, Monk <Monk.Liu@xxxxxxx>
Sent: Thursday, December 23, 2021 6:14 PM
To: Koenig, Christian <Christian.Koenig@xxxxxxx>; Grodzovsky,
Andrey <Andrey.Grodzovsky@xxxxxxx>;
dri-devel@xxxxxxxxxxxxxxxxxxxxx; amd- gfx@xxxxxxxxxxxxxxxxxxxxx;
Chen, Horace <Horace.Chen@xxxxxxx>; Chen, JingWen
<JingWen.Chen2@xxxxxxx>; Deng, Emily <Emily.Deng@xxxxxxx>
Cc: daniel@xxxxxxxx
Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
protection for SRIOV

[AMD Official Use Only]

@Chen, Horace @Chen, JingWen @Deng, Emily

Please take a review on Andrey's patch

Thanks
-----------------------------------------------------------------
-- Monk Liu | Cloud GPU & Virtualization Solution | AMD
-----------------------------------------------------------------
-- we are hiring software manager for CVS core team
-----------------------------------------------------------------
--

-----Original Message-----
From: Koenig, Christian <Christian.Koenig@xxxxxxx>
Sent: Thursday, December 23, 2021 4:42 PM
To: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>; dri-
devel@xxxxxxxxxxxxxxxxxxxxx; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: daniel@xxxxxxxx; Liu, Monk <Monk.Liu@xxxxxxx>; Chen, Horace
<Horace.Chen@xxxxxxx>
Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
protection for SRIOV

Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
Since now flr work is serialized against  GPU resets there is no
need for this.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
Acked-by: Christian König <christian.koenig@xxxxxxx>

---
     drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
     drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
     2 files changed, 22 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index 487cd654b69e..7d59a66e3988 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
work_struct *work)
         struct amdgpu_device *adev = container_of(virt, struct
amdgpu_device, virt);
         int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;

-    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
-     * otherwise the mailbox msg will be ruined/reseted by
-     * the VF FLR.
-     */
-    if (!down_write_trylock(&adev->reset_sem))
-        return;
-
         amdgpu_virt_fini_data_exchange(adev);
-    atomic_set(&adev->in_gpu_reset, 1);

         xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
0, 0);

@@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
work_struct *work)
         } while (timeout > 1);

     flr_done:
-    atomic_set(&adev->in_gpu_reset, 0);
-    up_write(&adev->reset_sem);
-
         /* Trigger recovery for world switch failure if no TDR
*/
         if (amdgpu_device_should_recover_gpu(adev)
             && (!amdgpu_device_has_job_running(adev) || diff
--git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index e3869067a31d..f82c066c8e8d 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
work_struct *work)
         struct amdgpu_device *adev = container_of(virt, struct
amdgpu_device, virt);
         int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;

-    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
-     * otherwise the mailbox msg will be ruined/reseted by
-     * the VF FLR.
-     */
-    if (!down_write_trylock(&adev->reset_sem))
-        return;
-
         amdgpu_virt_fini_data_exchange(adev);
-    atomic_set(&adev->in_gpu_reset, 1);

         xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0,
0, 0);

@@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
work_struct *work)
         } while (timeout > 1);

     flr_done:
-    atomic_set(&adev->in_gpu_reset, 0);
-    up_write(&adev->reset_sem);
-
         /* Trigger recovery for world switch failure if no TDR
*/
         if (amdgpu_device_should_recover_gpu(adev)
             && (!amdgpu_device_has_job_running(adev) ||