Re: [PATCH 4/4] drm/amdgpu: prevent gpu access during reset recovery

Alex Deucher <alexdeucher@xxxxxxxxx> · Thu, 23 May 2024 14:14:48 -0400



On Thu, May 23, 2024 at 11:32 AM Christian König
<christian.koenig@xxxxxxx> wrote:
>
> Am 23.05.24 um 13:36 schrieb Li, Yunxiang (Teddy):
> > [AMD Official Use Only - AMD Internal Distribution Only]
> >
> >>> +void amdgpu_lock_hw_access(struct amdgpu_device *adev); void
> >>> +amdgpu_unlock_hw_access(struct amdgpu_device *adev); int
> >>> +amdgpu_begin_hw_access(struct amdgpu_device *adev); void
> >>> +amdgpu_end_hw_access(struct amdgpu_device *adev);
> >> Don't add anything to amdgpu.h. We are slowly decomposing that file.
> > Where would be a better place? I just wanted to have them next to amdgpu_in_reset
>
> amdgpu_reset.h if you have time feel free to move amdgpu_in_reset() over
> there as well.
>
> >
> >>> @@ -5816,6 +5816,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> >>>              goto skip_hw_reset;
> >>>      }
> >>>
> >>> +   amdgpu_lock_hw_access(adev);
> >> That should already be locked here. So this will most likely deadlock.
> >>
> >>>    retry:    /* Rest of adevs pre asic reset from XGMI hive. */
> >>>      list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
> >>>              r = amdgpu_device_pre_asic_reset(tmp_adev, reset_context); @@
> >>> -5852,6 +5853,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> >>>               */
> >>>              amdgpu_device_stop_pending_resets(tmp_adev);
> >>>      }
> >>> +   amdgpu_unlock_hw_access(adev);
> >>>
> >>>    skip_hw_reset:
> >> Don't add helpers for that, especially not with that name.
> >>
> >> We don't block HW access, but just prevent concurrent resets.
> > Here is taking a different lock than the reset_domain->sem. It is a seperate reset_domain->gpu_sem that is only locked when we will actuall do reset, it is not taken in the skip_hw_reset path.
>
> Exactly that is what you should *not* do. Please don't add any new lock
> to the code. This is already complicated enough.
>
> If you think that adding wrappers for reset lock makes sense then we can
> probably do that, bot don't add any lock for hw access.
>
>
> >
> >>>      uint32_t seq;
> >>>
> >>> -   if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready ||
> >>> -       !down_read_trylock(&adev->reset_domain->sem)) {
> >>> +   /*
> >>> +   * A GPU reset should flush all TLBs anyway, so no need to do
> >>> +   * this while one is ongoing.
> >>> +   */
> >>> +   if (!amdgpu_begin_hw_access(adev))
> >>> +           return 0;
> >>>
> >>> +   if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready ||
> >>> +       amdgpu_in_reset(adev)) {
> >> That check doesn't makes sense now any more.
> > same here, the two checks are for different scope, although I wasn't sure if the original check is correct or not, is there a possibility that !adev->gmc.flush_pasid_uses_kiq and !ring->sched.ready are false but amdgpu_in_reset(adev) is true? and to we want to take this branch when that happens?
>
> amdgpu_in_reset() in used incorrect in quite a lot of places. It should
> only be used inside the HW backend code to distinct between initial load
> and reset.
>
> All other use cases especially the ones in the IOCTL front end functions
> as well as here in the midlayer which isn't used by GPU reset are incorrect.

FWIW, I started to clean this up earlier this year, but never got
around to finishing up the patches:
https://lists.freedesktop.org/archives/amd-gfx/2024-February/104582.html

Alex

>
> >
> >>> @@ -684,12 +684,18 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid,
> >>>      struct amdgpu_ring *ring = &adev->gfx.kiq[inst].ring;
> >>>      struct amdgpu_kiq *kiq = &adev->gfx.kiq[inst];
> >>>      unsigned int ndw;
> >>> -   signed long r;
> >>> +   signed long r = 0;
> >> Please don't initialize local variables if it isn't necessary.
> >>
> >>>              if (adev->gmc.flush_tlb_needs_extra_type_2)
> >>>                      adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid,
> >>>                                                               2, all_hub,
> >>> @@ -703,46 +709,42 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid,
> >>>              adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid,
> >>>                                                       flush_type, all_hub,
> >>>                                                       inst);
> >>> -           return 0;
> >>> -   }
> >>> +   } else {
> >> That the locking is missing here should be a separate bug fix independent of other changes.
> > I will split this off into a seperate patch, initializing r is needed because I consolidated the return paths to drop the read lock.
>
> In that case better set r when it's not initialized in some path.
>
> Regards,
> Christian.
>