On Thu, May 23, 2024 at 11:32 AM Christian König <christian.koenig@xxxxxxx> wrote: > > Am 23.05.24 um 13:36 schrieb Li, Yunxiang (Teddy): > > [AMD Official Use Only - AMD Internal Distribution Only] > > > >>> +void amdgpu_lock_hw_access(struct amdgpu_device *adev); void > >>> +amdgpu_unlock_hw_access(struct amdgpu_device *adev); int > >>> +amdgpu_begin_hw_access(struct amdgpu_device *adev); void > >>> +amdgpu_end_hw_access(struct amdgpu_device *adev); > >> Don't add anything to amdgpu.h. We are slowly decomposing that file. > > Where would be a better place? I just wanted to have them next to amdgpu_in_reset > > amdgpu_reset.h if you have time feel free to move amdgpu_in_reset() over > there as well. > > > > >>> @@ -5816,6 +5816,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev, > >>> goto skip_hw_reset; > >>> } > >>> > >>> + amdgpu_lock_hw_access(adev); > >> That should already be locked here. So this will most likely deadlock. > >> > >>> retry: /* Rest of adevs pre asic reset from XGMI hive. */ > >>> list_for_each_entry(tmp_adev, device_list_handle, reset_list) { > >>> r = amdgpu_device_pre_asic_reset(tmp_adev, reset_context); @@ > >>> -5852,6 +5853,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev, > >>> */ > >>> amdgpu_device_stop_pending_resets(tmp_adev); > >>> } > >>> + amdgpu_unlock_hw_access(adev); > >>> > >>> skip_hw_reset: > >> Don't add helpers for that, especially not with that name. > >> > >> We don't block HW access, but just prevent concurrent resets. > > Here is taking a different lock than the reset_domain->sem. It is a seperate reset_domain->gpu_sem that is only locked when we will actuall do reset, it is not taken in the skip_hw_reset path. > > Exactly that is what you should *not* do. Please don't add any new lock > to the code. This is already complicated enough. > > If you think that adding wrappers for reset lock makes sense then we can > probably do that, bot don't add any lock for hw access. > > > > > >>> uint32_t seq; > >>> > >>> - if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready || > >>> - !down_read_trylock(&adev->reset_domain->sem)) { > >>> + /* > >>> + * A GPU reset should flush all TLBs anyway, so no need to do > >>> + * this while one is ongoing. > >>> + */ > >>> + if (!amdgpu_begin_hw_access(adev)) > >>> + return 0; > >>> > >>> + if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready || > >>> + amdgpu_in_reset(adev)) { > >> That check doesn't makes sense now any more. > > same here, the two checks are for different scope, although I wasn't sure if the original check is correct or not, is there a possibility that !adev->gmc.flush_pasid_uses_kiq and !ring->sched.ready are false but amdgpu_in_reset(adev) is true? and to we want to take this branch when that happens? > > amdgpu_in_reset() in used incorrect in quite a lot of places. It should > only be used inside the HW backend code to distinct between initial load > and reset. > > All other use cases especially the ones in the IOCTL front end functions > as well as here in the midlayer which isn't used by GPU reset are incorrect. FWIW, I started to clean this up earlier this year, but never got around to finishing up the patches: https://lists.freedesktop.org/archives/amd-gfx/2024-February/104582.html Alex > > > > >>> @@ -684,12 +684,18 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid, > >>> struct amdgpu_ring *ring = &adev->gfx.kiq[inst].ring; > >>> struct amdgpu_kiq *kiq = &adev->gfx.kiq[inst]; > >>> unsigned int ndw; > >>> - signed long r; > >>> + signed long r = 0; > >> Please don't initialize local variables if it isn't necessary. > >> > >>> if (adev->gmc.flush_tlb_needs_extra_type_2) > >>> adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid, > >>> 2, all_hub, > >>> @@ -703,46 +709,42 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid, > >>> adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid, > >>> flush_type, all_hub, > >>> inst); > >>> - return 0; > >>> - } > >>> + } else { > >> That the locking is missing here should be a separate bug fix independent of other changes. > > I will split this off into a seperate patch, initializing r is needed because I consolidated the return paths to drop the read lock. > > In that case better set r when it's not initialized in some path. > > Regards, > Christian. >