[AMD Official Use Only - AMD Internal Distribution Only]

> > +void amdgpu_lock_hw_access(struct amdgpu_device *adev);
> > +void amdgpu_unlock_hw_access(struct amdgpu_device *adev);
> > +int amdgpu_begin_hw_access(struct amdgpu_device *adev);
> > +void amdgpu_end_hw_access(struct amdgpu_device *adev);
>
> Don't add anything to amdgpu.h. We are slowly decomposing that file.

Where would be a better place? I just wanted to have them next to amdgpu_in_reset().

> > @@ -5816,6 +5816,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> >                 goto skip_hw_reset;
> >         }
> >
> > +       amdgpu_lock_hw_access(adev);
>
> That should already be locked here. So this will most likely deadlock.
>
> > retry:  /* Rest of adevs pre asic reset from XGMI hive. */
> >         list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
> >                 r = amdgpu_device_pre_asic_reset(tmp_adev, reset_context);
> > @@ -5852,6 +5853,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> >                  */
> >                 amdgpu_device_stop_pending_resets(tmp_adev);
> >         }
> > +       amdgpu_unlock_hw_access(adev);
> >
> > skip_hw_reset:
>
> Don't add helpers for that, especially not with that name.
>
> We don't block HW access, but just prevent concurrent resets.

This is taking a different lock than the reset_domain->sem. It is a separate reset_domain->gpu_sem that is only locked when we will actually do the reset; it is not taken in the skip_hw_reset path.

> >         uint32_t seq;
> >
> > -       if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready ||
> > -           !down_read_trylock(&adev->reset_domain->sem)) {
> > +       /*
> > +        * A GPU reset should flush all TLBs anyway, so no need to do
> > +        * this while one is ongoing.
> > +        */
> > +       if (!amdgpu_begin_hw_access(adev))
> > +               return 0;
> >
> > +       if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready ||
> > +           amdgpu_in_reset(adev)) {
>
> That check doesn't make sense now any more.

Same here, the two checks have different scopes. Although I wasn't sure whether the original check is correct: is there a possibility that !adev->gmc.flush_pasid_uses_kiq and !ring->sched.ready are both false but amdgpu_in_reset(adev) is true? And do we want to take this branch when that happens?

> > @@ -684,12 +684,18 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid,
> >         struct amdgpu_ring *ring = &adev->gfx.kiq[inst].ring;
> >         struct amdgpu_kiq *kiq = &adev->gfx.kiq[inst];
> >         unsigned int ndw;
> > -       signed long r;
> > +       signed long r = 0;
>
> Please don't initialize local variables if it isn't necessary.
>
> >         if (adev->gmc.flush_tlb_needs_extra_type_2)
> >                 adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid,
> >                                                          2, all_hub,
> > @@ -703,46 +709,42 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid,
> >                 adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid,
> >                                                          flush_type, all_hub,
> >                                                          inst);
> > -               return 0;
> > -       }
> > +       } else {
>
> That the locking is missing here should be a separate bug fix independent of other changes.

I will split this off into a separate patch. Initializing r is needed because I consolidated the return paths to drop the read lock.
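To show why r needs the initializer: this is roughly the shape of the consolidated exit path (a simplified sketch, not the exact patch; use_kiq_path(), flush_via_mmio() and kiq_flush_and_wait() are placeholders for the real code, only the locking/return structure matters):

/*
 * Simplified sketch of the consolidated exit path. The helper names
 * (use_kiq_path, flush_via_mmio, kiq_flush_and_wait) are placeholders,
 * not the real functions.
 */
static int flush_gpu_tlb_pasid_sketch(struct amdgpu_device *adev)
{
        signed long r = 0;      /* the MMIO branch never assigns r */

        /* a reset flushes all TLBs anyway, so just skip the flush */
        if (!down_read_trylock(&adev->reset_domain->sem))
                return 0;

        if (!use_kiq_path(adev))
                flush_via_mmio(adev);           /* r stays 0 here */
        else
                r = kiq_flush_and_wait(adev);   /* 0 or negative errno */

        /* single exit: every branch falls through to this unlock */
        up_read(&adev->reset_domain->sem);
        return r;
}

With the early "return 0" gone from the MMIO branch, both branches reach the unlock, so r must already hold 0 on the path that never assigns it.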
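And to make the gpu_sem point above more concrete, this is roughly the shape I have in mind (sketch only; the function names here are placeholders from my WIP and the exact wiring is not final):

/*
 * Rough sketch, not the final patch: gpu_sem sits next to the existing
 * reset_domain->sem and is write-locked only around the code that
 * really resets the ASIC, so the skip_hw_reset path never touches it.
 */
static int reset_hw_locked_sketch(struct amdgpu_device *adev,
                                  struct list_head *device_list,
                                  struct amdgpu_reset_context *reset_context)
{
        int r;

        down_write(&adev->reset_domain->gpu_sem);
        r = amdgpu_do_asic_reset(device_list, reset_context);
        up_write(&adev->reset_domain->gpu_sem);
        return r;
}

/*
 * HW accessors only try-lock for read, so they skip their work while a
 * real reset is in flight and never contend with the outer
 * reset_domain->sem that is already held when we get here.
 */
static bool hw_access_begin_sketch(struct amdgpu_device *adev)
{
        return down_read_trylock(&adev->reset_domain->gpu_sem);
}

static void hw_access_end_sketch(struct amdgpu_device *adev)
{
        up_read(&adev->reset_domain->gpu_sem);
}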