If we don't use the force flag and amdgpu_gpu_recovery = 0, the reset will be ignored. I would prefer that this reset always goes through, as it does for SR-IOV. If we instead depend on the amdgpu_gpu_recovery parameter being enabled, the driver may decide the GPU is hung and trigger a GPU reset while ROCm is running a heavy compute workload that has not actually hung.

Regards
Shaoyun.liu

-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumerken@xxxxxxxxx]
Sent: Friday, January 26, 2018 12:41 PM
To: Liu, Shaoyun; amd-gfx at lists.freedesktop.org
Subject: Re: [PATCH 3/3] drm/amdgpu: reset kfd during amdgpu reset

On 26.01.2018 at 18:38, Shaoyun Liu wrote:
> Change-Id: I222f4bb2c9a91c7a4764e6aa706e7d7f2e6d948d
> Signed-off-by: Shaoyun Liu <Shaoyun.Liu at amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 19 +++++++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  6 ++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  5 +++++
>  3 files changed, 30 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index 2d99099..cb1ee26 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> @@ -239,6 +239,25 @@ int amdgpu_amdkfd_resume(struct amdgpu_device *adev)
>  	return r;
>  }
>
> +void amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev) {
> +	if (adev->kfd)
> +		kgd2kfd->pre_reset(adev->kfd);
> +}
> +
> +void amdgpu_amdkfd_post_reset(struct amdgpu_device *adev) {
> +	if (adev->kfd)
> +		kgd2kfd->post_reset(adev->kfd);
> +}
> +
> +void amdgpu_amdkfd_gpu_reset(struct kgd_dev *kgd) {
> +	struct amdgpu_device *adev = (struct amdgpu_device *)kgd;
> +
> +	amdgpu_device_gpu_recover(adev, NULL, true);

Use false for the force parameter here; apart from that, the set looks good to me.

Regards,
Christian.

> +}
> +
>  int amdgpu_amdkfd_submit_ib(struct kgd_dev *kgd, enum kgd_engine_type engine,
>  				uint32_t vmid, uint64_t gpu_addr,
>  				uint32_t *ib_cmd, uint32_t ib_len)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> index 7c36e52..230761f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> @@ -155,6 +155,12 @@ int amdgpu_amdkfd_copy_mem_to_mem(struct kgd_dev *kgd, struct kgd_mem *src_mem,
>  bool amdgpu_amdkfd_is_kfd_vmid(struct amdgpu_device *adev,
>  				u32 vmid);
>
> +int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev);
> +
> +int amdgpu_amdkfd_post_reset(struct amdgpu_device *adev);
> +
> +void amdgpu_amdkfd_gpu_reset(struct kgd_dev *kgd);
> +
>  /* Shared API */
>  int map_bo(struct amdgpu_device *rdev, uint64_t va, void *vm,
>  		struct amdgpu_bo *bo, struct amdgpu_bo_va **bo_va);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 94f837b..61e7d35 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2660,6 +2660,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>  	atomic_inc(&adev->gpu_reset_counter);
>  	adev->in_gpu_reset = 1;
>
> +	/* Block kfd */
> +	amdgpu_amdkfd_pre_reset(adev);
> +
>  	/* block TTM */
>  	resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
>  	/* store modesetting */
> @@ -2765,6 +2768,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>  		amdgpu_vf_error_put(adev, AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
>  	} else {
>  		dev_info(adev->dev, "GPU reset(%d) successed!\n",atomic_read(&adev->gpu_reset_counter));
> +		/*unlock kfd after a successfully recovery*/
> +		amdgpu_amdkfd_post_reset(adev);
>  	}
>
>  	amdgpu_vf_error_trans_all(adev);
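
For anyone following the force vs. amdgpu_gpu_recovery question at the top of this mail, here is a rough standalone model of the gating being discussed. It is only a sketch and not the driver code: reset_would_proceed(), the enum, and check_thread_scenarios() are hypothetical names made up for illustration, and the "auto" (-1) behaviour is an assumption on my side (recovery enabled only for SR-IOV VFs).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the amdgpu_gpu_recovery module parameter.
 * Assumed meaning: 0 = disabled, 1 = enabled, -1 = auto. */
enum recovery_param {
	RECOVERY_AUTO = -1,
	RECOVERY_OFF = 0,
	RECOVERY_ON = 1,
};

/* Returns true if a reset request would go through, false if it would
 * be ignored. Assumption: in auto mode recovery is only enabled when
 * running as an SR-IOV VF. */
static bool reset_would_proceed(bool force, int recovery, bool is_sriov_vf)
{
	if (force)
		return true;	/* force bypasses the parameter check */
	if (recovery == RECOVERY_OFF)
		return false;	/* recovery disabled: request ignored */
	if (recovery == RECOVERY_AUTO && !is_sriov_vf)
		return false;	/* assumed auto behaviour on bare metal */
	return true;
}

/* The cases raised in this thread, written as assertions. */
static void check_thread_scenarios(void)
{
	/* force = false and amdgpu_gpu_recovery = 0: reset is ignored */
	assert(!reset_would_proceed(false, RECOVERY_OFF, false));
	/* force = true: reset goes through even with recovery disabled */
	assert(reset_would_proceed(true, RECOVERY_OFF, false));
	/* assumed auto mode: SR-IOV goes through, bare metal does not */
	assert(reset_would_proceed(false, RECOVERY_AUTO, true));
	assert(!reset_would_proceed(false, RECOVERY_AUTO, false));
}
```

In this model, passing false from amdgpu_amdkfd_gpu_reset() means a KFD-requested reset only happens when recovery is enabled, which is the trade-off described in the reply above.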