On 11/22/2018 06:16 AM, Christian König wrote:
> How about using a lock per hive and then acquiring that with trylock()
> instead?
>
> This way you should at least catch cases where multiple causes try to
> reset the same hive at the same time. True that there is still some
> racing involved, but it's at least a good start.

What about using a per-hive work_struct which, when scheduled, executes
amdgpu_device_gpu_recover? Since the work queue rejects duplicate
submissions of an already pending work item, we get per-hive serialization
automatically without introducing any new mutex. (A rough sketch of this
idea is appended at the end of this mail, after the quoted patch.)

This might also be necessary for RAS anyway: as I understand it, RAS will
raise an interrupt when errors are detected and may then decide to reset
the GPU, so we will have to switch to bottom-half context in that path in
any case. And I don't think it's a problem to schedule this work from the
job timeout handler if needed.

>
> Additional to that I would try improve the pre, middle, post handling
> towards checking if we made some progress in between.
>
> In other words we stop all schedulers in the pre handling and
> disconnect the scheduler fences from the hardware fence like I did in
> patch "drm/sched: fix timeout handling v2".
>
> Then before we do the actual reset in the middle handling we check if
> the offending job has completed or at least made some progress in the
> meantime.

I understand how to check whether the job completed - whether its fence
has already signaled - but how do I test whether the job made 'at least
some progress'?

Another question - what is the purpose of this progress check? If I've
already completed the pre handling sequence, I can't bail out even if the
guilty job has signaled by the time I do the progress check; I still have
to complete at least the post handling. Do you mean I can at least skip
the ASIC reset phase in that case?

Andrey

> In the case of a manual reset we skip that because we don't have an
> offending job to check.
>
> In the post handling we stitch everything together again and start the
> scheduler to go on with job submission.
>
> Christian.
>
> Am 21.11.18 um 23:02 schrieb Grodzovsky, Andrey:
>> Depends what was the reason for triggering the reset for that node how
>> do we know ?
>> If the reason was RAS error that probably not hard to check all errors
>> are cleared, but
>> if the reason was job timeout on that specific node I will need to
>> recheck that no jobs are left in incomplete state.
>> And if the reason is manual gpu reset trigger from sysfs, then
>> what's the policy ?
>> Sounds to me it's just easier to go ahead and allow all the pending
>> resets to proceed unless there is
>> a clear and quick criteria you can check after you grab the mutex then
>> sure - but I don't know what it would be.
>>
>> Andrey
>>
>> On 11/21/2018 03:49 PM, Liu, Shaoyun wrote:
>>> I saw you use the global xgmi_mutex to prevent concurrent reset to be
>>> triggered by different nodes , but after the mutex been released ,
>>> current node may grab the mutex and continue to do another reset .
>>> Maybe we should check the GPU status and skip the reset in this case
>>> since the GPU may already be in good state .
>>>
>>> Regards
>>>
>>> shaoyun.liu
>>>
>>> On 2018-11-21 1:10 p.m., Andrey Grodzovsky wrote:
>>>> For XGMI hive case do reset in steps where each step iterates over
>>>> all devs in hive. This especially important for asic reset
>>>> since all PSP FW in hive must come up within a limited time
>>>> (around 1 sec) to properly negotiate the link.
>>>> Do this by refactoring amdgpu_device_gpu_recover and >>>> amdgpu_device_reset >>>> into pre_asic_reset, asic_reset and post_asic_reset functions where >>>> is part >>>> is exectued for all the GPUs in the hive before going to the next >>>> step. >>>> >>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx> >>>> --- >>>> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 5 +- >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 375 >>>> ++++++++++++++++++++--------- >>>> 2 files changed, 264 insertions(+), 116 deletions(-) >>>> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>>> index 4ef5f7a..bd06d45 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>>> @@ -1026,6 +1026,9 @@ struct amdgpu_device { >>>> unsigned long last_mm_index; >>>> bool in_gpu_reset; >>>> struct mutex lock_reset; >>>> + >>>> + int asic_reset_res; >>>> + int resched; >>>> }; >>>> static inline struct amdgpu_device *amdgpu_ttm_adev(struct >>>> ttm_bo_device *bdev) >>>> @@ -1232,7 +1235,7 @@ struct amdgpu_hive_info; >>>> struct list_head *amdgpu_xgmi_get_adev_list_handle(struct >>>> amdgpu_hive_info *hive); >>>> struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct >>>> amdgpu_device *adev); >>>> -int amdgpu_xgmi_update_topology(struct amdgpu_hive_info *hive); >>>> +int amdgpu_xgmi_update_topology(struct amdgpu_hive_info *hive, >>>> struct amdgpu_device *adev); >>>> int amdgpu_xgmi_add_device(struct amdgpu_device *adev); >>>> /* >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>> index cb06e68..8e94d7f 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>> @@ -3157,86 +3157,6 @@ static int amdgpu_device_recover_vram(struct >>>> amdgpu_device *adev) >>>> return 0; >>>> } >>>> -/** >>>> - * amdgpu_device_reset - reset ASIC/GPU for bare-metal or passthrough >>>> - * >>>> - * @adev: amdgpu device pointer >>>> - * >>>> - * attempt to do soft-reset or full-reset and reinitialize Asic >>>> - * return 0 means succeeded otherwise failed >>>> - */ >>>> -static int amdgpu_device_reset(struct amdgpu_device *adev) >>>> -{ >>>> - bool need_full_reset, vram_lost = 0; >>>> - int r; >>>> - >>>> - need_full_reset = amdgpu_device_ip_need_full_reset(adev); >>>> - >>>> - if (!need_full_reset) { >>>> - amdgpu_device_ip_pre_soft_reset(adev); >>>> - r = amdgpu_device_ip_soft_reset(adev); >>>> - amdgpu_device_ip_post_soft_reset(adev); >>>> - if (r || amdgpu_device_ip_check_soft_reset(adev)) { >>>> - DRM_INFO("soft reset failed, will fallback to full >>>> reset!\n"); >>>> - need_full_reset = true; >>>> - } >>>> - } >>>> - >>>> - if (need_full_reset) { >>>> - r = amdgpu_device_ip_suspend(adev); >>>> - >>>> -retry: >>>> - r = amdgpu_asic_reset(adev); >>>> - /* post card */ >>>> - amdgpu_atom_asic_init(adev->mode_info.atom_context); >>>> - >>>> - if (!r) { >>>> - dev_info(adev->dev, "GPU reset succeeded, trying to >>>> resume\n"); >>>> - r = amdgpu_device_ip_resume_phase1(adev); >>>> - if (r) >>>> - goto out; >>>> - >>>> - vram_lost = amdgpu_device_check_vram_lost(adev); >>>> - if (vram_lost) { >>>> - DRM_ERROR("VRAM is lost!\n"); >>>> - atomic_inc(&adev->vram_lost_counter); >>>> - } >>>> - >>>> - r = amdgpu_gtt_mgr_recover( >>>> - &adev->mman.bdev.man[TTM_PL_TT]); >>>> - if (r) >>>> - goto out; >>>> - >>>> - r = amdgpu_device_fw_loading(adev); >>>> - if (r) >>>> - return r; >>>> - >>>> - r = 
amdgpu_device_ip_resume_phase2(adev); >>>> - if (r) >>>> - goto out; >>>> - >>>> - if (vram_lost) >>>> - amdgpu_device_fill_reset_magic(adev); >>>> - } >>>> - } >>>> - >>>> -out: >>>> - if (!r) { >>>> - amdgpu_irq_gpu_reset_resume_helper(adev); >>>> - r = amdgpu_ib_ring_tests(adev); >>>> - if (r) { >>>> - dev_err(adev->dev, "ib ring test failed (%d).\n", r); >>>> - r = amdgpu_device_ip_suspend(adev); >>>> - need_full_reset = true; >>>> - goto retry; >>>> - } >>>> - } >>>> - >>>> - if (!r) >>>> - r = amdgpu_device_recover_vram(adev); >>>> - >>>> - return r; >>>> -} >>>> /** >>>> * amdgpu_device_reset_sriov - reset ASIC for SR-IOV vf >>>> @@ -3335,31 +3255,16 @@ bool >>>> amdgpu_device_should_recover_gpu(struct amdgpu_device *adev) >>>> return false; >>>> } >>>> -/** >>>> - * amdgpu_device_gpu_recover - reset the asic and recover scheduler >>>> - * >>>> - * @adev: amdgpu device pointer >>>> - * @job: which job trigger hang >>>> - * >>>> - * Attempt to reset the GPU if it has hung (all asics). >>>> - * Returns 0 for success or an error on failure. >>>> - */ >>>> -int amdgpu_device_gpu_recover(struct amdgpu_device *adev, >>>> - struct amdgpu_job *job) >>>> -{ >>>> - int i, r, resched; >>>> - >>>> - dev_info(adev->dev, "GPU reset begin!\n"); >>>> - >>>> - mutex_lock(&adev->lock_reset); >>>> - atomic_inc(&adev->gpu_reset_counter); >>>> - adev->in_gpu_reset = 1; >>>> - /* Block kfd */ >>>> - amdgpu_amdkfd_pre_reset(adev); >>>> +static int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev, >>>> + struct amdgpu_job *job, >>>> + bool *need_full_reset_arg) >>>> +{ >>>> + int i, r = 0; >>>> + bool need_full_reset = *need_full_reset_arg; >>>> /* block TTM */ >>>> - resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev); >>>> + adev->resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev); >>>> /* block all schedulers and reset given job's ring */ >>>> for (i = 0; i < AMDGPU_MAX_RINGS; ++i) { >>>> @@ -3379,10 +3284,121 @@ int amdgpu_device_gpu_recover(struct >>>> amdgpu_device *adev, >>>> amdgpu_fence_driver_force_completion(ring); >>>> } >>>> - if (amdgpu_sriov_vf(adev)) >>>> - r = amdgpu_device_reset_sriov(adev, job ? 
false : true); >>>> - else >>>> - r = amdgpu_device_reset(adev); >>>> + if (!amdgpu_sriov_vf(adev)) { >>>> + >>>> + if (!need_full_reset) >>>> + need_full_reset = amdgpu_device_ip_need_full_reset(adev); >>>> + >>>> + if (!need_full_reset) { >>>> + amdgpu_device_ip_pre_soft_reset(adev); >>>> + r = amdgpu_device_ip_soft_reset(adev); >>>> + amdgpu_device_ip_post_soft_reset(adev); >>>> + if (r || amdgpu_device_ip_check_soft_reset(adev)) { >>>> + DRM_INFO("soft reset failed, will fallback to full >>>> reset!\n"); >>>> + need_full_reset = true; >>>> + } >>>> + } >>>> + >>>> + if (need_full_reset) >>>> + r = amdgpu_device_ip_suspend(adev); >>>> + >>>> + *need_full_reset_arg = need_full_reset; >>>> + } >>>> + >>>> + return r; >>>> +} >>>> + >>>> +static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive, >>>> + struct list_head *device_list_handle, >>>> + bool *need_full_reset_arg) >>>> +{ >>>> + struct amdgpu_device *tmp_adev = NULL; >>>> + bool need_full_reset = *need_full_reset_arg, vram_lost = false; >>>> + int r = 0; >>>> + >>>> + /* >>>> + * ASIC reset has to be done on all HGMI hive nodes ASAP >>>> + * to allow proper links negotiation in FW (within 1 sec) >>>> + */ >>>> + if (need_full_reset) { >>>> + list_for_each_entry(tmp_adev, device_list_handle, >>>> gmc.xgmi.head) { >>>> + r = amdgpu_asic_reset(tmp_adev); >>>> + if (r) >>>> + DRM_WARN("ASIC reset failed with err r, %d for drm >>>> dev, %s", >>>> + r, tmp_adev->ddev->unique); >>>> + } >>>> + } >>>> + >>>> + >>>> + list_for_each_entry(tmp_adev, device_list_handle, >>>> gmc.xgmi.head) { >>>> + if (need_full_reset) { >>>> + /* post card */ >>>> + if >>>> (amdgpu_atom_asic_init(tmp_adev->mode_info.atom_context)) >>>> + DRM_WARN("asic atom init failed!"); >>>> + >>>> + if (!r) { >>>> + dev_info(tmp_adev->dev, "GPU reset succeeded, >>>> trying to resume\n"); >>>> + r = amdgpu_device_ip_resume_phase1(tmp_adev); >>>> + if (r) >>>> + goto out; >>>> + >>>> + vram_lost = amdgpu_device_check_vram_lost(tmp_adev); >>>> + if (vram_lost) { >>>> + DRM_ERROR("VRAM is lost!\n"); >>>> + atomic_inc(&tmp_adev->vram_lost_counter); >>>> + } >>>> + >>>> + r = amdgpu_gtt_mgr_recover( >>>> + &tmp_adev->mman.bdev.man[TTM_PL_TT]); >>>> + if (r) >>>> + goto out; >>>> + >>>> + r = amdgpu_device_fw_loading(tmp_adev); >>>> + if (r) >>>> + return r; >>>> + >>>> + r = amdgpu_device_ip_resume_phase2(tmp_adev); >>>> + if (r) >>>> + goto out; >>>> + >>>> + if (vram_lost) >>>> + amdgpu_device_fill_reset_magic(tmp_adev); >>>> + >>>> + /* Update PSP FW topology after reset */ >>>> + if (tmp_adev->gmc.xgmi.num_physical_nodes > 1) >>>> + r = amdgpu_xgmi_update_topology(hive, tmp_adev); >>>> + } >>>> + } >>>> + >>>> + >>>> +out: >>>> + if (!r) { >>>> + amdgpu_irq_gpu_reset_resume_helper(tmp_adev); >>>> + r = amdgpu_ib_ring_tests(tmp_adev); >>>> + if (r) { >>>> + dev_err(tmp_adev->dev, "ib ring test failed >>>> (%d).\n", r); >>>> + r = amdgpu_device_ip_suspend(tmp_adev); >>>> + need_full_reset = true; >>>> + r = -EAGAIN; >>>> + goto end; >>>> + } >>>> + } >>>> + >>>> + if (!r) >>>> + r = amdgpu_device_recover_vram(tmp_adev); >>>> + else >>>> + tmp_adev->asic_reset_res = r; >>>> + } >>>> + >>>> +end: >>>> + *need_full_reset_arg = need_full_reset; >>>> + return r; >>>> +} >>>> + >>>> +static void amdgpu_device_post_asic_reset(struct amdgpu_device *adev, >>>> + struct amdgpu_job *job) >>>> +{ >>>> + int i; >>>> for (i = 0; i < AMDGPU_MAX_RINGS; ++i) { >>>> struct amdgpu_ring *ring = adev->rings[i]; >>>> @@ -3394,7 +3410,7 @@ int amdgpu_device_gpu_recover(struct >>>> 
amdgpu_device *adev, >>>> * or all rings (in the case @job is NULL) >>>> * after above amdgpu_reset accomplished >>>> */ >>>> - if ((!job || job->base.sched == &ring->sched) && !r) >>>> + if ((!job || job->base.sched == &ring->sched) && >>>> !adev->asic_reset_res) >>>> drm_sched_job_recovery(&ring->sched); >>>> kthread_unpark(ring->sched.thread); >>>> @@ -3404,21 +3420,150 @@ int amdgpu_device_gpu_recover(struct >>>> amdgpu_device *adev, >>>> drm_helper_resume_force_mode(adev->ddev); >>>> } >>>> - ttm_bo_unlock_delayed_workqueue(&adev->mman.bdev, resched); >>>> + ttm_bo_unlock_delayed_workqueue(&adev->mman.bdev, adev->resched); >>>> - if (r) { >>>> - /* bad news, how to tell it to userspace ? */ >>>> - dev_info(adev->dev, "GPU reset(%d) failed\n", >>>> atomic_read(&adev->gpu_reset_counter)); >>>> - amdgpu_vf_error_put(adev, AMDGIM_ERROR_VF_GPU_RESET_FAIL, >>>> 0, r); >>>> - } else { >>>> - dev_info(adev->dev, "GPU reset(%d) >>>> succeeded!\n",atomic_read(&adev->gpu_reset_counter)); >>>> - } >>>> + adev->asic_reset_res = adev->resched = 0; >>>> + >>>> +} >>>> +static void amdgpu_lock_adev(struct amdgpu_device *adev) >>>> +{ >>>> + mutex_lock(&adev->lock_reset); >>>> + atomic_inc(&adev->gpu_reset_counter); >>>> + adev->in_gpu_reset = 1; >>>> + /* Block kfd */ >>>> + amdgpu_amdkfd_pre_reset(adev); >>>> +} >>>> + >>>> +static void amdgpu_unlock_adev(struct amdgpu_device *adev) >>>> +{ >>>> /*unlock kfd */ >>>> amdgpu_amdkfd_post_reset(adev); >>>> amdgpu_vf_error_trans_all(adev); >>>> adev->in_gpu_reset = 0; >>>> mutex_unlock(&adev->lock_reset); >>>> +} >>>> + >>>> + >>>> +/** >>>> + * amdgpu_device_gpu_recover - reset the asic and recover scheduler >>>> + * >>>> + * @adev: amdgpu device pointer >>>> + * @job: which job trigger hang >>>> + * >>>> + * Attempt to reset the GPU if it has hung (all asics). >>>> + * Attempt to do soft-reset or full-reset and reinitialize Asic >>>> + * Returns 0 for success or an error on failure. >>>> + */ >>>> + >>>> +int amdgpu_device_gpu_recover(struct amdgpu_device *adev, >>>> + struct amdgpu_job *job) >>>> +{ >>>> + int r; >>>> + struct amdgpu_hive_info *hive = NULL; >>>> + bool need_full_reset = false; >>>> + struct amdgpu_device *tmp_adev = NULL; >>>> + struct list_head device_list, *device_list_handle = NULL; >>>> + >>>> + INIT_LIST_HEAD(&device_list); >>>> + >>>> + dev_info(adev->dev, "GPU reset begin!\n"); >>>> + >>>> + /* >>>> + * In case of XGMI hive disallow concurrent resets to be >>>> triggered >>>> + * by different nodes. 
>>>> + */ >>>> + if (adev->gmc.xgmi.num_physical_nodes > 1) >>>> + mutex_lock(&xgmi_mutex); >>>> + >>>> + /* Start with adev pre asic reset first for soft reset check.*/ >>>> + amdgpu_lock_adev(adev); >>>> + r = amdgpu_device_pre_asic_reset(adev, >>>> + job, >>>> + &need_full_reset); >>>> + if (r) { >>>> + /*TODO Should we stop ?*/ >>>> + DRM_ERROR("GPU pre asic reset failed with err, %d for drm >>>> dev, %s ", >>>> + r, adev->ddev->unique); >>>> + adev->asic_reset_res = r; >>>> + } >>>> + >>>> + /* Build list of devices to reset */ >>>> + if (need_full_reset && adev->gmc.xgmi.num_physical_nodes > 1) { >>>> + hive = amdgpu_get_xgmi_hive(adev); >>>> + if (!hive) { >>>> + r = -ENODEV; >>>> + >>>> + amdgpu_unlock_adev(adev); >>>> + >>>> + if (adev->gmc.xgmi.num_physical_nodes > 1) >>>> + mutex_unlock(&xgmi_mutex); >>>> + return r; >>>> + } >>>> + >>>> + /* >>>> + * In case we are in XGMI hive mode device reset is done >>>> for all the >>>> + * nodes in the hive to retrain all XGMI links and hence >>>> the reset >>>> + * sequence is executed in loop on all nodes. >>>> + */ >>>> + device_list_handle = amdgpu_xgmi_get_adev_list_handle(hive); >>>> + } else { >>>> + list_add_tail(&adev->gmc.xgmi.head, &device_list); >>>> + device_list_handle = &device_list; >>>> + } >>>> + >>>> +retry: /* Rest of adevs pre asic reset from XGMI hive. */ >>>> + list_for_each_entry(tmp_adev, device_list_handle, >>>> gmc.xgmi.head) { >>>> + >>>> + if (tmp_adev == adev) >>>> + continue; >>>> + >>>> + dev_info(tmp_adev->dev, "GPU reset begin for drm dev >>>> %s!\n", adev->ddev->unique); >>>> + >>>> + amdgpu_lock_adev(tmp_adev); >>>> + r = amdgpu_device_pre_asic_reset(tmp_adev, >>>> + NULL, >>>> + &need_full_reset); >>>> + /*TODO Should we stop ?*/ >>>> + if (r) { >>>> + DRM_ERROR("GPU pre asic reset failed with err, %d for >>>> drm dev, %s ", >>>> + r, tmp_adev->ddev->unique); >>>> + tmp_adev->asic_reset_res = r; >>>> + } >>>> + } >>>> + >>>> + /* Actual ASIC resets if needed.*/ >>>> + /* TODO Implement XGMI hive reset logic for SRIOV */ >>>> + if (amdgpu_sriov_vf(adev)) { >>>> + r = amdgpu_device_reset_sriov(adev, job ? false : true); >>>> + if (r) >>>> + adev->asic_reset_res = r; >>>> + } else { >>>> + r = amdgpu_do_asic_reset(hive, device_list_handle, >>>> &need_full_reset); >>>> + if (r && r == -EAGAIN) >>>> + goto retry; >>>> + } >>>> + >>>> + /* Post ASIC reset for all devs .*/ >>>> + list_for_each_entry(tmp_adev, device_list_handle, >>>> gmc.xgmi.head) { >>>> + amdgpu_device_post_asic_reset(tmp_adev, tmp_adev == adev ? >>>> job : NULL); >>>> + >>>> + if (r) { >>>> + /* bad news, how to tell it to userspace ? */ >>>> + dev_info(tmp_adev->dev, "GPU reset(%d) failed\n", >>>> atomic_read(&adev->gpu_reset_counter)); >>>> + amdgpu_vf_error_put(tmp_adev, >>>> AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r); >>>> + } else { >>>> + dev_info(tmp_adev->dev, "GPU reset(%d) succeeded!\n", >>>> atomic_read(&adev->gpu_reset_counter)); >>>> + } >>>> + >>>> + amdgpu_unlock_adev(tmp_adev); >>>> + } >>>> + >>>> + if (adev->gmc.xgmi.num_physical_nodes > 1) >>>> + mutex_unlock(&xgmi_mutex); >>>> + >>>> + if (r) >>>> + dev_info(adev->dev, "GPU reset end with ret = %d\n", r); >>>> return r; >>>> } >> _______________________________________________ >> amd-gfx mailing list >> amd-gfx@xxxxxxxxxxxxxxxxxxxxx >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx > _______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx
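
P.S. Here is the rough, untested sketch of the per-hive work_struct idea
mentioned above. It is illustrative only and not part of the posted patch:
the reset_work/reset_adev members, amdgpu_hive_reset_work_func() and
amdgpu_hive_schedule_reset() are hypothetical names I made up for the
example; amdgpu_get_xgmi_hive(), amdgpu_device_gpu_recover() and
schedule_work() are the existing interfaces.

/*
 * Illustrative only. Assumes two new members on struct amdgpu_hive_info
 * in amdgpu_xgmi.h:
 *
 *	struct work_struct	reset_work;
 *	struct amdgpu_device	*reset_adev;	- node that requested the reset
 *
 * INIT_WORK(&hive->reset_work, amdgpu_hive_reset_work_func) would be done
 * where the hive object is created, i.e. in amdgpu_get_xgmi_hive().
 */

static void amdgpu_hive_reset_work_func(struct work_struct *work)
{
	struct amdgpu_hive_info *hive =
		container_of(work, struct amdgpu_hive_info, reset_work);

	/*
	 * The worker runs asynchronously, so no guilty job pointer is
	 * available here - recover based on device state only (job == NULL),
	 * same as the manual/sysfs path.
	 */
	amdgpu_device_gpu_recover(hive->reset_adev, NULL);
}

/*
 * Single entry point for job timeout, RAS interrupt bottom half and
 * sysfs-triggered resets. schedule_work() returns false if the work item
 * is already pending, which gives the per-hive serialization for free.
 */
static void amdgpu_hive_schedule_reset(struct amdgpu_device *adev)
{
	struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev);

	if (!hive) {
		/* Not part of a hive - recover just this device. */
		amdgpu_device_gpu_recover(adev, NULL);
		return;
	}

	hive->reset_adev = adev;
	if (!schedule_work(&hive->reset_work))
		DRM_INFO("hive reset already pending, request merged\n");
}

One limitation of this sketch: schedule_work() only de-duplicates while the
work item is still pending, so a request that arrives after the worker has
started will queue another full reset - essentially the same race Shaoyun
pointed out for the mutex, which is why a "has the GPU already recovered"
check at the start of the worker would still be useful.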