On 2021-07-28 at 1:31 p.m., Eric Huang wrote:
> It is to fix a bug of gpu_recovery on multiple GPUs,
> When one gpu is reset, the application running on other
> gpu hangs, because kfd post reset doesn't restore the
> running process.

This will resume all processes, even those that were affected by the
GPU reset. The assumption we made here is that KFD processes can use
all GPUs, so when one GPU is reset, all processes are affected.

If we want to refine that, we'll need some more complex logic to
identify the processes affected by a GPU reset and keep only those
processes suspended. This could be based on the GPUs that a process
has created queues on, or allocated memory on (see the first sketch
further down). What we don't want is processes continuing with bad
data or an inconsistent state after a GPU reset.

> And it also fixes a bug in the function
> kfd_process_evict_queues, when one gpu hangs, process
> running on other gpus can't be evicted.

This should be a separate patch.

The code you're removing was there as an attempt to make
kfd_process_evict_queues transactional. That means it either succeeds
completely or it fails completely. I wanted to avoid putting the
system into an unknown state where some queues are suspended and
others are not, and the caller has no idea how to proceed. So I roll
back a partial queue eviction if something failed.

You're changing this to "try to evict as much as possible". Then a
failure does not mean "eviction failed" but "eviction completed but
something hung", and the GPU reset can take care of the hanging part.
I think that's a reasonable strategy. But we need to be sure that
there are no other error conditions (other than hangs) that could
cause a queue eviction to fail.

There were some recent changes in pqm_destroy_queue that check the
return value of dqm->ops.destroy_queue, which indirectly comes from
execute_queues (same as in the eviction case). -ETIME means a hang.
Other errors are possible. Those other errors may still need the
roll-back. Otherwise we'd leave stuff running on a non-hanging GPU
after we promised that we evicted everything (see the second sketch
further down).

See one more comment inline.

>
> Signed-off-by: Eric Huang <jinhuieric.huang@xxxxxxx>
> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_device.c  |  2 +-
>  drivers/gpu/drm/amd/amdkfd/kfd_process.c | 24 +-----------------------
>  2 files changed, 2 insertions(+), 24 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index 24b5e0aa1eac..daf1c19bd799 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -984,7 +984,7 @@ int kgd2kfd_post_reset(struct kfd_dev *kfd)
>  	if (!kfd->init_complete)
>  		return 0;
>
> -	ret = kfd_resume(kfd);
> +	ret = kgd2kfd_resume(kfd, false, true);

Which branch is this for? On amd-staging-drm-next kgd2kfd_resume only
has two parameters.
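First sketch: for identifying affected processes, the check could look
roughly like this. This is hypothetical and untested; I'm assuming
qpd.queues_list is a good enough proxy for "has queues on this GPU",
and memory allocations would need a similar check:

	/* Hypothetical: return true if process p would be affected by a
	 * reset of dev, i.e. it has created queues on that GPU. Checking
	 * memory allocated on the GPU would need similar logic.
	 */
	static bool kfd_process_affected_by_reset(struct kfd_process *p,
						  struct kfd_dev *dev)
	{
		int i;

		for (i = 0; i < p->n_pdds; i++) {
			struct kfd_process_device *pdd = p->pdds[i];

			if (pdd->dev == dev &&
			    !list_empty(&pdd->qpd.queues_list))
				return true;
		}

		return false;
	}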
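Second sketch: for the eviction roll-back, this is what I have in
mind, untested. It keeps your "evict as much as possible" behaviour
for hangs, but keeps the roll-back for other errors. I'm assuming
-ETIME is the only return value that indicates a hang; otherwise the
structure is the code your patch removes:

	int kfd_process_evict_queues(struct kfd_process *p)
	{
		int r = 0, ret = 0;
		int i;
		unsigned int n_evicted = 0;

		for (i = 0; i < p->n_pdds; i++) {
			struct kfd_process_device *pdd = p->pdds[i];

			r = pdd->dev->dqm->ops.evict_process_queues(pdd->dev->dqm,
								    &pdd->qpd);
			if (r == -ETIME) {
				/* Hang: remember it, keep evicting on the
				 * other GPUs and let the GPU reset deal
				 * with this one.
				 */
				ret = r;
			} else if (r) {
				/* Real failure: roll back to a known state */
				pr_err("Failed to evict process queues\n");
				goto fail;
			}
			n_evicted++;
		}

		return ret;

	fail:
		/* To keep state consistent, roll back partial eviction by
		 * restoring queues
		 */
		for (i = 0; i < p->n_pdds; i++) {
			struct kfd_process_device *pdd = p->pdds[i];

			if (n_evicted == 0)
				break;
			if (pdd->dev->dqm->ops.restore_process_queues(pdd->dev->dqm,
								      &pdd->qpd))
				pr_err("Failed to restore queues\n");

			n_evicted--;
		}

		return r;
	}

That way a non-zero return still means "something hung and the reset
will clean it up", while any other failure leaves the process in a
known, fully restored state.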
Regards,
  Felix

>  	if (ret)
>  		return ret;
>  	atomic_dec(&kfd_locked);
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> index 38a9dee40785..9272a12c1db8 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> @@ -1879,36 +1879,14 @@ int kfd_process_evict_queues(struct kfd_process *p)
>  {
>  	int r = 0;
>  	int i;
> -	unsigned int n_evicted = 0;
>
>  	for (i = 0; i < p->n_pdds; i++) {
>  		struct kfd_process_device *pdd = p->pdds[i];
>
>  		r = pdd->dev->dqm->ops.evict_process_queues(pdd->dev->dqm,
>  							    &pdd->qpd);
> -		if (r) {
> +		if (r)
>  			pr_err("Failed to evict process queues\n");
> -			goto fail;
> -		}
> -		n_evicted++;
> -	}
> -
> -	return r;
> -
> -fail:
> -	/* To keep state consistent, roll back partial eviction by
> -	 * restoring queues
> -	 */
> -	for (i = 0; i < p->n_pdds; i++) {
> -		struct kfd_process_device *pdd = p->pdds[i];
> -
> -		if (n_evicted == 0)
> -			break;
> -		if (pdd->dev->dqm->ops.restore_process_queues(pdd->dev->dqm,
> -							      &pdd->qpd))
> -			pr_err("Failed to restore queues\n");
> -
> -		n_evicted--;
>  	}
>
>  	return r;