On 2021-07-28 at 1:31 p.m., Eric Huang wrote:
> It is to fix a bug of gpu_recovery on multiple GPUs,
> When one gpu is reset, the application running on other
> gpu hangs, because kfd post reset doesn't restore the
> running process.

This will resume all processes, even those that were affected by the
GPU reset. The assumption we made here is that KFD processes can use
all GPUs, so when one GPU is reset, all processes are affected.

If we want to refine that, we'll need some more complex logic to
identify the processes affected by a GPU reset and keep only those
processes suspended. This could be based on the GPUs that a process
has created queues on, or allocated memory on (see the first sketch
further down). What we don't want is processes continuing with bad
data or an inconsistent state after a GPU reset.

> And it also fixes a bug in the function
> kfd_process_evict_queues, when one gpu hangs, process
> running on other gpus can't be evicted.

This should be a separate patch.

The code you're removing was there as an attempt to make
kfd_process_evict_queues transactional. That means it either succeeds
completely or it fails completely. I wanted to avoid putting the
system into an unknown state where some queues are suspended and
others are not, and the caller has no idea how to proceed. So I roll
back a partial queue eviction if something failed.

You're changing this to "try to evict as much as possible". Then a
failure does not mean "eviction failed" but "eviction completed but
something hung", and the GPU reset can take care of the hanging part.
I think that's a reasonable strategy. But we need to be sure that
there are no other error conditions (other than hangs) that could
cause a queue eviction to fail.

There were some recent changes in pqm_destroy_queue that check the
return value of dqm->ops.destroy_queue, which indirectly comes from
execute_queues (same as in the eviction case). -ETIME means a hang.
Other errors are possible. Those other errors may still need the
roll-back. Otherwise we'd leave stuff running on a non-hanging GPU
after we promised that we evicted everything (see the second sketch
further down).

See one more comment inline.

>
> Signed-off-by: Eric Huang <jinhuieric.huang@xxxxxxx>
> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_device.c  |  2 +-
>  drivers/gpu/drm/amd/amdkfd/kfd_process.c | 24 +-----------------------
>  2 files changed, 2 insertions(+), 24 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> index 24b5e0aa1eac..daf1c19bd799 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
> @@ -984,7 +984,7 @@ int kgd2kfd_post_reset(struct kfd_dev *kfd)
>  	if (!kfd->init_complete)
>  		return 0;
>
> -	ret = kfd_resume(kfd);
> +	ret = kgd2kfd_resume(kfd, false, true);

Which branch is this for? On amd-staging-drm-next kgd2kfd_resume only
has two parameters.
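First sketch: for identifying affected processes, the check could look
roughly like this. This is hypothetical and untested; I'm assuming
qpd.queues_list is a good enough proxy for "has queues on this GPU",
and memory allocations would need a similar check:

	/* Hypothetical: return true if process p would be affected by a
	 * reset of dev, i.e. it has created queues on that GPU. Checking
	 * memory allocated on the GPU would need similar logic.
	 */
	static bool kfd_process_affected_by_reset(struct kfd_process *p,
						  struct kfd_dev *dev)
	{
		int i;

		for (i = 0; i < p->n_pdds; i++) {
			struct kfd_process_device *pdd = p->pdds[i];

			if (pdd->dev == dev &&
			    !list_empty(&pdd->qpd.queues_list))
				return true;
		}

		return false;
	}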
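Second sketch: for the eviction roll-back, this is what I have in
mind, untested. It keeps your "evict as much as possible" behaviour
for hangs, but keeps the roll-back for other errors. I'm assuming
-ETIME is the only return value that indicates a hang; otherwise the
structure is the code your patch removes:

	int kfd_process_evict_queues(struct kfd_process *p)
	{
		int r = 0, ret = 0;
		int i;
		unsigned int n_evicted = 0;

		for (i = 0; i < p->n_pdds; i++) {
			struct kfd_process_device *pdd = p->pdds[i];

			r = pdd->dev->dqm->ops.evict_process_queues(pdd->dev->dqm,
								    &pdd->qpd);
			if (r == -ETIME) {
				/* Hang: remember it, keep evicting on the
				 * other GPUs and let the GPU reset deal
				 * with this one.
				 */
				ret = r;
			} else if (r) {
				/* Real failure: roll back to a known state */
				pr_err("Failed to evict process queues\n");
				goto fail;
			}
			n_evicted++;
		}

		return ret;

	fail:
		/* To keep state consistent, roll back partial eviction by
		 * restoring queues
		 */
		for (i = 0; i < p->n_pdds; i++) {
			struct kfd_process_device *pdd = p->pdds[i];

			if (n_evicted == 0)
				break;
			if (pdd->dev->dqm->ops.restore_process_queues(pdd->dev->dqm,
								      &pdd->qpd))
				pr_err("Failed to restore queues\n");

			n_evicted--;
		}

		return r;
	}

That way a non-zero return still means "something hung and the reset
will clean it up", while any other failure leaves the process in a
known, fully restored state.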
Regards,
  Felix

>  	if (ret)
>  		return ret;
>  	atomic_dec(&kfd_locked);
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> index 38a9dee40785..9272a12c1db8 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> @@ -1879,36 +1879,14 @@ int kfd_process_evict_queues(struct kfd_process *p)
>  {
>  	int r = 0;
>  	int i;
> -	unsigned int n_evicted = 0;
>
>  	for (i = 0; i < p->n_pdds; i++) {
>  		struct kfd_process_device *pdd = p->pdds[i];
>
>  		r = pdd->dev->dqm->ops.evict_process_queues(pdd->dev->dqm,
>  							    &pdd->qpd);
> -		if (r) {
> +		if (r)
>  			pr_err("Failed to evict process queues\n");
> -			goto fail;
> -		}
> -		n_evicted++;
> -	}
> -
> -	return r;
> -
> -fail:
> -	/* To keep state consistent, roll back partial eviction by
> -	 * restoring queues
> -	 */
> -	for (i = 0; i < p->n_pdds; i++) {
> -		struct kfd_process_device *pdd = p->pdds[i];
> -
> -		if (n_evicted == 0)
> -			break;
> -		if (pdd->dev->dqm->ops.restore_process_queues(pdd->dev->dqm,
> -							      &pdd->qpd))
> -			pr_err("Failed to restore queues\n");
> -
> -		n_evicted--;
>  	}
>
>  	return r;