Re: [PATCH] drm/nouveau: sched: fix leaking memory of timedout job

Lyude Paul <lyude@xxxxxxxxxx> · Mon, 18 Sep 2023 18:19:13 -0400

BTW, would you like me to review work like this? I'm totally happy to do
that, although I'm not terribly familiar with these parts of nouveau/drm (but
I'm always willing to learn, and would like to know more about these areas
anyway) :)

If the answer is yes, this patch looks fine to me so far. The one question I
have, which might have an obvious answer: how are jobs without a
job->ops->timeout callback cleaned up?
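
To make the question concrete, here is a rough userspace model of how I read
the control flow of the patched handler quoted below. It is only a sketch:
every model_* name and the STAT_* enum are hypothetical stand-ins, not the
real DRM scheduler API, and the stop/start helpers only count calls instead
of doing what drm_sched_stop() and drm_sched_start() do in the kernel.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for the kernel types; not the real DRM API. */
enum sched_stat { STAT_NOMINAL, STAT_ENODEV };

struct model_sched {
	bool running;
	int stops;
	int starts;
};

struct model_job;

struct model_job_ops {
	/* Optional; NULL models a job type without a timeout callback. */
	enum sched_stat (*timeout)(struct model_job *job);
};

struct model_job {
	struct model_sched *sched;
	const struct model_job_ops *ops;
};

/* Models the stop step: only records that the scheduler was stopped. */
static void model_sched_stop(struct model_sched *sched)
{
	sched->running = false;
	sched->stops++;
}

/* Models the start step: only records that the scheduler was restarted. */
static void model_sched_start(struct model_sched *sched)
{
	sched->running = true;
	sched->starts++;
}

/*
 * Models a job type specific callback. Per the commit message, the real
 * exec callback signals all fences and tears down the channel; none of
 * that is modelled here.
 */
static enum sched_stat model_exec_timeout(struct model_job *job)
{
	(void)job;
	return STAT_NOMINAL;
}

/* Mirrors the shape of the patched nouveau_sched_timedout_job(). */
static enum sched_stat model_timedout_job(struct model_job *job)
{
	enum sched_stat stat = STAT_NOMINAL;

	model_sched_stop(job->sched);

	if (job->ops->timeout)
		stat = job->ops->timeout(job);
	else
		fprintf(stderr, "Generic job timeout.\n");

	/* Runs on both paths, with or without a callback. */
	model_sched_start(job->sched);

	return stat;
}
```

In this model a job whose ops->timeout is NULL still goes through the stop and
restart pair and returns the nominal status, so if I understand the commit
message correctly, such a job relies on the restart alone to be freed once it
has signaled.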

On Sat, 2023-09-16 at 18:28 +0200, Danilo Krummrich wrote:
> Always stop and re-start the scheduler in order to let the scheduler
> free up the timedout job in case it got signaled. In case of exec jobs
> the job type specific callback will take care to signal all fences and
> tear down the channel.
> 
> Fixes: b88baab82871 ("drm/nouveau: implement new VM_BIND uAPI")
> Signed-off-by: Danilo Krummrich <dakr@xxxxxxxxxx>
> ---
>  drivers/gpu/drm/nouveau/nouveau_exec.c  |  2 +-
>  drivers/gpu/drm/nouveau/nouveau_sched.c | 12 +++++++++---
>  2 files changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/nouveau/nouveau_exec.c b/drivers/gpu/drm/nouveau/nouveau_exec.c
> index 9c031d15fe0b..49d83ac9e036 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_exec.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_exec.c
> @@ -185,7 +185,7 @@ nouveau_exec_job_timeout(struct nouveau_job *job)
>  
>  	nouveau_sched_entity_fini(job->entity);
>  
> -	return DRM_GPU_SCHED_STAT_ENODEV;
> +	return DRM_GPU_SCHED_STAT_NOMINAL;
>  }
>  
>  static struct nouveau_job_ops nouveau_exec_job_ops = {
> diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
> index 88217185e0f3..3b7ea5221226 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_sched.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
> @@ -375,14 +375,20 @@ nouveau_sched_run_job(struct drm_sched_job *sched_job)
>  static enum drm_gpu_sched_stat
>  nouveau_sched_timedout_job(struct drm_sched_job *sched_job)
>  {
> +	struct drm_gpu_scheduler *sched = sched_job->sched;
>  	struct nouveau_job *job = to_nouveau_job(sched_job);
> +	enum drm_gpu_sched_stat stat = DRM_GPU_SCHED_STAT_NOMINAL;
>  
> -	NV_PRINTK(warn, job->cli, "Job timed out.\n");
> +	drm_sched_stop(sched, sched_job);
>  
>  	if (job->ops->timeout)
> -		return job->ops->timeout(job);
> +		stat = job->ops->timeout(job);
> +	else
> +		NV_PRINTK(warn, job->cli, "Generic job timeout.\n");
> +
> +	drm_sched_start(sched, true);
>  
> -	return DRM_GPU_SCHED_STAT_ENODEV;
> +	return stat;
>  }
>  
>  static void

-- 
Cheers,
 Lyude Paul (she/her)
 Software Engineer at Red Hat