Re: [PATCH 1/2] drm/sched: Add error code parameter to drm_sched_start

Alex Deucher <alexdeucher@xxxxxxxxx> · Thu, 25 Jul 2024 12:30:07 -0400



On Wed, Jul 24, 2024 at 11:30 PM <vitaly.prosyak@xxxxxxx> wrote:
>
> From: Vitaly Prosyak <vitaly.prosyak@xxxxxxx>
>
> The current implementation of drm_sched_start uses a hardcoded -ECANCELED to dispose of a job when
> the parent/hw fence is NULL. This results in drm_sched_job_done being called with -ECANCELED for
> each job with a NULL parent in the pending list, making it difficult to distinguish between recovery
> methods, whether a queue reset or a full GPU reset was used.
>
> To improve this, we first try a soft recovery for timeout jobs and use the error code -ENODATA.
> If soft recovery fails, we proceed with a queue reset, where the error code remains -ENODATA for
> the job. Finally, for a full GPU reset, we use error codes -ECANCELED or -ETIME. This patch adds
> an error code parameter to drm_sched_start, allowing us to differentiate between queue reset and
> GPU reset failures. This enables user mode and test applications to validate the expected
> correctness of the requested operation. After a successful queue reset, the only way to continue
> normal operation is to call drm_sched_job_done with the specific error code -ENODATA.
>
> v1: Initial implementation by Jesse utilized amdgpu_device_lock_reset_domain and
>     amdgpu_device_unlock_reset_domain to allow user mode to track the queue reset status and distinguish
>     between queue reset and GPU reset.
> v2: Christian suggested using the error codes -ENODATA for queue reset and -ECANCELED or -ETIME for GPU
>     reset, returned to amdgpu_cs_wait_ioctl.
> v3: To meet the requirements, we introduce a new function drm_sched_start_ex with an additional parameter
>     to set dma_fence_set_error, allowing us to handle the specific error codes appropriately and dispose
>     of bad jobs with the selected error code depending on whether it was a queue reset or GPU reset.
> v4: Alex suggested using a new name, drm_sched_start_with_recovery_error, which more accurately describes
>     the function's purpose. Additionally, it was recommended to add documentation details about the new method.
>
> Cc: Alex Deucher <alexander.deucher@xxxxxxx>
> Cc: Christian Koenig <christian.koenig@xxxxxxx>
> Signed-off-by: Jesse Zhang <Jesse.Zhang@xxxxxxx>
> Signed-off-by: Vitaly Prosyak <vitaly.prosyak@xxxxxxx>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 30 +++++++++++++++++++++++---
>  include/drm/gpu_scheduler.h            |  1 +
>  2 files changed, 28 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 7e90c9f95611..c42449358b3f 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -671,13 +671,24 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>  EXPORT_SYMBOL(drm_sched_stop);
>
>  /**
> - * drm_sched_start - recover jobs after a reset
> + * drm_sched_start_with_recovery_error - recover jobs after a reset with
> + * custom error
>   *
>   * @sched: scheduler instance
>   * @full_recovery: proceed with complete sched restart
> + * @error : err code for set dma_fence_set_error
> + *
> + * Starts the scheduler and allows setting a custom dma_fence_set_error,
> + * which can be used to identify the recovery mechanism actually used.
>   *
> + * For example:
> + * - If a soft or queue reset was used, dma_fence_set_error is set to -ENODATA.
> + * - If an entire GPU reset was used, the error code is set to -ECANCELED.
> + *
> + * This approach enables user mode and test applications to know which
> + * recovery method was used for a given bad job.
>   */
> -void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
> +void drm_sched_start_with_recovery_error(struct drm_gpu_scheduler *sched, bool full_recovery, int error)
>  {
>         struct drm_sched_job *s_job, *tmp;
>         int r;
> @@ -704,7 +715,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>                                 DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n",
>                                           r);
>                 } else
> -                       drm_sched_job_done(s_job, -ECANCELED);
> +                       drm_sched_job_done(s_job, error);
>         }
>
>         if (full_recovery)
> @@ -712,6 +723,19 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>
>         drm_sched_wqueue_start(sched);
>  }
> +EXPORT_SYMBOL(drm_sched_start_with_recovery_error);
> +
> +/**
> + * drm_sched_start - recover jobs after a reset
> + *
> + * @sched: scheduler instance
> + * @full_recovery: proceed with complete sched restart
> + *
> + */
> +void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
> +{
> +       drm_sched_start_with_recovery_error(sched, full_recovery, -ECANCELED);
> +}
>  EXPORT_SYMBOL(drm_sched_start);
>
>  /**
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 5acc64954a88..444fa6761590 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -580,6 +580,7 @@ void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched);
>  void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched);
>  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
>  void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery);
> +void drm_sched_start_ex(struct drm_gpu_scheduler *sched, bool full_recovery, int error);

drm_sched_start_with_recovery_error()

Alex

>  void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
>  void drm_sched_increase_karma(struct drm_sched_job *bad);
>  void drm_sched_reset_karma(struct drm_sched_job *bad);
> --
> 2.25.1
>