Re: [PATCH v2 16/34] drm/msm: Mark VM as unusable on faults

Connor Abbott <cwabbott0@xxxxxxxxx> · Wed, 19 Mar 2025 12:15:36 -0400

On Wed, Mar 19, 2025 at 10:55 AM Rob Clark <robdclark@xxxxxxxxx> wrote:
>
> From: Rob Clark <robdclark@xxxxxxxxxxxx>
>
> If userspace has opted-in to VM_BIND, then GPU faults and VM_BIND errors
> will mark the VM as unusable.
>
> Signed-off-by: Rob Clark <robdclark@xxxxxxxxxxxx>
> ---
>  drivers/gpu/drm/msm/msm_gem.h        | 17 +++++++++++++++++
>  drivers/gpu/drm/msm/msm_gem_submit.c |  3 +++
>  drivers/gpu/drm/msm/msm_gpu.c        | 16 ++++++++++++++--
>  3 files changed, 34 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/msm/msm_gem.h b/drivers/gpu/drm/msm/msm_gem.h
> index acb976722580..7cb720137548 100644
> --- a/drivers/gpu/drm/msm/msm_gem.h
> +++ b/drivers/gpu/drm/msm/msm_gem.h
> @@ -82,6 +82,23 @@ struct msm_gem_vm {
>
>         /** @managed: is this a kernel managed VM? */
>         bool managed;
> +
> +       /**
> +        * @unusable: True if the VM has turned unusable because something
> +        * bad happened during an asynchronous request.
> +        *
> +        * We don't try to recover from such failures, because this implies
> +        * informing userspace about the specific operation that failed, and
> +        * hoping the userspace driver can replay things from there. This all
> +        * sounds very complicated for little gain.
> +        *
> +        * Instead, we should just flag the VM as unusable, and fail any
> +        * further request targeting this VM.
> +        *
> +        * As an analogy, this would be mapped to a VK_ERROR_DEVICE_LOST
> +        * situation, where the logical device needs to be re-created.
> +        */
> +       bool unusable;
>  };
>  #define to_msm_vm(x) container_of(x, struct msm_gem_vm, base)
>
> diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
> index 9731ad7993cf..9cef308a0ad1 100644
> --- a/drivers/gpu/drm/msm/msm_gem_submit.c
> +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
> @@ -668,6 +668,9 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
>         if (args->pad)
>                 return -EINVAL;
>
> +       if (to_msm_vm(ctx->vm)->unusable)
> +               return UERR(EPIPE, dev, "context is unusable");
> +
>         /* for now, we just have 3d pipe.. eventually this would need to
>          * be more clever to dispatch to appropriate gpu module:
>          */
> diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
> index 503e4dcc5a6f..4831f4e42fd9 100644
> --- a/drivers/gpu/drm/msm/msm_gpu.c
> +++ b/drivers/gpu/drm/msm/msm_gpu.c
> @@ -386,8 +386,20 @@ static void recover_worker(struct kthread_work *work)
>
>         /* Increment the fault counts */
>         submit->queue->faults++;
> -       if (submit->vm)
> -               to_msm_vm(submit->vm)->faults++;
> +       if (submit->vm) {
> +               struct msm_gem_vm *vm = to_msm_vm(submit->vm);
> +
> +               vm->faults++;
> +
> +               /*
> +                * If userspace has opted-in to VM_BIND (and therefore userspace
> +                * management of the VM), faults mark the VM as unusuable.  This
> +                * matches vulkan expectations (vulkan is the main target for
> +                * VM_BIND)

The bit about this matching Vulkan expectations isn't exactly true.
Some Vulkan implementations do do this, but many will also just ignore
the fault and try to continue going, and the spec allows either. It's
a choice that we're making.

Connor

> +                */
> +               if (!vm->managed)
> +                       vm->unusable = true;
> +       }
>
>         get_comm_cmdline(submit, &comm, &cmd);
>
> --
> 2.48.1
>