Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations

Marek Olšák <maraeo@xxxxxxxxx> · Tue, 8 Aug 2023 13:03:11 -0400

It's the same situation as SIGSEGV. A process can catch the signal,
but if it doesn't, it gets killed. GL and Vulkan APIs give you a way
to catch the GPU error and prevent the process termination. If you
don't use the API, you'll get undefined behavior, which means anything
can happen, including process termination.

Marek

On Tue, Aug 8, 2023 at 8:14 AM Sebastian Wick <sebastian.wick@xxxxxxxxxx> wrote:
>
> On Fri, Aug 4, 2023 at 3:03 PM Daniel Vetter <daniel@xxxxxxxx> wrote:
> >
> > On Tue, Jun 27, 2023 at 10:23:23AM -0300, André Almeida wrote:
> > > Create a section that specifies how to deal with DRM device resets for
> > > kernel and userspace drivers.
> > >
> > > Acked-by: Pekka Paalanen <pekka.paalanen@xxxxxxxxxxxxx>
> > > Signed-off-by: André Almeida <andrealmeid@xxxxxxxxxx>
> > > ---
> > >
> > > v4: https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@xxxxxxxxxx/
> > >
> > > Changes:
> > >  - Grammar fixes (Randy)
> > >
> > >  Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
> > >  1 file changed, 68 insertions(+)
> > >
> > > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > > index 65fb3036a580..3cbffa25ed93 100644
> > > --- a/Documentation/gpu/drm-uapi.rst
> > > +++ b/Documentation/gpu/drm-uapi.rst
> > > @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for
> > >  mmapped regular files. Threads cause additional pain with signal
> > >  handling as well.
> > >
> > > +Device reset
> > > +============
> > > +
> > > +The GPU stack is really complex and is prone to errors, from hardware bugs,
> > > +faulty applications and everything in between the many layers. Some errors
> > > +require resetting the device in order to make the device usable again. This
> > > +sections describes the expectations for DRM and usermode drivers when a
> > > +device resets and how to propagate the reset status.
> > > +
> > > +Kernel Mode Driver
> > > +------------------
> > > +
> > > +The KMD is responsible for checking if the device needs a reset, and to perform
> > > +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> > > +should keep track of resets, because userspace can query any time about the
> > > +reset stats for an specific context. This is needed to propagate to the rest of
> > > +the stack that a reset has happened. Currently, this is implemented by each
> > > +driver separately, with no common DRM interface.
> > > +
> > > +User Mode Driver
> > > +----------------
> > > +
> > > +The UMD should check before submitting new commands to the KMD if the device has
> > > +been reset, and this can be checked more often if the UMD requires it. After
> > > +detecting a reset, UMD will then proceed to report it to the application using
> > > +the appropriate API error code, as explained in the section below about
> > > +robustness.
> > > +
> > > +Robustness
> > > +----------
> > > +
> > > +The only way to try to keep an application working after a reset is if it
> > > +complies with the robustness aspects of the graphical API that it is using.
> > > +
> > > +Graphical APIs provide ways to applications to deal with device resets. However,
> > > +there is no guarantee that the app will use such features correctly, and the
> > > +UMD can implement policies to close the app if it is a repeating offender,
> >
> > Not sure whether this one here is due to my input, but s/UMD/KMD. Repeat
> > offender killing is more a policy where the kernel enforces policy, and no
> > longer up to userspace to dtrt (because very clearly userspace is not
> > really doing the right thing anymore when it's just hanging the gpu in an
> > endless loop). Also maybe tune it down further to something like "the
> > kernel driver may implemnent ..."
> >
> > In my opinion the umd shouldn't implement these kind of magic guesses, the
> > entire point of robustness apis is to delegate responsibility for
> > correctly recovering to the application. And the kernel is left with
> > enforcing fair resource usage policies (which eventually might be a
> > cgroups limit on how much gpu time you're allowed to waste with gpu
> > resets).
>
> Killing apps that the kernel thinks are misbehaving really doesn't
> seem like a good idea to me. What if the process is a service getting
> restarted after getting killed? What if killing that process leaves
> the system in a bad state?
>
> Can't the kernel provide some information to user space so that e.g.
> systemd can handle those situations?
>
> > > +likely in a broken loop. This is done to ensure that it does not keep blocking
> > > +the user interface from being correctly displayed. This should be done even if
> > > +the app is correct but happens to trigger some bug in the hardware/driver.
> > > +
> > > +OpenGL
> > > +~~~~~~
> > > +
> > > +Apps using OpenGL should use the available robust interfaces, like the
> > > +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
> > > +interface tells if a reset has happened, and if so, all the context state is
> > > +considered lost and the app proceeds by creating new ones. If it is possible to
> > > +determine that robustness is not in use, the UMD will terminate the app when a
> > > +reset is detected, giving that the contexts are lost and the app won't be able
> > > +to figure this out and recreate the contexts.
> > > +
> > > +Vulkan
> > > +~~~~~~
> > > +
> > > +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> > > +This error code means, among other things, that a device reset has happened and
> > > +it needs to recreate the contexts to keep going.
> > > +
> > > +Reporting causes of resets
> > > +--------------------------
> > > +
> > > +Apart from propagating the reset through the stack so apps can recover, it's
> > > +really useful for driver developers to learn more about what caused the reset in
> > > +first place. DRM devices should make use of devcoredump to store relevant
> > > +information about the reset, so this information can be added to user bug
> > > +reports.
> >
> > Since we do not seem to have a solid consensus in the community about
> > non-robust userspace, maybe we could just document that lack of consensus
> > to unblock this patch? Something like this:
> >
> > Non-Robust Userspace
> > --------------------
> >
> > Userspace that doesn't support robust interfaces (like an non-robust
> > OpenGL context or API without any robustness support like libva) leave the
> > robustness handling entirely to the userspace driver. There is no strong
> > community consensus on what the userspace driver should do in that case,
> > since all reasonable approaches have some clear downsides.
> >
> > With the s/UMD/KMD/ further up and maybe something added to record the
> > non-robustness non-consensus:
> >
> > Acked-by: Daniel Vetter <daniel.vetter@xxxxxxxx>
> >
> > Cheers, Daniel
> >
> >
> >
> > > +
> > >  .. _drm_driver_ioctl:
> > >
> > >  IOCTL Support on Device Nodes
> > > --
> > > 2.41.0
> > >
> >
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch
> >
>