On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick <sebastian.wick@xxxxxxxxxx> wrote:
>
> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@xxxxxxxxxx> wrote:
> >
> > Create a section that specifies how to deal with DRM device resets for
> > kernel and userspace drivers.
> >
> > Acked-by: Pekka Paalanen <pekka.paalanen@xxxxxxxxxxxxx>
> > Signed-off-by: André Almeida <andrealmeid@xxxxxxxxxx>
> > ---
> >
> > v4: https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@xxxxxxxxxx/
> >
> > Changes:
> >  - Grammar fixes (Randy)
> >
> >  Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
> >  1 file changed, 68 insertions(+)
> >
> > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > index 65fb3036a580..3cbffa25ed93 100644
> > --- a/Documentation/gpu/drm-uapi.rst
> > +++ b/Documentation/gpu/drm-uapi.rst
> > @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for
> >  mmapped regular files. Threads cause additional pain with signal
> >  handling as well.
> >
> > +Device reset
> > +============
> > +
> > +The GPU stack is really complex and is prone to errors, from hardware bugs,
> > +faulty applications and everything in between the many layers. Some errors
> > +require resetting the device in order to make the device usable again. This
> > +sections describes the expectations for DRM and usermode drivers when a
> > +device resets and how to propagate the reset status.
> > +
> > +Kernel Mode Driver
> > +------------------
> > +
> > +The KMD is responsible for checking if the device needs a reset, and to perform
> > +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> > +should keep track of resets, because userspace can query any time about the
> > +reset stats for an specific context. This is needed to propagate to the rest of
> > +the stack that a reset has happened. Currently, this is implemented by each
> > +driver separately, with no common DRM interface.
> > +
> > +User Mode Driver
> > +----------------
> > +
> > +The UMD should check before submitting new commands to the KMD if the device has
> > +been reset, and this can be checked more often if the UMD requires it. After
> > +detecting a reset, UMD will then proceed to report it to the application using
> > +the appropriate API error code, as explained in the section below about
> > +robustness.
> > +
> > +Robustness
> > +----------
> > +
> > +The only way to try to keep an application working after a reset is if it
> > +complies with the robustness aspects of the graphical API that it is using.
> > +
> > +Graphical APIs provide ways to applications to deal with device resets. However,
> > +there is no guarantee that the app will use such features correctly, and the
> > +UMD can implement policies to close the app if it is a repeating offender,
> > +likely in a broken loop. This is done to ensure that it does not keep blocking
> > +the user interface from being correctly displayed. This should be done even if
> > +the app is correct but happens to trigger some bug in the hardware/driver.
>
> I still don't think it's good to let the kernel arbitrarily kill
> processes that it thinks are not well-behaved based on some heuristics
> and policy.
>
> Can't this be outsourced to user space? Expose the information about
> processes causing a device and let e.g. systemd deal with coming up
> with a policy and with killing stuff.

I don't think it's the kernel doing the killing, it would be the UMD.
E.g., if the app is guilty and doesn't support robustness the UMD can
just call exit().

Alex

>
> > +
> > +OpenGL
> > +~~~~~~
> > +
> > +Apps using OpenGL should use the available robust interfaces, like the
> > +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
> > +interface tells if a reset has happened, and if so, all the context state is
> > +considered lost and the app proceeds by creating new ones. If it is possible to
> > +determine that robustness is not in use, the UMD will terminate the app when a
> > +reset is detected, giving that the contexts are lost and the app won't be able
> > +to figure this out and recreate the contexts.
> > +
> > +Vulkan
> > +~~~~~~
> > +
> > +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> > +This error code means, among other things, that a device reset has happened and
> > +it needs to recreate the contexts to keep going.
> > +
> > +Reporting causes of resets
> > +--------------------------
> > +
> > +Apart from propagating the reset through the stack so apps can recover, it's
> > +really useful for driver developers to learn more about what caused the reset in
> > +first place. DRM devices should make use of devcoredump to store relevant
> > +information about the reset, so this information can be added to user bug
> > +reports.
> > +
> >  .. _drm_driver_ioctl:
> >
> >  IOCTL Support on Device Nodes
> > --
> > 2.41.0
> >
>
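
To illustrate the UMD-side policy discussed above (the UMD, not the
kernel, decides what happens to the app), here is a rough sketch in C.
It is not taken from any real driver: every name in it (umd_context,
umd_check_reset, umd_before_submit, the enum values) is made up, and it
assumes the UMD can obtain a per-context reset count from its KMD, which
is driver specific since there is no common DRM interface for this.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical names throughout; no real driver exposes this exact API. */
enum umd_reset_action {
	UMD_CONTINUE,      /* no new reset seen, keep submitting */
	UMD_REPORT_ERROR,  /* robust app: return the API's reset error code */
	UMD_TERMINATE_APP, /* non-robust app: it cannot recover its contexts */
};

struct umd_context {
	bool robust;              /* app opted in to a robustness interface */
	unsigned int seen_resets; /* reset count last observed by the UMD */
};

/*
 * Decide what to do before a submission, given the reset count the KMD
 * currently reports for this context (assumed to be queried elsewhere).
 */
enum umd_reset_action
umd_check_reset(struct umd_context *ctx, unsigned int kmd_resets)
{
	if (kmd_resets == ctx->seen_resets)
		return UMD_CONTINUE;

	/* Remember the new count so the same reset is reported only once. */
	ctx->seen_resets = kmd_resets;
	return ctx->robust ? UMD_REPORT_ERROR : UMD_TERMINATE_APP;
}

/* The termination itself happens in the UMD, inside the app's process. */
void umd_before_submit(struct umd_context *ctx, unsigned int kmd_resets)
{
	if (umd_check_reset(ctx, kmd_resets) == UMD_TERMINATE_APP) {
		fprintf(stderr, "GPU reset and app is not robust, exiting\n");
		exit(EXIT_FAILURE);
	}
}
```

Keeping the decision (umd_check_reset) apart from the exit() call is only
a choice made for this sketch so the policy can be exercised on its own;
a real UMD would map UMD_REPORT_ERROR to whatever the API defines, e.g.
VK_ERROR_DEVICE_LOST for Vulkan.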