On Fri, Aug 4, 2023 at 3:03 PM Daniel Vetter <daniel@xxxxxxxx> wrote: > > On Tue, Jun 27, 2023 at 10:23:23AM -0300, André Almeida wrote: > > Create a section that specifies how to deal with DRM device resets for > > kernel and userspace drivers. > > > > Acked-by: Pekka Paalanen <pekka.paalanen@xxxxxxxxxxxxx> > > Signed-off-by: André Almeida <andrealmeid@xxxxxxxxxx> > > --- > > > > v4: https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@xxxxxxxxxx/ > > > > Changes: > > - Grammar fixes (Randy) > > > > Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++ > > 1 file changed, 68 insertions(+) > > > > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst > > index 65fb3036a580..3cbffa25ed93 100644 > > --- a/Documentation/gpu/drm-uapi.rst > > +++ b/Documentation/gpu/drm-uapi.rst > > @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for > > mmapped regular files. Threads cause additional pain with signal > > handling as well. > > > > +Device reset > > +============ > > + > > +The GPU stack is really complex and is prone to errors, from hardware bugs, > > +faulty applications and everything in between the many layers. Some errors > > +require resetting the device in order to make the device usable again. This > > +sections describes the expectations for DRM and usermode drivers when a > > +device resets and how to propagate the reset status. > > + > > +Kernel Mode Driver > > +------------------ > > + > > +The KMD is responsible for checking if the device needs a reset, and to perform > > +it as needed. Usually a hang is detected when a job gets stuck executing. KMD > > +should keep track of resets, because userspace can query any time about the > > +reset stats for an specific context. This is needed to propagate to the rest of > > +the stack that a reset has happened. Currently, this is implemented by each > > +driver separately, with no common DRM interface. > > + > > +User Mode Driver > > +---------------- > > + > > +The UMD should check before submitting new commands to the KMD if the device has > > +been reset, and this can be checked more often if the UMD requires it. After > > +detecting a reset, UMD will then proceed to report it to the application using > > +the appropriate API error code, as explained in the section below about > > +robustness. > > + > > +Robustness > > +---------- > > + > > +The only way to try to keep an application working after a reset is if it > > +complies with the robustness aspects of the graphical API that it is using. > > + > > +Graphical APIs provide ways to applications to deal with device resets. However, > > +there is no guarantee that the app will use such features correctly, and the > > +UMD can implement policies to close the app if it is a repeating offender, > > Not sure whether this one here is due to my input, but s/UMD/KMD. Repeat > offender killing is more a policy where the kernel enforces policy, and no > longer up to userspace to dtrt (because very clearly userspace is not > really doing the right thing anymore when it's just hanging the gpu in an > endless loop). Also maybe tune it down further to something like "the > kernel driver may implemnent ..." > > In my opinion the umd shouldn't implement these kind of magic guesses, the > entire point of robustness apis is to delegate responsibility for > correctly recovering to the application. And the kernel is left with > enforcing fair resource usage policies (which eventually might be a > cgroups limit on how much gpu time you're allowed to waste with gpu > resets). Killing apps that the kernel thinks are misbehaving really doesn't seem like a good idea to me. What if the process is a service getting restarted after getting killed? What if killing that process leaves the system in a bad state? Can't the kernel provide some information to user space so that e.g. systemd can handle those situations? > > +likely in a broken loop. This is done to ensure that it does not keep blocking > > +the user interface from being correctly displayed. This should be done even if > > +the app is correct but happens to trigger some bug in the hardware/driver. > > + > > +OpenGL > > +~~~~~~ > > + > > +Apps using OpenGL should use the available robust interfaces, like the > > +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This > > +interface tells if a reset has happened, and if so, all the context state is > > +considered lost and the app proceeds by creating new ones. If it is possible to > > +determine that robustness is not in use, the UMD will terminate the app when a > > +reset is detected, giving that the contexts are lost and the app won't be able > > +to figure this out and recreate the contexts. > > + > > +Vulkan > > +~~~~~~ > > + > > +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions. > > +This error code means, among other things, that a device reset has happened and > > +it needs to recreate the contexts to keep going. > > + > > +Reporting causes of resets > > +-------------------------- > > + > > +Apart from propagating the reset through the stack so apps can recover, it's > > +really useful for driver developers to learn more about what caused the reset in > > +first place. DRM devices should make use of devcoredump to store relevant > > +information about the reset, so this information can be added to user bug > > +reports. > > Since we do not seem to have a solid consensus in the community about > non-robust userspace, maybe we could just document that lack of consensus > to unblock this patch? Something like this: > > Non-Robust Userspace > -------------------- > > Userspace that doesn't support robust interfaces (like an non-robust > OpenGL context or API without any robustness support like libva) leave the > robustness handling entirely to the userspace driver. There is no strong > community consensus on what the userspace driver should do in that case, > since all reasonable approaches have some clear downsides. > > With the s/UMD/KMD/ further up and maybe something added to record the > non-robustness non-consensus: > > Acked-by: Daniel Vetter <daniel.vetter@xxxxxxxx> > > Cheers, Daniel > > > > > + > > .. _drm_driver_ioctl: > > > > IOCTL Support on Device Nodes > > -- > > 2.41.0 > > > > -- > Daniel Vetter > Software Engineer, Intel Corporation > http://blog.ffwll.ch >