On Tue, Oct 29, 2024 at 10:51:34AM +0100, Christian König wrote: > Am 25.10.24 um 10:48 schrieb Raag Jadav: > > Add documentation for device wedged event in a new 'Device wedging' > > chapter. The describes basic definitions and consumer expectations > > along with an example. > > > > v8: Improve documentation (Christian, Rodrigo) > > > > Signed-off-by: Raag Jadav <raag.jadav@xxxxxxxxx> > > --- > > Documentation/gpu/drm-uapi.rst | 75 ++++++++++++++++++++++++++++++++++ > > 1 file changed, 75 insertions(+) > > > > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst > > index 370d820be248..11a7446233b5 100644 > > --- a/Documentation/gpu/drm-uapi.rst > > +++ b/Documentation/gpu/drm-uapi.rst > > @@ -362,6 +362,81 @@ the first place. DRM devices should make use of devcoredump to store relevant > > information about the reset, so this information can be added to user bug > > reports. > > +Device wedging > > +============== > > + > > +Drivers can optionally make use of device wedged event (implemented as > > +drm_dev_wedged_event() in DRM subsystem) which notifies userspace of wedged > > +(hanged/unusable) state of the DRM device through a uevent. This is useful > > +especially in cases where the device is no longer operating as expected even > > +after a reset and has become unrecoverable from driver context. Purpose of > > +this implementation is to provide drivers a generic way to recover with the > > +help of userspace intervention without taking any drastic measures in the > > +driver. > > + > > +A 'wedged' device is basically a dead device that needs attention. The > > +uevent is the notification that is sent to userspace along with a hint about > > +what could possibly be attempted to recover the device and bring it back to > > +usable state. Different drivers may have different ideas of a 'wedged' device > > +depending on their hardware implementation, and hence the vendor agnostic > > +nature of the event. It is up to the drivers to decide when they see the need > > +for recovery and how they want to recover from the available methods. > > + > > +Recovery > > +-------- > > + > > +Current implementation defines two recovery methods, out of which, drivers > > +can use any one, both or none. Method(s) of choice will be sent in the uevent > > +environment as ``WEDGED=<method1>[,<method2>]`` in order of less to more side > > +effects. If driver is unsure about recovery or method is unknown (like reboot, > > +firmware flashing, hardware replacement or any other procedure which can't be > > +attempted on the fly), ``WEDGED=none`` will be sent instead. > > + > > +It is the responsibility of the driver to perform required cleanups (like > > +disabling system memory access or signalling dma_fences) and prepare itself > > +for the recovery before sending the event. > > That needs to be more like a warning and should have a bit more text. Maybe > even a separate section. > > Something like this maybe: > > Prerequisites > ------------------ > > Drivers needs to make sure that the device won't harm the system as a > whole by keeping it in a wedged state. Necessary actions must include > disabling DMA to system memory as well as communication channels > with other devices. > Further drivers must ensure that all dma_fences are signaled and other > state the core kernel might depend on are cleaned up. > All ongoing waiting for hardware state changes must be aborted and > new accesses to the hardware sufficiently blocked.... > > > I'm not a native speaker of English, so feel free to re-phrase that. But the Neither am I, but will try my best. > general approach should be like don't do this before you have made sure a, b > and c. Sure, thanks. > > Once the event is sent, driver > > +should block all IOCTLs with an error code. > > Better define which error code that should be. I think -ENODEV similar to a > hotplug case would be appropriate. Why not leave it to the drivers to decide depending on the type of failure? > > This will signify the reason for > > +wegeding which can be reported to the application if needed. > > + > > +Userspace consumers can parse this event and attempt recovery as per below > > +expectations. > > + > > + =============== ================================== > > + Recovery method Consumer expectations > > + =============== ================================== > > + rebind unbind + rebind driver > > + bus-reset unbind + reset bus device + rebind > > + none admin/user policy > > + =============== ================================== > > + > > +Example for rebind > > +~~~~~~~~~~~~~~~~~~ > > + > > +Udev rule:: > > + > > + SUBSYSTEM=="drm", ENV{WEDGED}=="rebind", DEVPATH=="*/drm/card[0-9]", > > + RUN+="/path/to/rebind.sh $env{DEVPATH}" > > + > > +Recovery script:: > > + > > + #!/bin/sh > > + > > + DEVPATH=$(readlink -f /sys/$1/device) > > + DEVICE=$(basename $DEVPATH) > > + DRIVER=$(readlink -f $DEVPATH/driver) > > + > > + echo -n $DEVICE > $DRIVER/unbind > > + sleep 1 > > + echo -n $DEVICE > $DRIVER/bind > > + > > +Although scripts are simple enough for basic recovery, admin/users can define > > +customized policies around recovery action. For example if the driver supports > > +multiple recovery methods, consumers can opt for the suitable one based on > > +policy definition. Consumers can also take additional steps like gathering > > +telemetry information (devcoredump, syslog) > > I'm really wondering if we shouldn't standardize successful resets with this > event as well? > > E.g. like there was an issue with device X, please collect the devcoredump. This seems to fit into WEDGED=none case, although with this we'd probably need to redefine what 'wedged' means. > And then as a second step have the WEDGED property to indicate: > a) reset successful, no actions needed. > b) detach and rebind from the bus > c) bus-reset > d) impossible to recover but system as a whole can continue to work. > e) it's on fire! Run sync and shut down power immediately. Raag