Re: [PATCH v8 2/4] drm/doc: Document device wedged event

Raag Jadav <raag.jadav@xxxxxxxxx> · Fri, 1 Nov 2024 06:26:21 +0200

On Tue, Oct 29, 2024 at 10:51:34AM +0100, Christian König wrote:
> Am 25.10.24 um 10:48 schrieb Raag Jadav:
> > Add documentation for device wedged event in a new 'Device wedging'
> > chapter. The describes basic definitions and consumer expectations
> > along with an example.
> > 
> > v8: Improve documentation (Christian, Rodrigo)
> > 
> > Signed-off-by: Raag Jadav <raag.jadav@xxxxxxxxx>
> > ---
> >   Documentation/gpu/drm-uapi.rst | 75 ++++++++++++++++++++++++++++++++++
> >   1 file changed, 75 insertions(+)
> > 
> > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > index 370d820be248..11a7446233b5 100644
> > --- a/Documentation/gpu/drm-uapi.rst
> > +++ b/Documentation/gpu/drm-uapi.rst
> > @@ -362,6 +362,81 @@ the first place. DRM devices should make use of devcoredump to store relevant
> >   information about the reset, so this information can be added to user bug
> >   reports.
> > +Device wedging
> > +==============
> > +
> > +Drivers can optionally make use of device wedged event (implemented as
> > +drm_dev_wedged_event() in DRM subsystem) which notifies userspace of wedged
> > +(hanged/unusable) state of the DRM device through a uevent. This is useful
> > +especially in cases where the device is no longer operating as expected even
> > +after a reset and has become unrecoverable from driver context. Purpose of
> > +this implementation is to provide drivers a generic way to recover with the
> > +help of userspace intervention without taking any drastic measures in the
> > +driver.
> > +
> > +A 'wedged' device is basically a dead device that needs attention. The
> > +uevent is the notification that is sent to userspace along with a hint about
> > +what could possibly be attempted to recover the device and bring it back to
> > +usable state. Different drivers may have different ideas of a 'wedged' device
> > +depending on their hardware implementation, and hence the vendor agnostic
> > +nature of the event. It is up to the drivers to decide when they see the need
> > +for recovery and how they want to recover from the available methods.
> > +
> > +Recovery
> > +--------
> > +
> > +Current implementation defines two recovery methods, out of which, drivers
> > +can use any one, both or none. Method(s) of choice will be sent in the uevent
> > +environment as ``WEDGED=<method1>[,<method2>]`` in order of less to more side
> > +effects. If driver is unsure about recovery or method is unknown (like reboot,
> > +firmware flashing, hardware replacement or any other procedure which can't be
> > +attempted on the fly), ``WEDGED=none`` will be sent instead.
> > +
> > +It is the responsibility of the driver to perform required cleanups (like
> > +disabling system memory access or signalling dma_fences) and prepare itself
> > +for the recovery before sending the event.
> 
> That needs to be more like a warning and should have a bit more text. Maybe
> even a separate section.
> 
> Something like this maybe:
> 
> Prerequisites
> ------------------
> 
> Drivers needs to make sure that the device won't harm the system as a
> whole by keeping it in a wedged state. Necessary actions must include
> disabling DMA to system memory as well as communication channels
> with other devices.
> Further drivers must ensure that all dma_fences are signaled and other
> state the core kernel might depend on are cleaned up.
> All ongoing waiting for hardware state changes must be aborted and
> new accesses to the hardware sufficiently blocked....
> 
> 
> I'm not a native speaker of English, so feel free to re-phrase that. But the

Neither am I, but will try my best.

> general approach should be like don't do this before you have made sure a, b
> and c.

Sure, thanks.

> >   Once the event is sent, driver
> > +should block all IOCTLs with an error code.
> 
> Better define which error code that should be. I think -ENODEV similar to a
> hotplug case would be appropriate.

Why not leave it to the drivers to decide depending on the type of failure?

> >   This will signify the reason for
> > +wegeding which can be reported to the application if needed.
> > +
> > +Userspace consumers can parse this event and attempt recovery as per below
> > +expectations.
> > +
> > +    =============== ==================================
> > +    Recovery method Consumer expectations
> > +    =============== ==================================
> > +    rebind          unbind + rebind driver
> > +    bus-reset       unbind + reset bus device + rebind
> > +    none            admin/user policy
> > +    =============== ==================================
> > +
> > +Example for rebind
> > +~~~~~~~~~~~~~~~~~~
> > +
> > +Udev rule::
> > +
> > +    SUBSYSTEM=="drm", ENV{WEDGED}=="rebind", DEVPATH=="*/drm/card[0-9]",
> > +    RUN+="/path/to/rebind.sh $env{DEVPATH}"
> > +
> > +Recovery script::
> > +
> > +    #!/bin/sh
> > +
> > +    DEVPATH=$(readlink -f /sys/$1/device)
> > +    DEVICE=$(basename $DEVPATH)
> > +    DRIVER=$(readlink -f $DEVPATH/driver)
> > +
> > +    echo -n $DEVICE > $DRIVER/unbind
> > +    sleep 1
> > +    echo -n $DEVICE > $DRIVER/bind
> > +
> > +Although scripts are simple enough for basic recovery, admin/users can define
> > +customized policies around recovery action. For example if the driver supports
> > +multiple recovery methods, consumers can opt for the suitable one based on
> > +policy definition. Consumers can also take additional steps like gathering
> > +telemetry information (devcoredump, syslog)
> 
> I'm really wondering if we shouldn't standardize successful resets with this
> event as well?
> 
> E.g. like there was an issue with device X, please collect the devcoredump.

This seems to fit into WEDGED=none case, although with this we'd probably
need to redefine what 'wedged' means.

> And then as a second step have the WEDGED property to indicate:
> a) reset successful, no actions needed.
> b) detach and rebind from the bus
> c) bus-reset
> d) impossible to recover but system as a whole can continue to work.
> e) it's on fire! Run sync and shut down power immediately.

Raag