Re: [PATCH v4 1/3] drm: Introduce device wedged event

Raag Jadav <raag.jadav@xxxxxxxxx> · Tue, 10 Sep 2024 18:49:00 +0300

On Mon, Sep 09, 2024 at 02:53:23PM -0700, Matt Roper wrote:
> On Fri, Sep 06, 2024 at 03:12:23PM +0530, Raag Jadav wrote:
> > Introduce device wedged event, which will notify userspace of wedged
> > (hanged/unusable) state of the DRM device through a uevent. This is
> > useful especially in cases where the device is in unrecoverable state
> > and requires userspace intervention for recovery.
> > 
> > Purpose of this implementation is to be vendor agnostic. Userspace
> > consumers (sysadmin) can define udev rules to parse this event and
> > take respective action to recover the device.
> > 
> > Consumer expectations:
> > ----------------------
> > 1) Unbind driver
> > 2) Reset bus device
> > 3) Re-bind driver
> > 
> > v4: s/drm_dev_wedged/drm_dev_wedged_event
> >     Use drm_info() (Jani)
> >     Kernel doc adjustment (Aravind)
> > 
> > Signed-off-by: Raag Jadav <raag.jadav@xxxxxxxxx>
> > ---
> >  drivers/gpu/drm/drm_drv.c | 20 ++++++++++++++++++++
> >  include/drm/drm_drv.h     |  1 +
> >  2 files changed, 21 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > index 93543071a500..cca5d8295eb7 100644
> > --- a/drivers/gpu/drm/drm_drv.c
> > +++ b/drivers/gpu/drm/drm_drv.c
> > @@ -499,6 +499,26 @@ void drm_dev_unplug(struct drm_device *dev)
> >  }
> >  EXPORT_SYMBOL(drm_dev_unplug);
> >  
> > +/**
> > + * drm_dev_wedged_event - generate a device wedged uevent
> > + * @dev: DRM device
> > + *
> > + * This generates a device wedged uevent for the DRM device specified by @dev,
> > + * on the basis of which, userspace may take respective action to recover the
> > + * device. Currently we only set WEDGED=1 in the uevent environment, but this
> > + * can be expanded in the future.
> 
> Just to clarify, is "wedged" intended to always mean "the entire device
> is unusable" or are there cases where it would also get sent if only
> part of the device is in a bad state?  For example, using i915/Xe
> terminology, maybe the GT is dead but display is still working.  Or one
> GT is dead, but another is still alive.

The idea is to give drivers a way to recover through userspace intervention.
It is up to the drivers to decide when recovery is needed and how they want
to recover.
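
To make the consumer side a bit more concrete, here is a rough userspace
sketch (not part of this patch). It assumes the consumer already has the raw
uevent payload in a buffer, i.e. NUL-separated "KEY=VALUE" strings as delivered
on the kobject uevent netlink socket, and only checks for the WEDGED=1 entry
mentioned in the kernel doc. The helper name uevent_is_wedged() is hypothetical,
and what the consumer does afterwards (unbind, bus reset, re-bind) is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return true if the uevent payload contains the
 * exact entry "WEDGED=1".
 *
 * buf holds NUL-separated "KEY=VALUE" strings and len is the total number
 * of bytes in the payload. The last entry may or may not be NUL-terminated.
 */
static bool uevent_is_wedged(const char *buf, size_t len)
{
	static const char key[] = "WEDGED=1";
	const size_t key_len = sizeof(key) - 1;
	size_t off = 0;

	while (off < len) {
		const char *entry = buf + off;
		const char *end = memchr(entry, '\0', len - off);
		size_t n = end ? (size_t)(end - entry) : len - off;

		/* Require an exact match so "WEDGED=10" is not accepted. */
		if (n == key_len && memcmp(entry, key, key_len) == 0)
			return true;

		off += n + 1;	/* skip the entry and its NUL separator */
	}

	return false;
}
```

A udev rule matching on the same environment variable would be the more
likely route for sysadmins. The above only shows what such a consumer would
be looking for in the event.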

> Basically, is this event intended as a signal that userspace should stop
> trying to do _anything_ with the device, or just that the device has
> degraded functionality in some way (and maybe userspace can still do
> something useful if it's lucky)?  It would be good to clarify that in
> the docs here in case different drivers have different ideas about how
> this is expected to work.

And hence the open discussion. Improvements are welcome :)

Raag