Re: [PATCH v7 1/5] drm: Introduce device wedged event

Alex Deucher <alexdeucher@xxxxxxxxx> · Fri, 18 Oct 2024 11:31:17 -0400

On Fri, Oct 18, 2024 at 11:23 AM Rodrigo Vivi <rodrigo.vivi@xxxxxxxxx> wrote:
>
> On Thu, Oct 17, 2024 at 04:16:09PM -0300, André Almeida wrote:
> > Hi Raag,
> >
> > On 30/09/2024 04:38, Raag Jadav wrote:
> > > Introduce device wedged event, which will notify userspace of wedged
> > > (hung/unusable) state of the DRM device through a uevent. This is
> > > useful especially in cases where the device is no longer operating as
> > > expected even after a hardware reset and has become unrecoverable from
> > > driver context.
> > >
> > > Purpose of this implementation is to provide drivers a generic way to
> > > recover with the help of userspace intervention. Different drivers may
> > > have different ideas of a "wedged device" depending on their hardware
> > > implementation, and hence the vendor agnostic nature of the event.
> > > It is up to the drivers to decide when they see the need for recovery
> > > and how they want to recover from the available methods.
> > >
> > > The current implementation defines three recovery methods, of which
> > > drivers can choose to support one or more. The preferred recovery
> > > method is sent in the uevent environment as WEDGED=<method>.
> > > Userspace consumers (sysadmin) can define udev rules to parse this event
> > > and take respective action to recover the device.
> > >
> > >      =============== ==================================
> > >      Recovery method Consumer expectations
> > >      =============== ==================================
> > >      rebind          unbind + rebind driver
> > >      bus-reset       unbind + reset bus device + rebind
> > >      reboot          reboot system
> > >      =============== ==================================
> > >
> > >
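
A minimal sketch of how a userspace consumer might map the WEDGED=<method>
value from the table above to recovery steps. It assumes the uevent
environment arrives as KEY=VALUE lines; the names RECOVERY_STEPS,
parse_uevent and recovery_steps are hypothetical and not part of the patch.

```python
# Hypothetical consumer-side helper; names are illustrative, not from the patch.

# Recovery methods and consumer expectations, as listed in the table above.
RECOVERY_STEPS = {
    "rebind": ["unbind", "rebind"],
    "bus-reset": ["unbind", "reset bus device", "rebind"],
    "reboot": ["reboot"],
}


def parse_uevent(payload):
    """Parse KEY=VALUE lines (assumed uevent environment format) into a dict."""
    env = {}
    for line in payload.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without '='
            env[key.strip()] = value.strip()
    return env


def recovery_steps(payload):
    """Return the ordered steps for a wedged event, or None if WEDGED is absent."""
    method = parse_uevent(payload).get("WEDGED")
    if method is None:
        return None
    if method not in RECOVERY_STEPS:
        raise ValueError(f"unknown recovery method: {method!r}")
    return list(RECOVERY_STEPS[method])


if __name__ == "__main__":
    # Example payload; the SUBSYSTEM line is an assumption for illustration.
    print(recovery_steps("SUBSYSTEM=drm\nWEDGED=bus-reset"))
```

Raising on an unknown method, rather than guessing, keeps the consumer from
taking a recovery action the driver did not ask for.
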
> >
> > I proposed something similar in the past: https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealmeid@xxxxxxxxxx/
> >
> > The motivation was that amdgpu was getting stuck after every GPU reset, and
> > there was just a black screen. The uevent would then trigger a daemon to
> > reset the compositor and get things back together. As you can see in my
> > thread, the feature was blocked in favor of getting better overall GPU reset
> > from the kernel side.
> >
> > What kinds of scenarios make userspace involvement necessary for
> > i915/xe? I tested a bunch of resets in i915 but never managed to get the
> > driver stuck.
>
> 2 scenarios:
>
> 1. Multiple levels of reset have failed and the device was declared wedged.
> This is rare indeed, as the resets have improved a lot.
> 2. Debug case. We can boot the driver with an option to declare the device
> wedged on any timeout, so the device can be debugged.
>
> >
> > For the bus-reset, amdgpu does that too, but it doesn't require userspace
> > intervention.
>
> How do you trigger that?

What do you mean by bus reset?  I think Christian is just referring
to a full adapter reset (as opposed to a queue reset or something more
fine-grained).  The driver can reset the device via MMIO or firmware,
depending on the device.  I think there are also PCI helpers for
things like PCI FLR.
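
For comparison, a hedged sketch of what the "unbind + reset bus device +
rebind" sequence from the quoted table could look like from userspace, using
the standard PCI sysfs attributes (drivers/<name>/unbind, devices/<BDF>/reset,
drivers/<name>/bind). Mapping "reset bus device" to the per-device sysfs
reset attribute is an assumption; the kernel picks the reset method (FLR or
otherwise), and the attribute exists only if the device supports a reset.
The function names and the example BDF/driver are hypothetical.

```python
from pathlib import PurePosixPath

SYSFS_PCI = PurePosixPath("/sys/bus/pci")


def bus_reset_plan(bdf, driver):
    """Build the ordered (sysfs path, value) writes for unbind + reset + rebind.

    bdf is the PCI address (e.g. "0000:03:00.0"); driver is the PCI driver name.
    """
    drv = SYSFS_PCI / "drivers" / driver
    dev = SYSFS_PCI / "devices" / bdf
    return [
        (str(drv / "unbind"), bdf),   # detach the driver from the device
        (str(dev / "reset"), "1"),    # ask the PCI core to reset the function
        (str(drv / "bind"), bdf),     # reattach the driver
    ]


def apply_plan(plan, dry_run=True):
    """Print the writes by default; with dry_run=False, perform them (needs root)."""
    for path, value in plan:
        if dry_run:
            print(f"echo {value} > {path}")
        else:
            with open(path, "w") as f:
                f.write(value)


if __name__ == "__main__":
    # Example values only; substitute the real address and driver name.
    apply_plan(bus_reset_plan("0000:03:00.0", "xe"))
```

The dry-run default is deliberate: unbinding a display driver on a live
system is disruptive, so the sketch only prints the writes unless asked.
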

Alex