Re: [PATCH v7 1/5] drm: Introduce device wedged event

Rodrigo Vivi <rodrigo.vivi@xxxxxxxxx> · Thu, 24 Oct 2024 13:48:48 -0400

On Fri, Oct 18, 2024 at 05:07:22PM -0400, Alex Deucher wrote:
> On Fri, Oct 18, 2024 at 1:56 PM André Almeida <andrealmeid@xxxxxxxxxx> wrote:
> >
> > On 18/10/2024 12:31, Alex Deucher wrote:
> > > On Fri, Oct 18, 2024 at 11:23 AM Rodrigo Vivi <rodrigo.vivi@xxxxxxxxx> wrote:
> > >>
> > >> On Thu, Oct 17, 2024 at 04:16:09PM -0300, André Almeida wrote:
> > >>> Hi Raag,
> > >>>
> > >>> On 30/09/2024 04:38, Raag Jadav wrote:
> > >>>> Introduce device wedged event, which will notify userspace of wedged
> > >>>> (hung/unusable) state of the DRM device through a uevent. This is
> > >>>> useful especially in cases where the device is no longer operating as
> > >>>> expected even after a hardware reset and has become unrecoverable from
> > >>>> driver context.
> > >>>>
> > >>>> Purpose of this implementation is to provide drivers a generic way to
> > >>>> recover with the help of userspace intervention. Different drivers may
> > >>>> have different ideas of a "wedged device" depending on their hardware
> > >>>> implementation, and hence the vendor agnostic nature of the event.
> > >>>> It is up to the drivers to decide when they see the need for recovery
> > >>>> and how they want to recover from the available methods.
> > >>>>
> > >>>> The current implementation defines three recovery methods, of which
> > >>>> drivers can choose to support one or more. The preferred
> > >>>> recovery method is sent in the uevent environment as WEDGED=<method>.
> > >>>> Userspace consumers (sysadmin) can define udev rules to parse this event
> > >>>> and take the respective action to recover the device.
> > >>>>
> > >>>>       =============== ==================================
> > >>>>       Recovery method Consumer expectations
> > >>>>       =============== ==================================
> > >>>>       rebind          unbind + rebind driver
> > >>>>       bus-reset       unbind + reset bus device + rebind
> > >>>>       reboot          reboot system
> > >>>>       =============== ==================================
> > >>>>
> > >>>>
> > >>>
> > >>> I proposed something similar in the past: https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealmeid@xxxxxxxxxx/
> > >>>
> > >>> The motivation was that amdgpu was getting stuck after every GPU reset, and
> > >>> there was just a black screen. The uevent would then trigger a daemon to
> > >>> reset the compositor and get things back together. As you can see in my
> > >>> thread, the feature was blocked in favor of getting better overall GPU reset
> > >>> from the kernel side.
> > >>>
> > >>> Which kinds of scenarios make i915/xe need userspace involvement?
> > >>> I tested a bunch of resets in i915 but never managed to get the
> > >>> driver stuck.
> > >>
> > >> 2 scenarios:
> > >>
> > >> 1. Multiple levels of reset have failed and the device was declared wedged.
> > >> This is rare indeed, as the resets have improved a lot.
> > >> 2. Debug case. We can boot the driver with an option to declare the device
> > >> wedged at any timeout, so the device can be debugged.
> > >>
> > >>>
> > >>> For the bus-reset, amdgpu does that too, but it doesn't require userspace
> > >>> intervention.
> > >>
> > >> How do you trigger that?
> > >
> > > What do you mean by bus reset?  I think Christian is just referring
> > > to a full adapter reset (as opposed to a queue reset or something more
> > > fine grained).  Driver can reset the device via MMIO or firmware,
> > > depending on the device.  I think there are also PCI helpers for
> > > things like PCI FLR.
> > >
> >
> > I was referring to AMD_RESET_PCI:
> >
> > "Does a full bus reset using core Linux subsystem PCI reset and does a
> > secondary bus reset or FLR, depending on what the underlying hardware
> > supports."
> >
> > And that can be triggered by using `amdgpu_reset_method=5` as the module
> > option.
> >
> 
> That option doesn't actually do anything useful on most AMD GPUs.  We
> don't support FLR on most boards and SBR doesn't work once the driver
> has been loaded except for really old chips.  That said, internally
> these all end up being mode1 or mode2 resets which the driver can
> trigger directly and which are the defaults.

okay, this is the same for us then.
And this is the main reason that we have this option:
- unbind + reset bus device + rebind

unbind by itself needs to be a supported and working case regardless
of the reset state. Then this sequence should be fine.

AFAIK there's no way that the driver itself could call for the bus
reset.
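
To make the sequence concrete, here is a rough sketch of what a userspace
consumer might do once it sees WEDGED=<method> in the uevent environment.
This is only an illustration under assumptions, not part of the patch: it
assumes a PCI device and uses the generic PCI sysfs files (drivers/<drv>/unbind,
drivers/<drv>/bind, devices/<addr>/reset). The helper names, the "xe" driver
name and the 0000:03:00.0 address are made up for the example, and the sysfs
reset file only exists when the kernel has a reset method for the device.

```python
SYSFS_PCI = "/sys/bus/pci"


def parse_wedged(env):
    """Return the WEDGED=<method> value from a uevent environment dict, or None."""
    return env.get("WEDGED")


def recovery_plan(method, pci_addr, driver):
    """Map a recovery method to an ordered list of (sysfs_path, value) writes.

    Hypothetical helper: the method names come from the table in the patch,
    the sysfs paths are the generic PCI ones and assume a PCI device.
    """
    unbind = (f"{SYSFS_PCI}/drivers/{driver}/unbind", pci_addr)
    reset = (f"{SYSFS_PCI}/devices/{pci_addr}/reset", "1")
    bind = (f"{SYSFS_PCI}/drivers/{driver}/bind", pci_addr)

    if method == "rebind":
        return [unbind, bind]
    if method == "bus-reset":
        # unbind first, so the reset happens with no driver attached
        return [unbind, reset, bind]
    if method == "reboot":
        # nothing to write to sysfs; rebooting is left to system policy
        return []
    raise ValueError(f"unknown recovery method: {method!r}")


def apply_plan(plan, dry_run=True):
    """Perform the writes in order, or only print them when dry_run is set."""
    for path, value in plan:
        if dry_run:
            print(f"echo {value} > {path}")
        else:
            with open(path, "w") as f:
                f.write(value)


if __name__ == "__main__":
    # Example uevent environment; address and driver name are placeholders.
    env = {"ACTION": "change", "SUBSYSTEM": "drm", "WEDGED": "bus-reset"}
    method = parse_wedged(env)
    if method is not None:
        apply_plan(recovery_plan(method, "0000:03:00.0", "xe"), dry_run=True)
```

The plan is kept separate from the writes so the ordering can be checked
without touching hardware. Note that every method except reboot starts with
unbind, which is why unbind has to work on a wedged device.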

> 
> Alex