Re: [PATCH v7 1/5] drm: Introduce device wedged event

André Almeida <andrealmeid@xxxxxxxxxx> · Fri, 18 Oct 2024 14:56:17 -0300

On 18/10/2024 12:31, Alex Deucher wrote:
On Fri, Oct 18, 2024 at 11:23 AM Rodrigo Vivi <rodrigo.vivi@xxxxxxxxx> wrote:

On Thu, Oct 17, 2024 at 04:16:09PM -0300, André Almeida wrote:
Hi Raag,

On 30/09/2024 04:38, Raag Jadav wrote:
Introduce device wedged event, which will notify userspace of wedged
(hung/unusable) state of the DRM device through a uevent. This is
useful especially in cases where the device is no longer operating as
expected even after a hardware reset and has become unrecoverable from
driver context.

The purpose of this implementation is to provide drivers a generic way to
recover with the help of userspace intervention. Different drivers may
have different ideas of a "wedged device" depending on their hardware
implementation, and hence the vendor agnostic nature of the event.
It is up to the drivers to decide when they see the need for recovery
and how they want to recover from the available methods.

The current implementation defines three recovery methods, of which drivers
can choose to support one or more. The preferred recovery method is sent in
the uevent environment as WEDGED=<method>.
Userspace consumers (sysadmin) can define udev rules to parse this event
and take respective action to recover the device.

      =============== ==================================
      Recovery method Consumer expectations
      =============== ==================================
      rebind          unbind + rebind driver
      bus-reset       unbind + reset bus device + rebind
      reboot          reboot system
      =============== ==================================
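
To make the table concrete, here is a minimal Python sketch of how a userspace consumer might turn the WEDGED=<method> value into recovery steps. It only builds a plan and performs no I/O. The names parse_uevent and plan_recovery, the sample fields ACTION and SUBSYSTEM, the PCI address, and the driver name are all hypothetical. The sketch also assumes a PCI device, so that the standard PCI sysfs bind, unbind and reset files apply; the patch itself does not prescribe how userspace carries out each step.

```python
from typing import Dict, List, Tuple

# Consumer expectations per recovery method, copied from the table above.
RECOVERY_STEPS: Dict[str, List[str]] = {
    "rebind": ["unbind", "bind"],
    "bus-reset": ["unbind", "reset", "bind"],
    "reboot": ["reboot"],
}


def parse_uevent(payload: str) -> Dict[str, str]:
    """Parse KEY=VALUE pairs from a uevent payload (NUL- or newline-separated)."""
    env: Dict[str, str] = {}
    for field in payload.replace("\0", "\n").splitlines():
        key, sep, value = field.partition("=")
        if sep:  # skip fields without '=', such as an "action@devpath" header
            env[key] = value
    return env


def plan_recovery(
    env: Dict[str, str], pci_addr: str, driver: str
) -> List[Tuple[str, str]]:
    """Return (target, data) actions for the WEDGED method; performs no I/O.

    Assumption: the device is a PCI device, so the standard PCI sysfs
    bind/unbind/reset files apply. A "reboot" step is returned symbolically
    as ("reboot", "") and left to the caller to act on.
    """
    method = env.get("WEDGED")
    if method is None:
        return []  # not a wedged event
    if method not in RECOVERY_STEPS:
        raise ValueError(f"unknown recovery method: {method!r}")

    drv_dir = f"/sys/bus/pci/drivers/{driver}"
    actions = {
        "unbind": (f"{drv_dir}/unbind", pci_addr),
        "reset": (f"/sys/bus/pci/devices/{pci_addr}/reset", "1"),
        "bind": (f"{drv_dir}/bind", pci_addr),
        "reboot": ("reboot", ""),
    }
    return [actions[step] for step in RECOVERY_STEPS[method]]


if __name__ == "__main__":
    # Hypothetical payload; only the WEDGED key comes from the patch.
    sample = "ACTION=change\0SUBSYSTEM=drm\0WEDGED=bus-reset\0"
    for target, data in plan_recovery(parse_uevent(sample), "0000:03:00.0", "xe"):
        print(target, data)
```

Keeping the planning separate from the sysfs writes means the mapping can be checked without root access or a wedged device; a real consumer would then write each data string to its target path, or hand the reboot step to the init system.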

I proposed something similar in the past: https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealmeid@xxxxxxxxxx/

The motivation was that amdgpu was getting stuck after every GPU reset, and
there was just a black screen. The uevent would then trigger a daemon to
reset the compositor and get things back together. As you can see in my
thread, the feature was blocked in favor of getting better overall GPU reset
from the kernel side.

What kind of scenarios make i915/xe need userspace involvement? I tested a
bunch of resets in i915 but never managed to get the driver stuck.

2 scenarios:

1. Multiple levels of reset have failed and the device was declared wedged. This
is rare indeed, as the resets have improved a lot.
2. Debug case. We can boot the driver with an option to declare the device wedged
at any timeout, so the device can be debugged.

For the bus-reset, amdgpu does that too, but it doesn't require userspace
intervention.

How do you trigger that?

What do you mean by bus reset?  I think Christian is just referring
to a full adapter reset (as opposed to a queue reset or something more
fine grained).  Driver can reset the device via MMIO or firmware,
depending on the device.  I think there are also PCI helpers for
things like PCI FLR.

I was referring to AMD_RESET_PCI:

"Does a full bus reset using core Linux subsystem PCI reset and does a 
secondary bus reset or FLR, depending on what the underlying hardware 
supports."

And that can be triggered by using `amdgpu_reset_method=5` as the module 
option.