On Fri, Oct 18, 2024 at 05:07:22PM -0400, Alex Deucher wrote:
> On Fri, Oct 18, 2024 at 1:56 PM André Almeida <andrealmeid@xxxxxxxxxx> wrote:
> >
> > On 18/10/2024 12:31, Alex Deucher wrote:
> > > On Fri, Oct 18, 2024 at 11:23 AM Rodrigo Vivi <rodrigo.vivi@xxxxxxxxx> wrote:
> > >>
> > >> On Thu, Oct 17, 2024 at 04:16:09PM -0300, André Almeida wrote:
> > >>> Hi Raag,
> > >>>
> > >>> On 30/09/2024 04:38, Raag Jadav wrote:
> > >>>> Introduce device wedged event, which will notify userspace of wedged
> > >>>> (hung/unusable) state of the DRM device through a uevent. This is
> > >>>> useful especially in cases where the device is no longer operating as
> > >>>> expected even after a hardware reset and has become unrecoverable from
> > >>>> driver context.
> > >>>>
> > >>>> Purpose of this implementation is to provide drivers a generic way to
> > >>>> recover with the help of userspace intervention. Different drivers may
> > >>>> have different ideas of a "wedged device" depending on their hardware
> > >>>> implementation, and hence the vendor agnostic nature of the event.
> > >>>> It is up to the drivers to decide when they see the need for recovery
> > >>>> and how they want to recover from the available methods.
> > >>>>
> > >>>> Current implementation defines three recovery methods, out of which,
> > >>>> drivers can choose to support any one or multiple of them. Preferred
> > >>>> recovery method will be sent in the uevent environment as WEDGED=<method>.
> > >>>> Userspace consumers (sysadmin) can define udev rules to parse this event
> > >>>> and take respective action to recover the device.
> > >>>>
> > >>>>     =============== ==================================
> > >>>>     Recovery method Consumer expectations
> > >>>>     =============== ==================================
> > >>>>     rebind          unbind + rebind driver
> > >>>>     bus-reset       unbind + reset bus device + rebind
> > >>>>     reboot          reboot system
> > >>>>     =============== ==================================
> > >>>>
> > >>>>
> > >>>
> > >>> I proposed something similar in the past: https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealmeid@xxxxxxxxxx/
> > >>>
> > >>> The motivation was that amdgpu was getting stuck after every GPU reset, and
> > >>> there was just a black screen. The uevent would then trigger a daemon to
> > >>> reset the compositor and get things back together. As you can see in my
> > >>> thread, the feature was blocked in favor of getting better overall GPU reset
> > >>> from the kernel side.
> > >>>
> > >>> Which kinds of scenarios make i915/xe need userspace involvement? I
> > >>> tested a bunch of resets in i915 but never managed to get the
> > >>> driver stuck.
> > >>
> > >> 2 scenarios:
> > >>
> > >> 1. Multiple levels of reset have failed and device was declared wedged. This is
> > >>    rare indeed as the resets improved a lot.
> > >> 2. Debug case. We can boot the driver with option to declare device wedged at
> > >>    any timeout, so the device can be debugged.
> > >>
> > >>>
> > >>> For the bus-reset, amdgpu does that too, but it doesn't require userspace
> > >>> intervention.
> > >>
> > >> How do you trigger that?
> > >
> > > What do you mean by bus reset? I think Christian is just referring
> > > to a full adapter reset (as opposed to a queue reset or something more
> > > fine grained). Driver can reset the device via MMIO or firmware,
> > > depending on the device. I think there are also PCI helpers for
> > > things like PCI FLR.
> > >
> >
> > I was referring to AMD_RESET_PCI:
> >
> > "Does a full bus reset using core Linux subsystem PCI reset and does a
> > secondary bus reset or FLR, depending on what the underlying hardware
> > supports."
> >
> > And that can be triggered by using `amdgpu_reset_method=5` as the module
> > option.
> >
>
> That option doesn't actually do anything useful on most AMD GPUs. We
> don't support FLR on most boards and SBR doesn't work once the driver
> has been loaded except for really old chips. That said, internally
> these all end up being mode1 or mode2 resets which the driver can
> trigger directly and which are the defaults.

okay, this is the same for us then. And this is the main reason that
we have this option:

- unbind + reset bus device + rebind

unbind by itself needs to be a supported and working case regardless of
the reset state. Then this sequence should be fine.

Afaik there's no way that the driver itself could call for the bus reset.

>
> Alex
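
For illustration only, here is a rough sketch of how a userspace consumer might turn the WEDGED=<method> value into the steps from the recovery table quoted above. It is not part of the patch. The function names (parse_uevent, plan_recovery, sysfs_writes) are made up for this example. It assumes the device is a PCI device and that the raw uevent payload is a sequence of NUL-separated KEY=VALUE records. The code only builds a plan and writes nothing to sysfs.

```python
from typing import Dict, List, Tuple

# Recovery methods and consumer expectations, copied from the table in the
# quoted patch description.
RECOVERY_STEPS: Dict[str, List[str]] = {
    "rebind": ["unbind", "rebind"],
    "bus-reset": ["unbind", "bus-reset", "rebind"],
    "reboot": ["reboot"],
}


def parse_uevent(payload: bytes) -> Dict[str, str]:
    """Parse NUL-separated KEY=VALUE records from a uevent payload.

    Records without '=' (such as the leading "action@devpath" header or
    empty trailing records) are skipped.
    """
    env: Dict[str, str] = {}
    for record in payload.split(b"\0"):
        key, sep, value = record.decode("utf-8", "replace").partition("=")
        if sep:
            env[key] = value
    return env


def plan_recovery(env: Dict[str, str]) -> List[str]:
    """Return the ordered recovery steps for the WEDGED value, if any."""
    method = env.get("WEDGED")
    if method is None:
        return []  # not a wedged event, nothing to do
    if method not in RECOVERY_STEPS:
        raise ValueError(f"unknown recovery method: {method!r}")
    return list(RECOVERY_STEPS[method])


def sysfs_writes(step: str, pci_addr: str, driver: str) -> Tuple[str, str]:
    """Map a step to a (sysfs path, value) pair. ASSUMPTION: PCI device."""
    if step == "unbind":
        return (f"/sys/bus/pci/drivers/{driver}/unbind", pci_addr)
    if step == "rebind":
        return (f"/sys/bus/pci/drivers/{driver}/bind", pci_addr)
    if step == "bus-reset":
        return (f"/sys/bus/pci/devices/{pci_addr}/reset", "1")
    raise ValueError(f"step {step!r} is not a sysfs write")
```

A real consumer would more likely be a udev rule that matches on the WEDGED property and runs a small script. The "reboot" step is left out of sysfs_writes on purpose, since it is a system action and not a sysfs write.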