Re: [PATCH v10 2/4] drm/doc: Document device wedged event

Raag Jadav <raag.jadav@xxxxxxxxx> · Tue, 28 Jan 2025 11:37:53 +0200

On Mon, Jan 27, 2025 at 12:23:28PM +0200, Pekka Paalanen wrote:
> On Wed, 22 Jan 2025 07:22:25 +0200
> Raag Jadav <raag.jadav@xxxxxxxxx> wrote:
> 
> > On Tue, Jan 21, 2025 at 02:14:56AM +0100, Xaver Hugl wrote:
> > > > +It is the responsibility of the consumer to make sure that the device or
> > > > +its resources are not in use by any process before attempting recovery.  
> > > I'm not convinced this is actually doable in practice, outside of
> > > killing all apps that aren't the one trying to recover the GPU.
> > > Is this just about not crashing those processes if they don't handle
> > > GPU hotunplugs well, about leaks, or something else?  
> > 
> > Correct, all of it. And since the compositor is in charge of device resources,
> > this way it atleast has the opportunity to recover the device and recreate
> > context without all the userspace violence.
> 
> Hi Raag,
> 
> sorry, I haven't followed this series, so I wonder, why should
> userspace be part of recovering the device? Why doesn't the kernel
> automatically load a new driver instance with a new DRM device node?

There are things like bus level reset (PCI SBR) and re-enumeration that are
not possible from driver context (or atleast I'm not aware of it), so a new
instance is just as useful/less as the old one.

> Of course userspace needs to deal with stuff suddenly erroring out, and
> destroy existing related resources, then wait for a working device
> to appear and rebuild all state. The kernel driver already needs to
> make the existing open stuff inert and harmless, why does it need an
> acknowledgement from userspace to unbind and re-bind?

Rebind is kind of a stepping stone to the above.

> > I'm not entirely aware of its feasibility though, perhaps something for the
> > consumers to experiment.
> 
> If consumers mean userspace, then no, not reliably. But the kernel can
> do it.

Can you please elaborate or refer to an example?

> I see in the commit message written:
> 
> 	"For example, if the driver supports multiple recovery methods,
> 	consumers can opt for the suitable one based on policy
> 	definition."
> 
> How could consumers know what to do? How can they guess what would be
> enough to recover the device? Isn't that the kernel driver's job to
> know?

Yes, 'WEDGED=' value are the known methods that are expected to work. The
policy is how the consumer can decide which one to opt for depending on the
scenario. For example, the less drastic method could work in most cases, but
you'd probably want to opt for a more drastic method for repeat offences or
perhaps if something more serious is discovered from "optional telemetry
collection".

> (More important for userspace would be know if dmabuf fds remain
> pointing to valid memory retaining its contents or if the contents are
> lost. Userspace cannot tell which device a dmabuf originates from,
> AFAIK, so this would need to be added in the generic dmabuf UAPI.)

Not sure if I understand, perhaps Christian can shed some light here.

> 	"Consumers can also choose to have the device available for
> 	debugging or additional data collection before performing the
> 	recovery."
> 
> Couldn't the wedged driver instance remain detached from the hardware
> while a new driver instance initializes? Then debug data remains until
> the wedged device is fully closed from userspace, or maybe devcore dump
> retains it.
> 
> I presume that WEDGED=none case should retain the debug data somehow as
> well.

Indeed, but it's optional so depends on the driver.

> > > > +With IOCTLs blocked and device already 'wedged', all device memory should
> 
> btw. when I see "blocked" I think of the function call not returning
> yet. But in this patch "blocked" seems to be synonymous for "returns
> an error immediately". Would it be possible to avoid the word "blocked"
> for this?

It is meant as "blocking the access", but fair enough. We can have a quick
fix later on if it's not too big of a concern.

> > > > +be unmapped and file descriptors should be closed to prevent leaks.  
> > > Afaiu from a userspace POV, a rebind is just like a GPU hotunplug +
> > > hotplug with matching "remove" and "add" udev events. As long as the
> > > application cleans up resources related to the device when it receives
> > > the event, there should be no leaks with a normal hotunplug... Is this
> > > different enough that we can't have the same expectations?  
> > 
> > The thing about "remove" event is that it is generated *after* we opt for an
> > unbind, and at that point it might be already too late if userspace doesn't
> > get enough time to clean things up while the device is removed with a live
> > client resulting in unknown consequences.
> > 
> > The idea here is to clean things up *before* we opt for an unbind leaving
> > no room for side effects.
> 
> Something here feels fragile. There should not be a deadline for
> userspace to finish cleaning up. What was described for KMS device nodes
> in this same document seems like a more reliable approach: keep the
> dead driver instance around until userspace has closed all references
> to it. The device node could be removed earlier.

I'm not sure if I'm following here. The driver instance will exist as long
as the dead device exists, which the consumer can remove if/when it chooses
to trigger an unbind from userspace. There is no deadline for it.

The consumer can choose to rely on hotplug events if it wishes, but the point
here is that it doesn't guarantee a clean recovery in all cases.

Raag