On Tue, 28 Jan 2025 11:37:53 +0200
Raag Jadav <raag.jadav@xxxxxxxxx> wrote:

> On Mon, Jan 27, 2025 at 12:23:28PM +0200, Pekka Paalanen wrote:
> > On Wed, 22 Jan 2025 07:22:25 +0200
> > Raag Jadav <raag.jadav@xxxxxxxxx> wrote:
> >
> > > On Tue, Jan 21, 2025 at 02:14:56AM +0100, Xaver Hugl wrote:
> > > > > +It is the responsibility of the consumer to make sure that the device or
> > > > > +its resources are not in use by any process before attempting recovery.
> > > >
> > > > I'm not convinced this is actually doable in practice, outside of
> > > > killing all apps that aren't the one trying to recover the GPU.
> > > > Is this just about not crashing those processes if they don't
> > > > handle GPU hotunplugs well, about leaks, or something else?
> > >
> > > Correct, all of it. And since the compositor is in charge of device
> > > resources, this way it at least has the opportunity to recover the
> > > device and recreate the context without all the userspace violence.
> >
> > Hi Raag,
> >
> > sorry, I haven't followed this series, so I wonder, why should
> > userspace be part of recovering the device? Why doesn't the kernel
> > automatically load a new driver instance with a new DRM device node?
>
> There are things like bus level reset (PCI SBR) and re-enumeration that
> are not possible from driver context (or at least I'm not aware of it),
> so a new instance is just as useful (or useless) as the old one.

Ok, "not possible from driver context" is a key revelation. I wonder if
starting an overview section with that in the documentation would help
getting the right mindset. Did I miss that somewhere?

I thought bus level reset meant resetting the PCI device by some bus
API. Clearly mistaken, I suppose you mean resetting the whole bus
including everything on it?

> > I see in the commit message written:
> >
> >   "For example, if the driver supports multiple recovery methods,
> >   consumers can opt for the suitable one based on policy
> >   definition."
> >
> > How could consumers know what to do? How can they guess what would be
> > enough to recover the device? Isn't that the kernel driver's job to
> > know?
>
> Yes, the 'WEDGED=' values are the known methods that are expected to
> work. The policy is how the consumer can decide which one to opt for
> depending on the scenario. For example, the less drastic method could
> work in most cases, but you'd probably want to opt for a more drastic
> method for repeat offences, or perhaps if something more serious is
> discovered from "optional telemetry collection".

Aha, cool.
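To check my understanding, below is roughly what I would expect a
compositor to do with that. A minimal sketch using libudev; the "WEDGED"
property name, the method strings and the escalation policy are my
assumptions from this thread, not something I have verified against the
series:

/*
 * Hypothetical sketch only: a consumer waiting for the wedged uevent
 * with libudev and applying its own recovery policy. The "WEDGED"
 * property name and the "rebind"/"bus-reset" method strings are my
 * guesses from this thread, not confirmed UAPI.
 */
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <libudev.h>

int main(void)
{
	struct udev *udev = udev_new();
	struct udev_monitor *mon = udev_monitor_new_from_netlink(udev, "udev");
	struct pollfd pfd;
	int wedge_count = 0;

	udev_monitor_filter_add_match_subsystem_devtype(mon, "drm", NULL);
	udev_monitor_enable_receiving(mon);
	pfd.fd = udev_monitor_get_fd(mon);
	pfd.events = POLLIN;

	while (poll(&pfd, 1, -1) > 0) {
		struct udev_device *dev = udev_monitor_receive_device(mon);
		const char *methods;

		if (!dev)
			continue;

		methods = udev_device_get_property_value(dev, "WEDGED");
		if (methods) {
			/* Policy lives here: least drastic advertised
			 * method first, escalate on repeat offences. */
			wedge_count++;
			if (wedge_count > 1 && strstr(methods, "bus-reset"))
				printf("escalate: bus-reset of %s\n",
				       udev_device_get_syspath(dev));
			else if (strstr(methods, "rebind"))
				printf("tear down, then unbind %s\n",
				       udev_device_get_syspath(dev));
		}
		udev_device_unref(dev);
	}
	return 0;
}

That is, the kernel only advertises the methods that are expected to
work, and escalating from one to another is entirely the consumer's
policy.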
> > (More important for userspace would be to know if dmabuf fds remain
> > pointing to valid memory retaining its contents, or if the contents
> > are lost. Userspace cannot tell which device a dmabuf originates
> > from, AFAIK, so this would need to be added in the generic dmabuf
> > UAPI.)
>
> Not sure if I understand, perhaps Christian can shed some light here.

A system might have multiple GPUs, and one GPU going down may leave all
the rest working as usual. A Wayland compositor would want to tell the
difference between still-good and possibly or definitely lost dmabufs
it received from its clients. But this is off-topic in this thread, I
believe; it has no bearing on the series at hand.

> > > > > +With IOCTLs blocked and device already 'wedged', all device memory should
> >
> > Btw. when I see "blocked" I think of the function call not returning
> > yet. But in this patch "blocked" seems to be synonymous with "returns
> > an error immediately". Would it be possible to avoid the word
> > "blocked" for this?
>
> It is meant as "blocking the access", but fair enough. We can have a
> quick fix later on if it's not too big of a concern.

Sure, I don't mind.

> > > > > +be unmapped and file descriptors should be closed to prevent leaks.
> > > >
> > > > Afaiu from a userspace POV, a rebind is just like a GPU hotunplug
> > > > + hotplug with matching "remove" and "add" udev events. As long
> > > > as the application cleans up resources related to the device when
> > > > it receives the event, there should be no leaks with a normal
> > > > hotunplug... Is this different enough that we can't have the same
> > > > expectations?
> > >
> > > The thing about the "remove" event is that it is generated *after*
> > > we opt for an unbind, and at that point it might be already too
> > > late if userspace doesn't get enough time to clean things up while
> > > the device is removed with a live client, resulting in unknown
> > > consequences.
> > >
> > > The idea here is to clean things up *before* we opt for an unbind,
> > > leaving no room for side effects.
> >
> > Something here feels fragile. There should not be a deadline for
> > userspace to finish cleaning up. What was described for KMS device
> > nodes in this same document seems like a more reliable approach: keep
> > the dead driver instance around until userspace has closed all
> > references to it. The device node could be removed earlier.
>
> I'm not sure if I'm following here. The driver instance will exist as
> long as the dead device exists, which the consumer can remove if/when
> it chooses to trigger an unbind from userspace. There is no deadline
> for it.
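Ok. Writing down the sequence you are describing, I arrive at roughly
the below. This is a hypothetical sketch: the PCI address and the "xe"
driver name are placeholders I made up, only the sysfs unbind/bind
interface is the standard one:

/*
 * Hypothetical sketch of the consumer-driven recovery sequence as I
 * understand it. The PCI address and driver name are made up; the
 * sysfs unbind/bind files are the standard driver-core interface.
 */
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>

static int sysfs_write(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);
	ssize_t ret;

	if (fd < 0)
		return -1;
	ret = write(fd, val, strlen(val));
	close(fd);
	return ret < 0 ? -1 : 0;
}

int recover(int drm_fd, void *map, size_t map_len)
{
	/* 1. Tear down all device resources *before* the unbind:
	 *    unmap device memory and close every fd so nothing leaks. */
	munmap(map, map_len);
	close(drm_fd);

	/* 2. Only then trigger the unbind/rebind from userspace.
	 *    "0000:03:00.0" and "xe" are placeholders. */
	if (sysfs_write("/sys/bus/pci/drivers/xe/unbind", "0000:03:00.0"))
		return -1;
	return sysfs_write("/sys/bus/pci/drivers/xe/bind", "0000:03:00.0");
}

The important part being that the teardown happens before the unbind,
and nothing forces a deadline between the two.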
I was going by your words: "it might be already too late if userspace
doesn't get enough time to clean things up".

> The consumer can choose to rely on hotplug events if it wishes, but
> the point here is that it doesn't guarantee a clean recovery in all
> cases.

Clearly I don't understand the whole picture here. No worries, never
mind.


Thanks,
pq