Re: [PATCH 0/3] drm/i915: Fix harmfull driver register/unregister assymetry

Lucas De Marchi <lucas.demarchi@xxxxxxxxx> · Thu, 13 Feb 2025 11:15:40 -0600

On Thu, Feb 13, 2025 at 03:33:08PM +0100, Krzysztof Niemiec wrote:
On 2025-02-12 at 16:35:16 GMT, Andi Shyti wrote:
Hi Krzysztof,

On Wed, Feb 12, 2025 at 12:50:17PM +0100, Krzysztof Niemiec wrote:
> On 2025-02-10 at 14:01:19 GMT, Andi Shyti wrote:
> > On Thu, Feb 06, 2025 at 07:07:38PM +0100, Janusz Krzysztofik wrote:
> > > We return immediately from i915_driver_register() if drm_dev_register()
> > > fails, skipping remaining registration steps.  However, the _unregister()
> > > counterpart called at device remove knows nothing about that skip and
> > > executes reverts for all those steps.  For that to work correctly, those
> > > revert functions must be resistant to being called even on uninitialized
> > > objects, or we must not skip their initialization.
> > >
> > > Three cases have been identified and fixes proposed.  Call traces are
> > > taken from CI results of igt@i915_driver_load@reload-with-fault-injection
> > > execution, reported to several separate Gitlab issues (links provided).
> > >
> > > Immediate return was introduced to i915_driver_register() by commit
> > > ec3e00b4ee27 ("drm/i915: stop registering if drm_dev_register() fails"),
> > > however, quite a few things have changed since then.  That's why I haven't
> > > mentioned it in a Fixes: tag to avoid it being picked up by stable, which
> > > I haven't tested.
> >
> > I'm not fully convinced about this series as I think that you are
> > fixing a subset of what needs to be handled properly. What about
> > hwmon? What about gt? what about debugfs?
> >
> > In my opinion we need to check in _unregister whether the
> > drm_dev_register has succeded and one way would be, e.g., to
> > check for the drm minor value, or even set the drm device tu NULL
> > (first things that come to my mind, maybe there are smarter ways
> > of doing it). This way we could skip some of the _unregister()
> > steps.
> >
>
> Is there any situation in which having the driver loaded after failing
> drm_dev_register() is of any use? Because if not, instead of recording
> the fact of registration failure, we can just stop the driver from
> loading altogether by checking drm_dev_register()'s return value,
> then calling _only_ the required _unregister steps as cleanup in an
> error path, and propagating the result up to driver probe. This way we
> don't need to store any additional information at all.

as long as the driver works, why pushing it to fail? Janusz's
patch is really showing the case.

Because the driver doesn't really work... it is loaded into the kernel
but you can't access it the expected way because we failed to register
it to userspace, and we also skipped a bunch of registration functions.
Is there really any benefit to keep it loaded in that state?

Also, i915 is the *only* driver in the drm directory that doesn't check
drm_dev_register()'s return value. All other drivers at least check it,
and the ones I took a closer look at (xe, nouveau, amdgpu, radeon,
virtio) propagate errors in drm_dev_register() up to their pci .probe
functions. So I think we're safe to handle it this way, and this
wouldn't force the driver to keep information on whether the
registration succeeded or not (if _unregister ever gets called, it's
because we're exiting from a properly loaded driver, so no functions
are to be skipped)

agreed. In xe we used to check for everything, then this approach of
registering stuff and not checking "because it's not critical enough"
started to show up and made it a nightmare to maintain. This leads to
half-initialized driver that may or may not have debugfs, or sysfs, or
pmu, or hwmon, or drm, etc.

However in xe we are using a different approach. Relevant patch series:
https://lore.kernel.org/intel-xe/20250212193600.475089-1-lucas.demarchi@xxxxxxxxx/

And with this other one extending devres, we can go further and move
more stuff to be handled by devres:
https://lore.kernel.org/all/20250212200542.515493-1-lucas.demarchi@xxxxxxxxx/

The .init()/.exit() calls on the module level that started with i915 is
also used in xe, but it doesn't make much sense in most of the
cases since most of the initialization is per device, not per module.

After those series, we are still not done. There's still some display
stuff to figure out on the init/shutodown sequence and the heci stuff
will probably come after the devres part.

That said, there's at least a few reasons to keep the driver attached:

	1) Survivability mode: to allow communication with mei and
	possibly unbrick a device. Right now in xe we handle this
	separately though: the driver needs to be initialized in this
	mode

	2) devcoredump: if there's a crash to be debugged with the help
	of devcoredump, we can't detach the driver otherwise the dump
	will be gone. I'm considering to allow a failure in
	xe_device_probe() not causing xe_pci_probe() to fail so the user
	can grab the devcoredump. In this case, the driver shouldn't be
	registered with drm or anything else at the end though.

Lucas De Marchi

Andi

Thanks
Krzysztof