Re: [RFC PATCH 0/3] gpu: nova-core: add basic timer subdevice implementation

Jason Gunthorpe <jgg@xxxxxxxxxx> · Thu, 27 Feb 2025 10:23:49 -0400

On Wed, Feb 26, 2025 at 05:02:23PM -0800, Greg KH wrote:
> On Wed, Feb 26, 2025 at 07:47:30PM -0400, Jason Gunthorpe wrote:
> > The way misc device works you can't unload the module until all the
> > FDs are closed and the misc code directly handles races with opening
> > new FDs while modules are unloading. It is quite a different scheme
> > than discussed in this thread.
> 
> And I would argue that it is the _right_ scheme to be following overall
> here.  Removing modules with in-flight devices/drivers is, to me, odd,
> and only good for developers doing work, not for real systems, right?

I've found these discussions tend to conflate two interrelated
issues:

1) Module lifetime and when modules are refcounted
2) How does device_driver remove() work, especially with hot unplug
   and /sys/../unbind while the module is *still loaded*.

Noting, very explicitly, that you can unbind a device_driver without
unloading the module.

#1 should be driven strictly by the needs of function pointers in
the system, i.e. things like ".owner = THIS_MODULE".

#2 is challenging when the driver has a file descriptor.

AFAIK there are only two broad choices:
 a) wait for all FDs to close in remove() (boo!)
 b) leave the FDs open but disable them and complete remove(), e.g.
    return -ENODEV to all system calls
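
To make (b) concrete, here is a minimal userspace sketch, not kernel
code. Every name in it (demo_device, demo_file, demo_remove, demo_read)
is hypothetical, and it deliberately ignores concurrency: fencing
in-flight calls against remove() is a separate problem.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model of option (b); not kernel code. */
struct demo_device {
	bool removed;	/* set once by demo_remove() */
	int value;	/* stands in for hardware state */
};

/* Per-FD state. It keeps pointing at the device after remove(). */
struct demo_file {
	struct demo_device *dev;
};

/* remove() completes immediately and leaves any open FDs as zombies. */
void demo_remove(struct demo_device *dev)
{
	dev->removed = true;
}

/* Every entry point checks the flag first and fails with -ENODEV. */
int demo_read(struct demo_file *f, int *out)
{
	if (f->dev->removed)
		return -ENODEV;
	*out = f->dev->value;
	return 0;
}
```

The point is only that remove() never waits for the FD holder: the FD
stays open, but it is useless until the application closes it.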

I think the kernel community has a strong preference for (b). RDMA
started with (a) long ago and we later converted it to (b); netdev
does (b), and so do a lot of other places, because (a) is, frankly,
awful.

Now, how does that relate to module unloading? The drivers are
already unbound because we properly support hot unplug via (b). So
when is it OK to allow a module with no bound drivers to be removed
while a zombie FD is still open?

That largely revolves around who owns the struct file_operations. For
misc_dev the driver module would own it, so it is impossible to unload
the driver module even if the device driver was hot unplugged/unbound.
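
As a rough illustration of why the owner of the file_operations pins
the module, here is a small userspace model. It is only a sketch: the
demo_* names are hypothetical, the errno value is arbitrary, and the
real mechanism is the module reference taken through the ".owner"
field when the file is opened and dropped on the final close.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a module and its reference count. */
struct demo_module {
	int refs;
	bool unloading;
};

/* Taking a reference fails once unload has begun. */
bool demo_module_get(struct demo_module *m)
{
	if (m->unloading)
		return false;
	m->refs++;
	return true;
}

void demo_module_put(struct demo_module *m)
{
	m->refs--;
}

/* open() pins the module that owns the file_operations... */
int demo_open(struct demo_module *owner)
{
	return demo_module_get(owner) ? 0 : -ENODEV;
}

/* ...and only the final close drops that pin. */
void demo_release(struct demo_module *owner)
{
	demo_module_put(owner);
}

/* Unload is refused while any FD still holds a reference. */
bool demo_module_try_unload(struct demo_module *m)
{
	if (m->refs > 0)
		return false;
	m->unloading = true;
	return true;
}
```

Note that nothing here depends on whether the device is still bound:
a zombie FD holds the reference just as firmly as a working one.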

For a subsystem, like RDMA, the subsystem can own the
file_operations. Now to allow the driver module to be unloaded we
"simply" require the subsystem to fence all driver callbacks during
device driver remove and subsystem unregister, i.e. if the subsystem
knows it can no longer call the driver then it no longer needs a
refcount on the driver module.

This fence was necessary anyhow for RDMA because drivers had the
pre-existing assumption that unregister was fencing all driver
callbacks by waiting for the FDs to close. Drivers did not handle UAF
races between something like pci_iounmap() and their concurrently
running driver callbacks.

Once the fence was built it was straightforward to also allow driver
module unload since the core code has NULL'd its copy of all the
driver function pointers during unregister.

Further, I'd argue this is the best model for subsystems to
follow. Allowing driver code to continue to run after subsystem
unregister forces the driver to deal with UAF removal races. This is
too hard for drivers to implement correctly, and prevents unloading
the driver module after the drivers have been unbound.

Why do people care? Aside from the obvious hot-unplug cases, like
physical PCI hot plug on high-availability servers, where (a) was
hated, there was a strong desire from folks running software HA
schemes to be able to upgrade the driver module with minimal
disruption. They want to leave the application running and have it
recover quickly when the FD starts returning -ENODEV, by opening a
new one while keeping most of its internal state alive.

> What is the requirement that means that you have to do this for function
> pointers? 

I'm just pointing out that function pointers are not guaranteed to be
valid forever in the Linux model. Every function pointer is somehow
being protected by a lifecycle that links back to the module
lifecycle.

Most of the time a driver author can ignore function pointer lifecycle
analysis, but not always..

Jason