On 2023-12-09 16:27, Xu Yilun wrote: >>>>> I feel the scope of the protection is unclear to me in this patch. What >>>>> data should be protected from concurrent access by this mutex? From the >>>>> code seems the racing of mgr dev should be protected but apparently it >>>>> doesn't have to. >>>> >>>> The mutex is there to ensure the lifetime of the manager device and >>>> its context (struct fpga_manager) if fpga_mgr_get() happens to run >>>> concurrently with the removal of the low-level module. >>>> >>>>> >>>>> And with this mutex, the get/put/unregister() for one mgr should be >>>>> exclusive with another mgr, but that also seems not necessary. >>>>> >>>> >>>> I decided to use a static mutex because I thought putting the mutex >>>> in the manager's context would be unsafe since its life would be tied >>>> to the manager's life. For instance, consider the following sequence: >>>> >>>> - A removes the low-level control module, and delete_module progresses >>>> up to the point when it calls the module's exit function, which in turn >>>> calls fpga_mgr_unregister(). >>>> >>>> - fpga_mgr_unregister() takes the mutex but gets descheduled before >>>> completing the unregistering of the manager device. >>>> >>>> - Meanwhile, B wants to get the manager (it is still there) and calls >>>> fpga_mgr_get(), which tries to take the mutex but gets suspended since >>>> it is held by A. >>>> >>>> - A resumes and fpga_mgr_unregister() releases the manager device and >>> >>> The lifecycle of the manager device is not entirely decided by >>> fpga_mgr_unregister(), this func just puts/decreases the device >>> refcount. >> >> Right, but here I'm considering the case where no one has previously >> taken the manager device. In that specific case, the refcount will be > > I don't think this is valid, anyone should firstly get the manager > device via get_device() then try to access its context. > >> decremented to zero, and the device (with its context) will be released. > > If no one has taken the manager device, the device & its context are > safe to be released. > >> >> However, you got me thinking about how the mutex is causing the problem >> in this case. >> >>> >>> Remember fpga_mgr_get() gets the device via >>> class_find_device()->get_device(). I assume if the valid device pointer >>> could be returned by class_find_device(), it would never be released by >>> other nice players. So I have no worry about the per manager mutex. >>> >>>> its context, including the mutex on which B is suspended. >>>> >>>> We could mitigate this specific case using mutex_trylock(). However, >>>> there will still be other problematic cases, like if fpga_mgr_get() >>>> gets suspended right before taking the mutex and then delete_module >>>> proceeds up to when fpga_mgr_unregister() frees the manager device >>>> and its context. >>>> >>>>> I think the mgr->owner & mgr->ops should be protected from concurrent >>>>> access of delete_module & fpga_mgr_get/put(), so how about: >>>>> >>>>> struct fpga_manager_ops { >>>>> struct module *owner; >>>>> ... >>>>> }; >>>>> >>>>> struct fpga_manager { >>>>> ... >>>>> struct mutex mops_lock; >>>>> const struct fpga_manager_ops *mops; >>>>> ... >>>>> }; >>>>> >>>>> >>>>> static struct fpga_manager *__fpga_mgr_get(struct device *dev) >>>>> { >>>>> struct fpga_manager *mgr; >>>>> >>>>> mgr = to_fpga_manager(dev); >>>>> >>>>> mutex_lock(&mgr->mops_lock); >>>>> >>>>> if (!mgr->mops || !try_module_get(mgr->mops->owner)) >>>>> mgr = ERR_PTR(-ENODEV); >>>>> >>>>> mutex_unlock(&mgr->mops_lock); >>>>> >>>>> return mgr; >>>>> } >>>>> >>>>> void fpga_mgr_unregister(struct fpga_manager *mgr) >>>>> { >>>>> fpga_mgr_fpga_remove(mgr); >>>>> >>>>> mutex_lock(&mgr->ops_lock); >>>>> mgr->mops = NULL; >>>>> mutex_unlock(&mgr->ops_lock); >>>>> >>>>> device_unregister(&mgr->dev); >>>>> } >>>>> >>>>> Not actually tested. >>>>> >>>> >>>> I think protecting the only the ops is not enough for the same reasons. >>>> If fpga_mgr_get() gets suspended right after class_find_device(),and >>>> meanwhile the low-level module is removed, it resumes with a reference >>>> to a manager device that no longer exists. >>>> >>>> In a certain sense, however, using a mutex seems like a mitigation >>>> that does not solve the problem at its root. I honestly still think >>>> that taking the module's refcount right when registering the manager >>>> is the only way that is both safe and robust against code changes. >>> >>> I would nak either. As mentioned above, that makes rmmod vendor module >>> impossible. Introducing another user interface to release the module's >>> refcount is also a bad idea. Who decides to take the ref, who releases >>> it. A user has no knowledge of what is happening inside and should not >>> enforce. >>> >>>> However, my proposal was rejected. >>>> >>>> So, if you prefer, I can drop the mutex entirely in the next version, >>>> and we leave the responsibility of keeping all kernel pieces to the >>> >>> No, please try to fix it. Could you please reconsider my proposal? >>> >> >> I have considered it and thought about it. However, I don't think we need a >> mutex to protect mgr->mops. This is because the low-level module's refcount has >> already been decremented when fpga_mgr_unregister() is called by the module's >> exit function. So, when we get to the point of executing fpga_mgr_unregister(), >> any concurrent call to try_module_get() will fail (if no one has taken the > > Are you still taking care of your previous finding [1]? It says: > > To be clear, you should only use try_module_get() *iff* you are 100% sure > the module already does exist ... > > IIUC, if you do nothing on fpga_mgr_unregister(), the low-level module may > just disappear and any copy of the owner pointer became invalid. Then > try_module_get() would not fail but panic. > > [1] 557aafac1153 ("kernel/module: add documentation for try_module_get()") > You are right. I'll follow your proposal for locking. I would have preferred a solution that kept the module locked until it was safe to remove it, but I cannot come up with anything reasonable at the moment. > >> module before) without the need to check mops first. >> >> If we assume (as you pointed out) that class_find_device() can be safely >> executed concurrently with device_unregister() (returning either a valid dev >> pointer or null) and, consequently, the manager context is guaranteed to exist >> after that point. Then, we should be good without the mutex if we call >> try_module_get() on a copy of the owner pointer stored in the manager context. >> >>>> user. It will still be an improvement over taking the low-level >>>> module's refcount through the parent device's driver pointer. >>>> Thanks, Marco