Hi,

On 9/6/2021 6:25 PM, Christoph Hellwig wrote:
> On Mon, Sep 06, 2021 at 06:08:54PM +0800, Hou Tao wrote:
>>>> +	if (!try_module_get(THIS_MODULE))
>>>> +		return ERR_PTR(-ENODEV);
>>> try_module_get(THIS_MODULE) is an indicator for an unsafe pattern.  If
>>> we don't already have a reference it could never close the race.
>>>
>>> Looking at the callers:
>>>
>>>  - nbd_open like all block device operations must have a reference
>>>    already.
>> Yes. nbd_open() has already taken a reference in dentry_open().
>>>  - for nbd_genl_connect I'm not an expert, but given that struct
>>>    nbd_genl_family has a module member I suspect the networkinh
>>>    code already takes a reference.
>> That was my original though, but the fact is netlink code doesn't take a module reference
>> in genl_family_rcv_msg_doit() and netlink uses genl_lock_all() to serialize between module removal
>> and nbd_connect_genl_ops calling, so I think use try_module_get() is OK here.
> How it this going to work?  If there was a race you just shortened it,
> but it can still happen before you call try_module_get.  So I think we
> need to look into how the netlink calling conventions are supposed to
> look and understand the issues there first.
> .

Let me explain first. It works because of genl_lock_all() in the
netlink code.

If the module removal happens before try_module_get() is called,
nbd_cleanup() will call genl_unregister_family() first, which then
takes genl_lock_all(). genl_lock_all() prevents the ops in
nbd_connect_genl_ops from being called, because the nbd ops are
called from genl_rcv(), which needs to acquire cb_lock first.

process A                       process B

module removal
  genl_unregister_family()
    genl_lock_all()
      down_write(&cb_lock)
                                receive a new netlink message
                                  genl_rcv()
                                    // will wait for the removal of nbd ops
                                    down_read(&cb_lock)

If nbd_alloc_config() happens before the module removal, genl_rcv()
must have acquired cb_lock & genl_mutex, so nbd_cleanup() will block
in genl_unregister_family().
When nbd_alloc_config() calls try_module_get(), it will find that the
module is dying, so nbd_genl_connect() fails.

process A                       process B

a new netlink message
  genl_rcv()
    down_read(&cb_lock)
      mutex_lock(&genl_mutex)
        nbd_genl_connect()
          nbd_alloc_config()
                                module removal
                                  genl_unregister_family()
          // module is dying, so fail
          try_module_get()
                                    genl_lock_all()
                                      // wait for the completion of nbd ops
                                      down_write(&cb_lock)

I have checked multiple genl_ops. It seems that these genl_ops don't
need try_module_get() because they don't create new objects through
genl_ops and only control existing ones. However,
genl_family_rcv_msg_dumpit() does call try_module_get(), but according
to the history (6dc878a8ca39 "netlink: add reference of module in
netlink_dump_start"), that is because inet_diag_handler_cmd() will
call __netlink_dump_start().

Regards,
Tao
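
To make the two orderings above easier to check, here is a small
userspace model of them. It is only a sketch, not kernel code: every
model_* and scenario_* name is made up for illustration, "blocking" on
cb_lock is modelled by returning false, and returning -ENOENT for a
message that arrives after the family is unregistered is an assumption
of the model.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Userspace model only: all names here are hypothetical stand-ins. */
enum mod_state { MOD_LIVE, MOD_GOING };

struct model {
    enum mod_state state;   /* what try_module_get() would look at */
    int refcnt;             /* module reference count */
    int readers;            /* cb_lock read holders (inside genl_rcv) */
    bool family_registered; /* nbd genl family still visible */
};

static void model_init(struct model *m)
{
    m->state = MOD_LIVE;
    m->refcnt = 0;
    m->readers = 0;
    m->family_registered = true;
}

/* Models try_module_get(): refuses once the module is going away. */
static bool model_try_module_get(struct model *m)
{
    if (m->state != MOD_LIVE)
        return false;
    m->refcnt++;
    return true;
}

/* Models the start of module removal: only allowed with no references. */
static bool model_start_removal(struct model *m)
{
    if (m->refcnt > 0)
        return false;
    m->state = MOD_GOING;
    return true;
}

/* Models genl_unregister_family(): it needs cb_lock for write, so it
 * cannot proceed (returns false) while a reader is inside genl_rcv. */
static bool model_unregister_family(struct model *m)
{
    if (m->readers > 0)
        return false;
    m->family_registered = false;
    return true;
}

/* Models genl_rcv() taking cb_lock for read and looking up the family. */
static int model_rcv_enter(struct model *m)
{
    if (!m->family_registered)
        return -ENOENT;
    m->readers++;
    return 0;
}

static void model_rcv_exit(struct model *m)
{
    assert(m->readers > 0);
    m->readers--;
}

/* Models nbd_genl_connect() -> nbd_alloc_config() -> try_module_get(). */
static int model_connect(struct model *m)
{
    return model_try_module_get(m) ? 0 : -ENODEV;
}

/* Ordering 1: removal wins the lock, so the op is never dispatched. */
static int scenario_removal_first(void)
{
    struct model m;

    model_init(&m);
    model_start_removal(&m);
    model_unregister_family(&m);
    return model_rcv_enter(&m);
}

/* Ordering 2: the message holds the read lock when removal starts. */
static int scenario_message_first(bool *removal_blocked)
{
    struct model m;
    int ret;

    model_init(&m);
    model_rcv_enter(&m);
    model_start_removal(&m);
    *removal_blocked = !model_unregister_family(&m);
    ret = model_connect(&m);      /* module is going, so this fails */
    model_rcv_exit(&m);
    model_unregister_family(&m);  /* now the writer can proceed */
    return ret;
}
```

In this model scenario_removal_first() never reaches model_connect(),
and scenario_message_first() fails the connect with -ENODEV while the
unregister step is held off until the reader leaves.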