From: "liujian (CE)" <liujian56@xxxxxxxxxx> Date: Thu, 7 Nov 2024 20:03:40 +0800 > >> diff --git a/net/socket.c b/net/socket.c > >> index 042451f01c65..e64a02445b1a 100644 > >> --- a/net/socket.c > >> +++ b/net/socket.c > >> @@ -1651,6 +1651,34 @@ int sock_create_kern(struct net *net, int family, int type, int protocol, struct > >> } > >> EXPORT_SYMBOL(sock_create_kern); > >> > >> +int sock_create_kern_getnet(struct net *net, int family, int type, int proto, struct socket **res) > >> +{ > >> + struct sock *sk; > >> + int ret; > >> + > >> + if (!maybe_get_net(net)) > >> + return -EINVAL; > > > > Is this really safe ? > > > > IIUC, maybe_get_net() is safe for a net only when it is fetched under > > RCU, then rcu_read_lock() prevents cleanup_net() from reusing the net > > by rcu_barrier(). > > > > Otherwise, there should be a small chance that the same slab object is > > reused for another netns between fetching the net and reaching here. > > > > svc_create_socket() is called much later after the netns is fetched, > > and _svc_xprt_create() calls try_module_get() before ->xpo_create(). > > So, it seems the path is not under RCU and maybe_get_net() must be > > called much earlier by each call site. > > > > For this reason, when I write a patch for the same issue in CIFS, > > I delayed put_net() to cifsd kthread so that the netns refcnt taken > > for each CIFS server info lives until the last __sock_create() attempt > > from cifsd. > > > > https://lore.kernel.org/linux-cifs/20241102212438.76691-1-kuniyu@xxxxxxxxxx/ > > > Okay, got it. thank you. > Looking at the nfs and nfsd processing flow, it seems that the call to > __sock_create() to create a TCP socket is always after the mount > operation get_net(). So it should be fine to use get_net() directly. Is there any chance that a concurrent unmount releases the last refcount by put_net() while another thread trying to call __sock_create() ? CIFS was the case. > So > here I'm going to change may_get_net() to get_net(), move > sock_create_kern_getnet() to the sunrpc module, and rename it to > something more appropriate. Is that okay? Could you go without adding a helper and do the conversion in sunrpc code as CIFS did ? I plan to resurrect my patch and remove such socket conversion altogether in the next cycle after the CIFS fix lands on net-next. https://lore.kernel.org/netdev/20240227011041.97375-4-kuniyu@xxxxxxxxxx/ https://github.com/q2ven/linux/commits/427_2 https://github.com/q2ven/linux/commit/2e54a8cc84f1e9ce60a0e4693c79a8e74c3dbeb9 I inspected all the callers of __sock_create() and friends, and all __sock_create() can be replaced with sock_create_kern(), so I will unexport __sock_create() and then add a new parameter hold_net to it. Then, I'll rename sock_create_kern() to sock_create_net_noref() and add a fat comment to catch in-kernel users attention so that they no longer use _kern() API blindly without care about netns reference. Also, I'll add sock_create_net() and use it for MPTCP, SMC, CIFS, (and sunrpc) etc. RDS uses maybe_net_get() but I think this is still buggy and I need to check more.