在 2024/11/12 8:54, NeilBrown 写道:
On Tue, 12 Nov 2024, Kuniyuki Iwashima wrote:
From: "NeilBrown" <neilb@xxxxxxx>
Date: Tue, 12 Nov 2024 10:52:34 +1100
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 6f272013fd9b..d4330aaadc23 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1551,6 +1551,10 @@ static struct svc_xprt *svc_create_socket(struct svc_serv *serv,
newlen = error;
if (protocol == IPPROTO_TCP) {
+ __netns_tracker_free(net, &sock->sk->ns_tracker, false);
+ sock->sk->sk_net_refcnt = 1;
+ get_net_track(net, &sock->sk->ns_tracker, GFP_KERNEL);
+ sock_inuse_add(net, 1);
This is really ugly. These internal details of the network layer have
no place in sunrpc code. There must be a better way.
I asked to do this way. I agree this way is really ugly. Similar
code exists in MPTCP, SMC, CIFS, etc, so I plan to add a new API for
this case, but this requires huge change adding a new parameter for
->create() prototype and the changes are not backportable.
https://github.com/q2ven/linux/commit/bb8b8814a73b3f50c3fef5eaf8d30d8c1df43e7b
https://github.com/q2ven/linux/commits/427_2
After my series, we can use the following but cannot backport it to
stable.
sock_create_net(net, family, type, protocol);
e.g. commit for MPTCP
https://github.com/q2ven/linux/commit/24a4647561272c1e67a685d8403e27eb863398cf
That's why I suggested to go with the ugly way and I will clean them
up in the next cycle.
So, finally the sunrpc code will be much cleaner and the netns refcnt
will be touched only in the core code.
This fact needs to be spelled out in the commit message:
This is an ugly hack which can easily be backported to earlier
kernels. A proper fix which cleans up the interfaces will follow,
but will not be so easy to backport.
or something like that.
Already added to v4, thank you.
I would still prefer if a little helper were made available so sunrpc
could just call one function rather than adding 4 cryptic lines. But I
won't argue that too strongly.
Let's just leave it at that for now. ):
Thanks,
NeilBrown
Can we pass '0' for the kern arg to __sock_create()? That should fix
the refcounting issues, but might mess up security labelling.
This should be avoided as it's confusing for BPF programs, LSMs, and
LOCKDEP.
Can we wait for something before we call put_net() to release the net.
Maybe we want to split the "kern" arg t __sock_create() and have
"kern" which affects labeling and "refnet" with affects refcounting the
net.
This is exactly what my series does, but again, it's not backport
friendly.
https://github.com/q2ven/linux/commit/413e867b4aee9e9f60f3c33fb38d2004aeb29c40
I had a quick look and very nearly every caller of __sock_create()
outside of net/core really does want refcount. Many callers of
sock_create_kern() possibly don't.
Actually, since sock_create_kern() is added, we no longer need to
export __sock_create(), so I have a patch to convert them to
sock_create_kern().
And most of TCP socket does need refcnt, but non-TCP won't.
Also, handshake one is exception, which uses TCP but only in init_net,
where we need not take care of netns refcnt.
https://github.com/q2ven/linux/commit/b56888bbbf327d57ea25a6b97275d6b9b8ad043a
So I really think this needs to be cleaned up in net/core, not in all
the different network clients in the kernel.
Yes, will be done in the next cycle.