From: Eric Dumazet <edumazet@xxxxxxxxxx> Date: Mon, 26 Feb 2024 20:40:10 +0100 > On Mon, Feb 26, 2024 at 8:22 PM Allison Henderson > <allison.henderson@xxxxxxxxxx> wrote: > > > > On Mon, 2024-02-26 at 11:14 -0800, Kuniyuki Iwashima wrote: > > > From: Kuniyuki Iwashima <kuniyu@xxxxxxxxxx> > > > Date: Fri, 23 Feb 2024 10:28:32 -0800 > > > > From: Eric Dumazet <edumazet@xxxxxxxxxx> > > > > Date: Fri, 23 Feb 2024 19:09:27 +0100 > > > > > On Fri, Feb 23, 2024 at 6:26 PM Kuniyuki Iwashima > > > > > <kuniyu@xxxxxxxxxx> wrote: > > > > > > > > > > > > syzkaller reported a warning of netns tracker [0] followed by > > > > > > KASAN > > > > > > splat [1] and another ref tracker warning [1]. > > > > > > > > > > > > syzkaller could not find a repro, but in the log, the only > > > > > > suspicious > > > > > > sequence was as follows: > > > > > > > > > > > > 18:26:22 executing program 1: > > > > > > r0 = socket$inet6_mptcp(0xa, 0x1, 0x106) > > > > > > ... > > > > > > connect$inet6(r0, &(0x7f0000000080)={0xa, 0x4001, 0x0, > > > > > > @loopback}, 0x1c) (async) > > > > > > > > > > > > The notable thing here is 0x4001 in connect(), which is > > > > > > RDS_TCP_PORT. > > > > > > > > > > > > So, the scenario would be: > > > > > > > > > > > > 1. unshare(CLONE_NEWNET) creates a per netns tcp listener in > > > > > > rds_tcp_listen_init(). > > > > > > 2. syz-executor connect()s to it and creates a reqsk. > > > > > > 3. syz-executor exit()s immediately. > > > > > > 4. netns is dismantled. [0] > > > > > > 5. reqsk timer is fired, and UAF happens while freeing > > > > > > reqsk. [1] > > > > > > 6. listener is freed after RCU grace period. [2] > > > > > > > > > > > > Basically, reqsk assumes that the listener guarantees netns > > > > > > safety > > > > > > until all reqsk timers are expired by holding the listener's > > > > > > refcount. > > > > > > However, this was not the case for kernel sockets. > > > > > > > > > > > > Commit 740ea3c4a0b2 ("tcp: Clean up kernel listener's reqsk in > > > > > > inet_twsk_purge()") fixed this issue only for per-netns ehash, > > > > > > but > > > > > > the issue still exists for the global ehash. > > > > > > > > > > > > We can apply the same fix, but this issue is specific to RDS. > > > > > > > > > > > > Instead of iterating potentially large ehash and purging reqsk > > > > > > during > > > > > > netns dismantle, let's hold netns refcount for the kernel TCP > > > > > > listener. > > > > > > > > > > > > > > > > > > Reported-by: syzkaller <syzkaller@xxxxxxxxxxxxxxxx> > > > > > > Fixes: 467fa15356ac ("RDS-TCP: Support multiple RDS-TCP listen > > > > > > endpoints, one per netns.") > > > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@xxxxxxxxxx> > > > > > > --- > > > > > > net/rds/tcp_listen.c | 5 +++++ > > > > > > 1 file changed, 5 insertions(+) > > > > > > > > > > > > diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c > > > > > > index 05008ce5c421..4f7863932df7 100644 > > > > > > --- a/net/rds/tcp_listen.c > > > > > > +++ b/net/rds/tcp_listen.c > > > > > > @@ -282,6 +282,11 @@ struct socket *rds_tcp_listen_init(struct > > > > > > net *net, bool isv6) > > > > > > goto out; > > > > > > } > > > > > > > > > > > > + __netns_tracker_free(net, &sock->sk->ns_tracker, > > > > > > false); > > > > > > + sock->sk->sk_net_refcnt = 1; > > > > > > + get_net_track(net, &sock->sk->ns_tracker, GFP_KERNEL); > > > > > > + sock_inuse_add(net, 1); > > > > > > + > > > > > > > > > > Why using sock_create_kern() then later 'convert' this kernel > > > > > socket > > > > > to a user one ? > > > > > > > > > > Would using __sock_create() avoid this ? > > > > > > > > I think yes, but LSM would see kern=0 in pre/post socket() hooks. > > > > > > > > Probably we can use __sock_create() in net-next and see if someone > > > > complains. > > > > > > I noticed the patchwork status is Changes Requested. > > > https://urldefense.com/v3/__https://patchwork.kernel.org/project/netdevbpf/list/?series=829213&state=*__;Kg!!ACWV5N9M2RV99hQ!KHKUQKUDnNCdiEcb4ZK1VBiYSitarEb-CAWeSJvaeK04fgW4cuWePg3Ac2HmIAPUHuqeCwgt466fHEKAAdfa$ > > > > > > > > > Should we use __sock_create() for RDS or add another parameter > > > to __sock_create(..., kern=true/false, netref=true/false) and > > > fix other similar uses (MPTCP, SMC, Netlink) altogether ? > > > > > > Thanks! > > > > Hi all, > > > > Thank you for looking at this. I've been doing a little investigation > > in the area to better understand the issue and this fix. While I > > understand what this patch is trying to do here, I'd like to do a > > little more digging as to why 740ea3c4a0b2 didnt work for rds, or what > > else rds may not be doing correctly that the other sockets are. I'm > > not quite sure about setting the kern parameter to 0 for socket_create. > > While it seems like it would have a similar effect, this looks > > incorrect since this is not a user space socket. > > > > I'll do a little more diging myself too. If you had another idea about > > adding parameters to __sock_create, I'd be happy to take a look. Thank > > you! > > I wonder if the following change would help ? Yes, it also fixes the issue. :) https://lore.kernel.org/netdev/20240226193857.69672-1-kuniyu@xxxxxxxxxx/ but it will trigger full ehash iteration for netns with no RDS usage (and even without TCP). So, I think __sock_create() or the netref conversion would be better. > > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c > index 9e85f2a0bddd4978b1bde6add1efc6aad351db8b..0ecc7311dc6ceedd8ada7b99b1441a562a6be4d6 > 100644 > --- a/net/ipv4/tcp_minisocks.c > +++ b/net/ipv4/tcp_minisocks.c > @@ -398,10 +398,6 @@ void tcp_twsk_purge(struct list_head > *net_exit_list, int family) > /* Even if tw_refcount == 1, we must clean up > kernel reqsk */ > > inet_twsk_purge(net->ipv4.tcp_death_row.hashinfo, family); > } else if (!purged_once) { > - /* The last refcount is decremented in > tcp_sk_exit_batch() */ > - if > (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) == 1) > - continue; > - > inet_twsk_purge(&tcp_hashinfo, family); > purged_once = true; > }