On Fri, Nov 03, 2023 at 08:38 PM -07, John Fastabend wrote: > Jakub Sitnicki wrote: >> On Fri, Oct 27, 2023 at 10:38 AM -07, Kuniyuki Iwashima wrote: >> > From: Jakub Sitnicki <jakub@xxxxxxxxxxxxxx> >> > Date: Fri, 27 Oct 2023 15:32:15 +0200 >> >> On Tue, Oct 24, 2023 at 02:39 PM -07, John Fastabend wrote: >> >> > Jakub Sitnicki wrote: >> >> >> On Mon, Oct 16, 2023 at 12:08 PM -07, John Fastabend wrote: >> >> >> > AF_UNIX sockets are a paired socket. So sending on one of the pairs >> >> >> > will lookup the paired socket as part of the send operation. It is >> >> >> > possible however to put just one of the pairs in a BPF map. This >> >> >> > currently increments the refcnt on the sock in the sockmap to >> >> >> > ensure it is not free'd by the stack before sockmap cleans up its >> >> >> > state and stops any skbs being sent/recv'd to that socket. >> >> >> > >> >> >> > But we missed a case. If the peer socket is closed it will be >> >> >> > free'd by the stack. However, the paired socket can still be >> >> >> > referenced from BPF sockmap side because we hold a reference >> >> >> > there. Then if we are sending traffic through BPF sockmap to >> >> >> > that socket it will try to dereference the free'd pair in its >> >> >> > send logic creating a use after free. And following splat, >> >> >> > >> >> >> > [59.900375] BUG: KASAN: slab-use-after-free in sk_wake_async+0x31/0x1b0 >> >> >> > [59.901211] Read of size 8 at addr ffff88811acbf060 by task kworker/1:2/954 >> >> >> > [...] >> >> >> > [59.905468] Call Trace: >> >> >> > [59.905787] <TASK> >> >> >> > [59.906066] dump_stack_lvl+0x130/0x1d0 >> >> >> > [59.908877] print_report+0x16f/0x740 >> >> >> > [59.910629] kasan_report+0x118/0x160 >> >> >> > [59.912576] sk_wake_async+0x31/0x1b0 >> >> >> > [59.913554] sock_def_readable+0x156/0x2a0 >> >> >> > [59.914060] unix_stream_sendmsg+0x3f9/0x12a0 >> >> >> > [59.916398] sock_sendmsg+0x20e/0x250 >> >> >> > [59.916854] skb_send_sock+0x236/0xac0 >> >> >> > [59.920527] sk_psock_backlog+0x287/0xaa0 >> >> >> >> >> >> Isn't the problem here that unix_stream_sendmsg doesn't grab a ref to >> >> >> peer sock? Unlike unix_dgram_sendmsg which uses the unix_peer_get >> >> >> helper. >> >> > >> >> > It does by my read. In unix_stream_connect we have, >> >> > >> >> > sock_hold(sk); >> >> > unix_peer(newsk) = sk; >> >> > newsk->sk_state = TCP_ESTABLISHED; >> >> > >> >> > where it assigns the peer sock. unix_dgram_connect() also calls >> >> > sock_hold() but through the path that does the socket lookup, such as >> >> > unix_find_other(). >> >> > >> >> > The problem I see is before the socket does the kfree on the >> >> > sock we need to be sure the backlog is canceled and the skb list >> >> > ingress_skb is purged. If we don't ensure this then the redirect >> >> > will >> >> > >> >> > My model is this, >> >> > >> >> > s1 c1 >> >> > refcnt 1 1 >> >> > connect 2 2 >> >> > psock 3 3 >> >> > send(s1) ... >> >> > close(s1) 2 1 <- close drops psock count also >> >> > close(c1) 0 0 >> >> > >> >> > The important bit here is the psock has a refcnt on the >> >> > underlying sock (psock->sk) and wont dec that until after >> >> > cancel_delayed_work_sync() completes. This ensures the >> >> > backlog wont try to sendmsg() on that sock after we free >> >> > it. We also check for SOCK_DEAD and abort to avoid sending >> >> > over a socket that has been marked DEAD. >> >> > >> >> > So... After close(s1) the only thing keeping that sock >> >> > around is c1. Then we close(c1) that call path is >> >> > >> >> > unix_release >> >> > close() >> >> > unix_release_sock() >> >> > skpair = unix_peer(sk); >> >> > ... >> >> > sock_put(skpair); <- trouble here >> >> > >> >> > The release will call sock_put() on the pair socket and >> >> > dec it to 0 where it gets free'd through sk_free(). But >> >> > now the trouble is we haven't waited for cancel_delayed_work_sync() >> >> > on the c1 socket yet so backlog can still run. When it does >> >> > run it may try to send a pkg over socket s1. OK right up until >> >> > the sendmsg(s1, ...) does a peer lookup and derefs the peer >> >> > socket. The peer socket was free'd earlier so use after free. >> >> > >> >> > The question I had originally was this is odd, we are allowing >> >> > a sendmsg(s1) over a socket while its in unix_release(). We >> >> > used to take the sock lock from the backlog that was dropped >> >> > in the name of performance, but it creates these races. >> >> > >> >> > Other fixes I considered. First adding sock lock back to >> >> > backlog. But that punishes the UDP and TCP cases that don't >> >> > have this problem. Set the SOCK_DEAD flag earlier or check >> >> > later but this just makes the race smaller doesn't really >> >> > eliminate it. >> >> > >> >> > So this patch is what I came up with. >> >> >> >> What I was getting at is that we could make it safe to call sendmsg on a >> >> unix stream sock while its peer is being release. And not just for >> >> sockmap. I expect io_uring might have the same problem. But I didn't >> >> actually check yet. >> >> >> >> For that we could keep a ref to peer for the duration of sendmsg call, >> >> like unix dgram does. Then 'other' doesn't become a stale pointer before >> >> we're done with it. >> >> >> >> Bumping ref count on each sendmsg is not free, but maybe its >> >> acceptable. Unix dgram sockets live with it. >> > >> > The reason why only dgram sk needs sock_hold() for each sendmsg() is >> > that dgram sk can send data without connect(). unix_peer_get() in >> > unix_dgram_sendmsg() is to reuse the same code when peer is not set. >> > >> > unix_stream_sendmsg() already holds a necessary refcnt and need not >> > use sock_hold() there. >> > >> > The user who touches a peer without lookup must hold refcnt beforehand. > > Hi, we probably do need to get a fix for this. syzkaller hit it again > and anyways it likely will crash some real systems if folks try to use > it with enough systems. > >> >> Right. And this ownership scheme works well for unix stream because, as >> John nicely explained, we serialize close() and sendmsg() ops with sock >> lock. >> >> Here, however, we have a case of deferred work which holds a ref to sock >> but does not grab the sock lock. While it is doing its thing, the sk >> gets closed/released and we drop the skpair ref. And bam, UAF. >> >> If it wasn't for the reference cycle between sk and skpair, we could >> defer the skpari ref drop until sk_destruct callback. But we can't. >> >> If grabbing a ref on skpair on each sendmsg turns out to be not a viable >> option, I didn't run any benchmarks so can't say what's the penatly >> like, the next best thing is RCU. > > I think it really would be best to stay out of the presumably hotpath > here if we can. > > I had considered marking the socket SOCK_RCU_FREE which should then > wait an rcu grace period. But then it wasn't clear to me that would > completely solve the race. The psock still needs to do the > cancel_delayed_work_sync() and this is also done from the rcu call > back on the psock. Withtout the extra reference iirc the concern > was we would basically have two rcu callbacks running that have > an ordering requirement which I don't think is ensured. > I've checked io_uring and it keeps a ref to the related file for the lifetime the I/O request. So I think we are good there and it's only sockmap that is "bleeding". I agree it would be best not to hamper unix_stream sendmsg performance. At the same time, I think we can easily make the extra ref grab in unix_stream_sendmsg optional, keeping the fast path fast (modulo 2x a branch op). I realize RCU would be a bigger, riskier overhaul. Perhaps not worth it, if there are no benefits for "the regular" unix_stream users. If we can avoid managing more state in psock, that would be my preference. But I don't want to block any fixes that users are waiting for.