Re: [Patch bpf-next] tcp: fix sock skb accounting in tcp_read_skb()

Cong Wang <xiyou.wangcong@xxxxxxxxx> · Tue, 26 Jul 2022 09:14:35 -0700

On Mon, Jul 25, 2022 at 10:45:56AM +0200, Eric Dumazet wrote:
> On Sun, Jul 24, 2022 at 7:59 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
> >
> > On Mon, Jul 18, 2022 at 09:26:29AM +0200, Eric Dumazet wrote:
> > > On Sun, Jul 17, 2022 at 6:56 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
> > > >
> > > > On Tue, Jul 12, 2022 at 03:20:37PM +0200, Eric Dumazet wrote:
> > > > > On Sun, Jul 10, 2022 at 12:20 AM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
> > > > > >
> > > > > > From: Cong Wang <cong.wang@xxxxxxxxxxxxx>
> > > > > >
> > > > > > Before commit 965b57b469a5 ("net: Introduce a new proto_ops
> > > > > > ->read_skb()"), skb was not dequeued from receive queue hence
> > > > > > when we close TCP socket skb can be just flushed synchronously.
> > > > > >
> > > > > > After this commit, we have to uncharge skb immediately after being
> > > > > > dequeued, otherwise it is still charged in the original sock. And we
> > > > > > still need to retain skb->sk, as eBPF programs may extract sock
> > > > > > information from skb->sk. Therefore, we have to call
> > > > > > skb_set_owner_sk_safe() here.
> > > > > >
> > > > > > Fixes: 965b57b469a5 ("net: Introduce a new proto_ops ->read_skb()")
> > > > > > Reported-and-tested-by: syzbot+a0e6f8738b58f7654417@xxxxxxxxxxxxxxxxxxxxxxxxx
> > > > > > Tested-by: Stanislav Fomichev <sdf@xxxxxxxxxx>
> > > > > > Cc: Eric Dumazet <edumazet@xxxxxxxxxx>
> > > > > > Cc: John Fastabend <john.fastabend@xxxxxxxxx>
> > > > > > Signed-off-by: Cong Wang <cong.wang@xxxxxxxxxxxxx>
> > > > > > ---
> > > > > >  net/ipv4/tcp.c | 1 +
> > > > > >  1 file changed, 1 insertion(+)
> > > > > >
> > > > > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > > > > index 9d2fd3ced21b..c6b1effb2afd 100644
> > > > > > --- a/net/ipv4/tcp.c
> > > > > > +++ b/net/ipv4/tcp.c
> > > > > > @@ -1749,6 +1749,7 @@ int tcp_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
> > > > > >                 int used;
> > > > > >
> > > > > >                 __skb_unlink(skb, &sk->sk_receive_queue);
> > > > > > +               WARN_ON(!skb_set_owner_sk_safe(skb, sk));
> > > > > >                 used = recv_actor(sk, skb);
> > > > > >                 if (used <= 0) {
> > > > > >                         if (!copied)
> > > > > > --
> > > > > > 2.34.1
> > > > > >
> > > > >
> > > > > I am reading tcp_read_skb(),it seems to have other bugs.
> > > > > I wonder why syzbot has not caught up yet.
> > > >
> > > > As you mentioned this here I assume you suggest I should fix all bugs in
> > > > one patch? (I am fine either way in this case, only slightly prefer to fix
> > > > one bug in each patch for readability.)
> > >
> > > I only wonder that after fixing all bugs, we might end up with  tcp_read_sk()
> > > being a clone of tcp_read_sock() :/
> >
> > I really wish so, but unfortunately the partial read looks impossible to
> > merged with full skb read.
> >
> >
> > >
> > > syzbot has a relevant report:
> > >
> >
> > Please provide a reproducer if you have, I don't see this report
> > anywhere (except here of course).
> 
> No repro yet.
> 
> I usually hold syzbot report until they have enough signal (repro, and
> eventually bisection) in them to be considered.
> 
> >
> > > ------------[ cut here ]------------
> > > cleanup rbuf bug: copied 301B4426 seq 301B4426 rcvnxt 302142E8
> > > WARNING: CPU: 0 PID: 3744 at net/ipv4/tcp.c:1567
> > > tcp_cleanup_rbuf+0x11d/0x5b0 net/ipv4/tcp.c:1567
> > > Modules linked in:
> > > CPU: 0 PID: 3744 Comm: kworker/0:7 Not tainted
> > > 5.19.0-rc5-syzkaller-01095-gedb2c3476db9 #0
> > > Hardware name: Google Google Compute Engine/Google Compute Engine,
> > > BIOS Google 06/29/2022
> > > Workqueue: events nsim_dev_trap_report_work
> > > RIP: 0010:tcp_cleanup_rbuf+0x11d/0x5b0 net/ipv4/tcp.c:1567
> > > Code: ea 03 0f b6 04 02 84 c0 74 08 3c 03 0f 8e d7 03 00 00 8b 8d 38
> > > 08 00 00 44 89 e2 44 89 f6 48 c7 c7 20 82 df 8a e8 94 d8 58 01 <0f> 0b
> > > e8 cc 84 a0 f9 48 8d bd 88 07 00 00 48 b8 00 00 00 00 00 fc
> > > RSP: 0018:ffffc90000007700 EFLAGS: 00010282
> > > RAX: 0000000000000000 RBX: 000000000004fef7 RCX: 0000000000000000
> > > RDX: ffff8880201abb00 RSI: ffffffff8160d438 RDI: fffff52000000ed2
> > > RBP: ffff888016819800 R08: 0000000000000005 R09: 0000000000000000
> > > R10: 0000000000000103 R11: 0000000000000001 R12: 00000000301b4426
> > > R13: 0000000000000000 R14: 00000000301b4426 R15: 00000000301b4426
> > > FS: 0000000000000000(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
> > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > CR2: 0000000020c55000 CR3: 0000000075009000 CR4: 00000000003506f0
> > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > Call Trace:
> > > <IRQ>
> > > tcp_read_skb+0x29e/0x430 net/ipv4/tcp.c:1775
> > > sk_psock_verdict_data_ready+0x9d/0xc0 net/core/skmsg.c:1209
> > > tcp_data_ready+0x106/0x520 net/ipv4/tcp_input.c:4985
> > > tcp_data_queue+0x1bb2/0x4c60 net/ipv4/tcp_input.c:5059
> > > tcp_rcv_established+0x82f/0x20e0 net/ipv4/tcp_input.c:5984
> > > tcp_v4_do_rcv+0x66c/0x9b0 net/ipv4/tcp_ipv4.c:1661
> > > tcp_v4_rcv+0x343b/0x3940 net/ipv4/tcp_ipv4.c:2078
> > > ip_protocol_deliver_rcu+0xa3/0x7c0 net/ipv4/ip_input.c:205
> > > ip_local_deliver_finish+0x2e8/0x4c0 net/ipv4/ip_input.c:233
> > > NF_HOOK include/linux/netfilter.h:307 [inline]
> > > NF_HOOK include/linux/netfilter.h:301 [inline]
> > > ip_local_deliver+0x1aa/0x200 net/ipv4/ip_input.c:254
> > > dst_input include/net/dst.h:461 [inline]
> > > ip_rcv_finish+0x1cb/0x2f0 net/ipv4/ip_input.c:437
> > > NF_HOOK include/linux/netfilter.h:307 [inline]
> > > NF_HOOK include/linux/netfilter.h:301 [inline]
> > > ip_rcv+0xaa/0xd0 net/ipv4/ip_input.c:557
> > > __netif_receive_skb_one_core+0x114/0x180 net/core/dev.c:5480
> > > __netif_receive_skb+0x24/0x1b0 net/core/dev.c:5594
> > > process_backlog+0x3a0/0x7c0 net/core/dev.c:5922
> > > __napi_poll+0xb3/0x6e0 net/core/dev.c:6506
> > > napi_poll net/core/dev.c:6573 [inline]
> > > net_rx_action+0x9c1/0xd90 net/core/dev.c:6684
> > > __do_softirq+0x29b/0x9c2 kernel/softirq.c:571
> > > do_softirq.part.0+0xde/0x130 kernel/softirq.c:472
> > > </IRQ>
> > > <TASK>
> > > do_softirq kernel/softirq.c:464 [inline]
> > > __local_bh_enable_ip+0x102/0x120 kernel/softirq.c:396
> > > spin_unlock_bh include/linux/spinlock.h:394 [inline]
> > > nsim_dev_trap_report drivers/net/netdevsim/dev.c:814 [inline]
> > > nsim_dev_trap_report_work+0x84d/0xba0 drivers/net/netdevsi
> m/dev.c:840
> > > process_one_work+0x996/0x1610 kernel/workqueue.c:2289
> > > worker_thread+0x665/0x1080 kernel/workqueue.c:2436
> > > kthread+0x2e9/0x3a0 kernel/kthread.c:376
> > > ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:302
> > > </TASK>------------[ cut here ]------------
> > > cleanup rbuf bug: copied 301B4426 seq 301B4426 rcvnxt 302142E8
> > > WARNING: CPU: 0 PID: 3744 at net/ipv4/tcp.c:1567
> > > tcp_cleanup_rbuf+0x11d/0x5b0 net/ipv4/tcp.c:1567
> > > Modules linked in:
> > > CPU: 0 PID: 3744 Comm: kworker/0:7 Not tainted
> > > 5.19.0-rc5-syzkaller-01095-gedb2c3476db9 #0
> > > Hardware name: Google Google Compute Engine/Google Compute Engine,
> > > BIOS Google 06/29/2022
> > > Workqueue: events nsim_dev_trap_report_work
> > > RIP: 0010:tcp_cleanup_rbuf+0x11d/0x5b0 net/ipv4/tcp.c:1567
> > > Code: ea 03 0f b6 04 02 84 c0 74 08 3c 03 0f 8e d7 03 00 00 8b 8d 38
> > > 08 00 00 44 89 e2 44 89 f6 48 c7 c7 20 82 df 8a e8 94 d8 58 01 <0f> 0b
> > > e8 cc 84 a0 f9 48 8d bd 88 07 00 00 48 b8 00 00 00 00 00 fc
> > > RSP: 0018:ffffc90000007700 EFLAGS: 00010282
> > > RAX: 0000000000000000 RBX: 000000000004fef7 RCX: 0000000000000000
> > > RDX: ffff8880201abb00 RSI: ffffffff8160d438 RDI: fffff52000000ed2
> > > RBP: ffff888016819800 R08: 0000000000000005 R09: 0000000000000000
> > > R10: 0000000000000103 R11: 0000000000000001 R12: 00000000301b4426
> > > R13: 0000000000000000 R14: 00000000301b4426 R15: 00000000301b4426
> > > FS: 0000000000000000(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
> > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > CR2: 0000000020c55000 CR3: 0000000075009000 CR4: 00000000003506f0
> > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > Call Trace:
> > > <IRQ>
> > > tcp_read_skb+0x29e/0x430 net/ipv4/tcp.c:1775
> > > sk_psock_verdict_data_ready+0x9d/0xc0 net/core/skmsg.c:1209
> > > tcp_data_ready+0x106/0x520 net/ipv4/tcp_input.c:4985
> > > tcp_data_queue+0x1bb2/0x4c60 net/ipv4/tcp_input.c:5059
> > > tcp_rcv_established+0x82f/0x20e0 net/ipv4/tcp_input.c:5984
> > > tcp_v4_do_rcv+0x66c/0x9b0 net/ipv4/tcp_ipv4.c:1661
> > > tcp_v4_rcv+0x343b/0x3940 net/ipv4/tcp_ipv4.c:2078
> > > ip_protocol_deliver_rcu+0xa3/0x7c0 net/ipv4/ip_input.c:205
> > > ip_local_deliver_finish+0x2e8/0x4c0 net/ipv4/ip_input.c:233
> > > NF_HOOK include/linux/netfilter.h:307 [inline]
> > > NF_HOOK include/linux/netfilter.h:301 [inline]
> > > ip_local_deliver+0x1aa/0x200 net/ipv4/ip_input.c:254
> > > dst_input include/net/dst.h:461 [inline]
> > > ip_rcv_finish+0x1cb/0x2f0 net/ipv4/ip_input.c:437
> > > NF_HOOK include/linux/netfilter.h:307 [inline]
> > > NF_HOOK include/linux/netfilter.h:301 [inline]
> > > ip_rcv+0xaa/0xd0 net/ipv4/ip_input.c:557
> > > __netif_receive_skb_one_core+0x114/0x180 net/core/dev.c:5480
> > > __netif_receive_skb+0x24/0x1b0 net/core/dev.c:5594
> > > process_backlog+0x3a0/0x7c0 net/core/dev.c:5922
> > > __napi_poll+0xb3/0x6e0 net/core/dev.c:6506
> > > napi_poll net/core/dev.c:6573 [inline]
> > > net_rx_action+0x9c1/0xd90 net/core/dev.c:6684
> > > __do_softirq+0x29b/0x9c2 kernel/softirq.c:571
> > > do_softirq.part.0+0xde/0x130 kernel/softirq.c:472
> > > </IRQ>
> > > <TASK>
> > > do_softirq kernel/softirq.c:464 [inline]
> > > __local_bh_enable_ip+0x102/0x120 kernel/softirq.c:396
> > > spin_unlock_bh include/linux/spinlock.h:394 [inline]
> > > nsim_dev_trap_report drivers/net/netdevsim/dev.c:814 [inline]
> > > nsim_dev_trap_report_work+0x84d/0xba0 drivers/net/netdevsim/dev.c:840
> > > process_one_work+0x996/0x1610 kernel/workqueue.c:2289
> > > worker_thread+0x665/0x1080 kernel/workqueue.c:2436
> > > kthread+0x2e9/0x3a0 kernel/kthread.c:376
> > > ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:302
> > > </TASK>
> > >
> > > >
> > > > >
> > > > > It ignores the offset value from tcp_recv_skb(), this looks wrong to me.
> > > > > The reason tcp_read_sock() passes a @len parameter is that is it not
> > > > > skb->len, but (skb->len - offset)
> > > >
> > > > If I understand tcp_recv_skb() correctly it only returns an offset for a
> > > > partial read of an skb. IOW, if we always read an entire skb at a time,
> > > > offset makes no sense here, right?
> > > >
> > > > >
> > > > > Also if recv_actor(sk, skb) returns 0, we probably still need to
> > > > > advance tp->copied_seq,
> > > > > for instance if skb had a pure FIN (and thus skb->len == 0), since you
> > > > > removed the skb from sk_receive_queue ?
> > > >
> > > > Doesn't the following code handle this case?
> > > >
> > > >         if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
> > > >                 consume_skb(skb);
> > > >                 ++seq;
> > > >                 break;
> > > >         }
> > > >
> > > > which is copied from tcp_read_sock()...
> > >
> > > I do not think this is enough, because you can break from the loop
> > > before doing this  check about TCPHDR_FIN,
> >
> > The logic is same for tcp_read_sock(). :)
> >
> >
> > > after skb has been unlinked from sk_receive_queue. TCP won't be able
> > > to catch FIN.
> >
> > So TCP does not process FIN before ->sk_data_ready()? I wonder how FIN
> > (at least a pure FIN as you mentioned above) ends up being queued in
> > ->sk_receive_queue anyway?
> 
> That's how TCP stores packets, including the final FIN.
> 
> (Because this skb can also contain payload anyway)

If TCP really wants to queue a FIN with skb->len==0, then we have to
adjust the return value for recv_actor(), because we currently use 0 as
an error too (meaning no data is consumed):

        if (sk_psock_verdict_apply(psock, skb, ret) < 0)
                len = 0;  // here!
out:
        rcu_read_unlock();
        return len;

BTW, what is wrong if we simply drop it before queueing to
sk_receive_queue in TCP? Is it there just for collapse?

Thanks.