Re: [PATCH v8 bpf-next 01/10] bpf: tcp: Avoid taking fast sock lock in iterator

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




> On May 18, 2023, at 10:45 PM, Yonghong Song <yhs@xxxxxxxx> wrote:
> 
> 
> 
> On 5/18/23 4:04 PM, Aditi Ghag wrote:
>>> On May 18, 2023, at 11:57 AM, Yonghong Song <yhs@xxxxxxxx> wrote:
>>> 
>>> 
>>> 
>>> On 5/17/23 10:54 AM, Aditi Ghag wrote:
>>>> This is a preparatory commit to replace `lock_sock_fast` with
>>>> `lock_sock`, and faciliate BPF programs executed from the iterator to be
>>> 
>>> facilitate
>> Yikes! I'll fix the typos.
>>> 
>>>> able to destroy TCP listening sockets using the bpf_sock_destroy kfunc
>>>> (implemened in follow-up commits).  Previously, BPF TCP iterator was
>>> 
>>> implemented
>>> 
>>>> acquiring the sock lock with BH disabled. This led to scenarios where
>>>> the sockets hash table bucket lock can be acquired with BH enabled in
>>>> some context versus disabled in other, and  caused a
>>>> <softirq-safe> -> <softirq-unsafe> dependency with the sock lock.
>>> 
>>> For 'and caused a <softirq-safe> -> <softirq-unsafe> dependency with
>>> the sock lock', maybe can be rephrased like below:
>>> 
>>> In such situation, kernel issued an warning since it thinks that
>>> in the BH enabled path the same bucket lock *might* be acquired again
>>> in the softirq context (BH disabled), which will lead to a potential
>>> dead lock.
>> Hi Yonghong, I thought about this a bit more before posting the patch series. My reading of the splat was that the deadlock scenario that was specifically highlighted was with respect to the pair of bucket and sock locks.
>> As for the bucket lock, there might a deadlock scenario with a set of events such as:
>>  1) Bucket lock is acquired with BH enabled in a process context  (e.g., __inet_hash below called from process context)
>>  2) the process context was interrupted before the lock was released by...
>>  3) Another context with BH disabled (e.g., sock_destroy called for listening socket from iterator) tries to acquire the same lock again
>> contd...
>>> 
>>> Note that in this particular triggering, the local_bh_disable()
>>> happens in process context, so the warning is a false alarm.
>> Right, the sock_destroy program is run from the iterator as opposed to BPF programs being executed on kernel events. However, my understanding is that because local_bh_disable is called, the lock dep validator treats it as an irq-safe context.
>> Based on my reading of the documentation [1], there can be a deadlock issue with the bucket lock by itself (ref: Single-lock state rules), or deadlock issue with the pair of bucket and sock locks that the splat highlights (ref: Multi-lock dependency rules).
>> Let me know if this makes sense, or I'm missing something.
>> [1] https://www.kernel.org/doc/Documentation/locking/lockdep-design.rst
>> -------- Posting a snippet of the splat again just for reference  --------
>> [    1.544410] which would create a new lock dependency:
>> [    1.544797]  (slock-AF_INET6){+.-.}-{2:2} -> (&h->lhash2[i].lock){+.+.}-{2:2}
>> [    1.545361]
>> [    1.545361] but this new dependency connects a SOFTIRQ-irq-safe lock:
>> [    1.545961]  (slock-AF_INET6){+.-.}-{2:2}
>> [    1.545963]
>> [    1.545963] ... which became SOFTIRQ-irq-safe at:
>> [    1.546745]   lock_acquire+0xcd/0x330
>> [    1.547033]   _raw_spin_lock+0x33/0x40
>> [    1.547325]   sk_clone_lock+0x146/0x520
>> [    1.547623]   inet_csk_clone_lock+0x1b/0x110
>> [    1.547960]   tcp_create_openreq_child+0x22/0x3f0
>> [    1.548327]   tcp_v6_syn_recv_sock+0x96/0x940
>> [    1.548672]   tcp_check_req+0x13f/0x640
>> [    1.548977]   tcp_v6_rcv+0xa62/0xe80
>> [    1.549258]   ip6_protocol_deliver_rcu+0x78/0x590
>> [    1.549621]   ip6_input_finish+0x72/0x140
>> [    1.549931]   __netif_receive_skb_one_core+0x63/0xa0
>> [    1.550313]   process_backlog+0x79/0x260
>> [    1.550619]   __napi_poll.constprop.0+0x27/0x170
>> [    1.550976]   net_rx_action+0x14a/0x2a0
>> [    1.551272]   __do_softirq+0x165/0x510
>> [    1.551563]   do_softirq+0xcd/0x100
>> [    1.551836]   __local_bh_enable_ip+0xcc/0xf0
>> [    1.552168]   ip6_finish_output2+0x27c/0xb10
>> [    1.552500]   ip6_finish_output+0x274/0x510
>> [    1.552823]   ip6_xmit+0x319/0x9b0
>> [    1.553095]   inet6_csk_xmit+0x12b/0x2b0
>> [    1.553398]   __tcp_transmit_skb+0x543/0xc30
>> [    1.553731]   tcp_rcv_state_process+0x362/0x1180
>> [    1.554088]   tcp_v6_do_rcv+0x10f/0x540
>> [    1.554387]   __release_sock+0x6a/0xe0
>> [    1.554679]   release_sock+0x2f/0xb0
>> [    1.554957]   __inet_stream_connect+0x1ac/0x3a0
>> [    1.555308]   inet_stream_connect+0x3b/0x60
>> [    1.555632]   __sys_connect+0xa3/0xd0
>> [    1.555915]   __x64_sys_connect+0x18/0x20
>> [    1.556222]   do_syscall_64+0x3c/0x90
>> [    1.556510]   entry_SYSCALL_64_after_hwframe+0x72/0xdc
>> [    1.556909]
>> [    1.556909] to a SOFTIRQ-irq-unsafe lock:
>> [    1.557326]  (&h->lhash2[i].lock){+.+.}-{2:2}
>> [    1.557329]
>> [    1.557329] ... which became SOFTIRQ-irq-unsafe at:
>> [    1.558148] ...
>> [    1.558149]   lock_acquire+0xcd/0x330
>> [    1.558579]   _raw_spin_lock+0x33/0x40
>> [    1.558874]   __inet_hash+0x4b/0x210
>> [    1.559154]   inet_csk_listen_start+0xe6/0x100
>> [    1.559503]   inet_listen+0x95/0x1d0
>> [    1.559782]   __sys_listen+0x69/0xb0
>> [    1.560063]   __x64_sys_listen+0x14/0x20
>> [    1.560365]   do_syscall_64+0x3c/0x90
>> [    1.560652]   entry_SYSCALL_64_after_hwframe+0x72/0xdc
>> [    1.561052] other info that might help us debug this:
>> [    1.561052]
>> [    1.561658]  Possible interrupt unsafe locking scenario:
>> [    1.561658]
>> [    1.562171]        CPU0                    CPU1
>> [    1.562521]        ----                    ----
>> [    1.562870]   lock(&h->lhash2[i].lock);
>> [    1.563167]                                local_irq_disable();
>> [    1.563618]                                lock(slock-AF_INET6);
>> [    1.564076]                                lock(&h->lhash2[i].lock);
>> [    1.564558]   <Interrupt>
>> [    1.564763]     lock(slock-AF_INET6);
>> [    1.565053]
>> [    1.565053]  *** DEADLOCK ***
>>> 
>>>> Here is a snippet of annotated stack trace that motivated this change:
>>>> ```
>>>> Possible interrupt unsafe locking scenario:
>>>>       CPU0                    CPU1
>>>>       ----                    ----
>>>>  lock(&h->lhash2[i].lock);
>>>>                               local_irq_disable();
>>>>                               lock(slock-AF_INET6);
>>>>                               lock(&h->lhash2[i].lock);
>>> 
>>>                                 local_bh_disable();
>>>                                 lock(&h->lhash2[i].lock);
>>> 
>>>>  <Interrupt>
>>>>    lock(slock-AF_INET6);
>>>> *** DEADLOCK ***
>>> Replace the above with below:
>>> 
>>> kernel imagined possible scenario:
>>>    local_bh_disable();  /* Possible softirq */
>>>    lock(&h->lhash2[i].lock);
>>> *** Potential Deadlock ***
> 
> I applied the whole patch set (v8) locally except Patch 1, running
> selftest and hit a different warning:
> 
>  ...
>  [  168.780736] watchdog: BUG: soft lockup - CPU#0 stuck for 86s! [test_progs:2331]
>  [  168.781385] Modules linked in: bpf_testmod(OE)
>  [  168.781751] CPU: 0 PID: 2331 Comm: test_progs Tainted: G OEL     6.4.0-rc1-00336-g2fa1ad98e6e8-dirty #258
>  [  168.782570] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
>  [  168.783457] RIP: 0010:queued_spin_lock_slowpath+0xd8/0x500
>  [  168.783904] Code: 00 ff ff 44 23 33 45 09 fe 48 8d 7c 24 04 e8 0f 54 ca fe 44 89 74 24 04 41 81 fe 00 01 00 00 73 28 45 85 f6 75 04 eb 0f f3 90 <48> 89 df e8 d0 51 ca fe 80 3b 00 6
>  [  168.785503] RSP: 0018:ffff888114557c00 EFLAGS: 00000202
>  [  168.785964] RAX: 0000000000000000 RBX: ffff888113fd0a98 RCX: ffffffff827c84a0
>  [  168.786576] RDX: dffffc0000000000 RSI: dffffc0000000000 RDI: ffff888113fd0a98
>  [  168.787192] RBP: 0000000000000000 R08: dffffc0000000000 R09: ffffed10227fa154
>  [  168.787837] R10: 0000000000000000 R11: dffffc0000000001 R12: ffff888113fd0a98
>  [  168.788505] R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000000
>  [  168.789119] FS:  00007fc34f075500(0000) GS:ffff8881f7400000(0000) knlGS:0000000000000000
>  [  168.789804] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  [  168.790306] CR2: 0000559382dd9057 CR3: 0000000102ab8004 CR4: 0000000000370ef0
>  [  168.790976] Call Trace:
>  [  168.791218]  <TASK>
>  [  168.791434]  _raw_spin_lock+0x84/0x90
>  [  168.791785]  tcp_abort+0x13c/0x1f0
>  [  168.792125]  bpf_prog_88539c5453a9dd47_iter_tcp6_client+0x82/0x89
>  [  168.792701]  bpf_iter_run_prog+0x1aa/0x2c0
>  [  168.793098]  ? preempt_count_sub+0x1c/0xd0
>  [  168.793488]  ? from_kuid_munged+0x1c8/0x210
>  [  168.793886]  bpf_iter_tcp_seq_show+0x14e/0x1b0
>  [  168.794326]  bpf_seq_read+0x36c/0x6a0
>  [  168.794686]  vfs_read+0x11b/0x440
>  [  168.795024]  ksys_read+0x81/0xe0
>  [  168.795341]  do_syscall_64+0x41/0x90
>  [  168.795689]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>  [  168.796172] RIP: 0033:0x7fc34f25479c
>  [  168.796514] Code: ec 28 48 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 c9 fc ff ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 34 44 89 c7 8
>  [  168.798197] RSP: 002b:00007fffc299b5a0 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
>  [  168.798891] RAX: ffffffffffffffda RBX: 0000559382dc77f0 RCX: 00007fc34f25479c
>  [  168.799552] RDX: 0000000000000032 RSI: 00007fffc299b640 RDI: 0000000000000019
>  [  168.800213] RBP: 00007fffc299b690 R08: 0000000000000000 R09: 00007fffc299b4a7
>  [  168.800868] R10: 0000000000000000 R11: 0000000000000246 R12: 0000559382b2bf70
>  [  168.801530] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>  [  168.802196]  </TASK>
> The lockup seems true since no further progress of selftest since the above error/warning. So we hit a real deadlock here.
> 
> I did some analysis, the following is what could be happened:
>   bpf_iter_tcp_seq_show
>     lock_sock_fast
>       __lock_sock_fast
>         spin_lock_bh(&sk->sk_lock.slock);
>   ...
>   tcp_abort
>     local_bh_disable();
>     spin_lock(&((sk)->sk_lock.slock)); // from bh_lock_sock(sk)
> 
> So we have deadlock here for the sock.
> With Patch 1, we use 'lock_sock', sock lock is not held, so there is no dead lock.
> static inline void lock_sock(struct sock *sk)
> {
>        lock_sock_nested(sk, 0);
> }
> void lock_sock_nested(struct sock *sk, int subclass)
> {
>        /* The sk_lock has mutex_lock() semantics here. */
>        mutex_acquire(&sk->sk_lock.dep_map, subclass, 0, _RET_IP_);
> 
>        might_sleep();
>        spin_lock_bh(&sk->sk_lock.slock);
>        if (sock_owned_by_user_nocheck(sk))
>                __lock_sock(sk);
>        sk->sk_lock.owned = 1;
>        spin_unlock_bh(&sk->sk_lock.slock);
> }
> EXPORT_SYMBOL(lock_sock_nested);
> void __lock_sock(struct sock *sk)
>        __releases(&sk->sk_lock.slock)
>        __acquires(&sk->sk_lock.slock)
> {
>        DEFINE_WAIT(wait);
>        for (;;) {
>                prepare_to_wait_exclusive(&sk->sk_lock.wq, &wait,
>                                        TASK_UNINTERRUPTIBLE);
>                spin_unlock_bh(&sk->sk_lock.slock);
>                schedule();
>                spin_lock_bh(&sk->sk_lock.slock);
>                if (!sock_owned_by_user(sk))
>                        break;
>        }
>        finish_wait(&sk->sk_lock.wq, &wait);
> }
> 
> The current stack trace and analysis likely from some of
> previous versions of patch.
> 

The current stack trace is for the iter_tcp6_server test specifically. As the commit message suggests, the potential deadlock warning was triggered for the case when TCP listening sockets are getting destroyed, which is what the test involves. You should see the current stack trace when running only that particular test without patch 1 (which is how I encountered the issue when I introduced that test in one of the middle versions of the patch series). 
Thanks for the additional pair of eyes on the stack trace analysis! 

So looks like this patch ended up resolving the real deadlock issue as well. 

> I suggest to rerun based on the latest patch set, collect
> the warning message and resubmit Patch 1.
> 
>>>> process context:
>>>> lock_acquire+0xcd/0x330
>>>> _raw_spin_lock+0x33/0x40
>>>> ------> Acquire (bucket) lhash2.lock with BH enabled
>>>> __inet_hash+0x4b/0x210
>>>> inet_csk_listen_start+0xe6/0x100
>>>> inet_listen+0x95/0x1d0
>>>> __sys_listen+0x69/0xb0
>>>> __x64_sys_listen+0x14/0x20
>>>> do_syscall_64+0x3c/0x90
>>>> entry_SYSCALL_64_after_hwframe+0x72/0xdc
>>>> bpf_sock_destroy run from iterator in interrupt context:
>>>> lock_acquire+0xcd/0x330
>>>> _raw_spin_lock+0x33/0x40
>>>> ------> Acquire (bucket) lhash2.lock with BH disabled
>>>> inet_unhash+0x9a/0x110
>>>> tcp_set_state+0x6a/0x210
>>>> tcp_abort+0x10d/0x200
>>>> bpf_prog_6793c5ca50c43c0d_iter_tcp6_server+0xa4/0xa9
>>>> bpf_iter_run_prog+0x1ff/0x340
>>>> ------> lock_sock_fast that acquires sock lock with BH disabled
>>>> bpf_iter_tcp_seq_show+0xca/0x190
>>>> bpf_seq_read+0x177/0x450
>>>> ```
>>>> Acked-by: Yonghong Song <yhs@xxxxxxxx>
>>>> Acked-by: Stanislav Fomichev <sdf@xxxxxxxxxx>
>>>> Signed-off-by: Aditi Ghag <aditi.ghag@xxxxxxxxxxxxx>
>>>> ---
>>>>  net/ipv4/tcp_ipv4.c | 5 ++---
>>>>  1 file changed, 2 insertions(+), 3 deletions(-)
>>>> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
>>>> index ea370afa70ed..f2d370a9450f 100644
>>>> --- a/net/ipv4/tcp_ipv4.c
>>>> +++ b/net/ipv4/tcp_ipv4.c
>>>> @@ -2962,7 +2962,6 @@ static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v)
>>>>  	struct bpf_iter_meta meta;
>>>>  	struct bpf_prog *prog;
>>>>  	struct sock *sk = v;
>>>> -	bool slow;
>>>>  	uid_t uid;
>>>>  	int ret;
>>>>  @@ -2970,7 +2969,7 @@ static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v)
>>>>  		return 0;
>>>>    	if (sk_fullsock(sk))
>>>> -		slow = lock_sock_fast(sk);
>>>> +		lock_sock(sk);
>>>>    	if (unlikely(sk_unhashed(sk))) {
>>>>  		ret = SEQ_SKIP;
>>>> @@ -2994,7 +2993,7 @@ static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v)
>>>>    unlock:
>>>>  	if (sk_fullsock(sk))
>>>> -		unlock_sock_fast(sk, slow);
>>>> +		release_sock(sk);
>>>>  	return ret;
>>>>    }






[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux