Re: [PATCH v8 bpf-next 01/10] bpf: tcp: Avoid taking fast sock lock in iterator

Aditi Ghag <aditi.ghag@xxxxxxxxxxxxx> · Thu, 18 May 2023 16:04:53 -0700

> On May 18, 2023, at 11:57 AM, Yonghong Song <yhs@xxxxxxxx> wrote:
> 
> 
> 
> On 5/17/23 10:54 AM, Aditi Ghag wrote:
>> This is a preparatory commit to replace `lock_sock_fast` with
>> `lock_sock`, and faciliate BPF programs executed from the iterator to be
> 
> facilitate

Yikes! I'll fix the typos. 

> 
>> able to destroy TCP listening sockets using the bpf_sock_destroy kfunc
>> (implemened in follow-up commits).  Previously, BPF TCP iterator was
> 
> implemented
> 
>> acquiring the sock lock with BH disabled. This led to scenarios where
>> the sockets hash table bucket lock can be acquired with BH enabled in
>> some context versus disabled in other, and  caused a
>> <softirq-safe> -> <softirq-unsafe> dependency with the sock lock.
> 
> For 'and caused a <softirq-safe> -> <softirq-unsafe> dependency with
> the sock lock', maybe can be rephrased like below:
> 
> In such situation, kernel issued an warning since it thinks that
> in the BH enabled path the same bucket lock *might* be acquired again
> in the softirq context (BH disabled), which will lead to a potential
> dead lock.

Hi Yonghong, I thought about this a bit more before posting the patch series. My reading of the splat was that the deadlock scenario that was specifically highlighted was with respect to the pair of bucket and sock locks. 

As for the bucket lock, there might a deadlock scenario with a set of events such as:
 1) Bucket lock is acquired with BH enabled in a process context  (e.g., __inet_hash below called from process context)
 2) the process context was interrupted before the lock was released by...
 3) Another context with BH disabled (e.g., sock_destroy called for listening socket from iterator) tries to acquire the same lock again

contd...

> 
> Note that in this particular triggering, the local_bh_disable()
> happens in process context, so the warning is a false alarm.

Right, the sock_destroy program is run from the iterator as opposed to BPF programs being executed on kernel events. However, my understanding is that because local_bh_disable is called, the lock dep validator treats it as an irq-safe context.

Based on my reading of the documentation [1], there can be a deadlock issue with the bucket lock by itself (ref: Single-lock state rules), or deadlock issue with the pair of bucket and sock locks that the splat highlights (ref: Multi-lock dependency rules). 

Let me know if this makes sense, or I'm missing something.  

[1] https://www.kernel.org/doc/Documentation/locking/lockdep-design.rst

-------- Posting a snippet of the splat again just for reference  --------

[    1.544410] which would create a new lock dependency:
[    1.544797]  (slock-AF_INET6){+.-.}-{2:2} -> (&h->lhash2[i].lock){+.+.}-{2:2}
[    1.545361]
[    1.545361] but this new dependency connects a SOFTIRQ-irq-safe lock:
[    1.545961]  (slock-AF_INET6){+.-.}-{2:2}
[    1.545963]
[    1.545963] ... which became SOFTIRQ-irq-safe at:
[    1.546745]   lock_acquire+0xcd/0x330
[    1.547033]   _raw_spin_lock+0x33/0x40
[    1.547325]   sk_clone_lock+0x146/0x520
[    1.547623]   inet_csk_clone_lock+0x1b/0x110
[    1.547960]   tcp_create_openreq_child+0x22/0x3f0
[    1.548327]   tcp_v6_syn_recv_sock+0x96/0x940
[    1.548672]   tcp_check_req+0x13f/0x640
[    1.548977]   tcp_v6_rcv+0xa62/0xe80
[    1.549258]   ip6_protocol_deliver_rcu+0x78/0x590
[    1.549621]   ip6_input_finish+0x72/0x140
[    1.549931]   __netif_receive_skb_one_core+0x63/0xa0
[    1.550313]   process_backlog+0x79/0x260
[    1.550619]   __napi_poll.constprop.0+0x27/0x170
[    1.550976]   net_rx_action+0x14a/0x2a0
[    1.551272]   __do_softirq+0x165/0x510
[    1.551563]   do_softirq+0xcd/0x100
[    1.551836]   __local_bh_enable_ip+0xcc/0xf0
[    1.552168]   ip6_finish_output2+0x27c/0xb10
[    1.552500]   ip6_finish_output+0x274/0x510
[    1.552823]   ip6_xmit+0x319/0x9b0
[    1.553095]   inet6_csk_xmit+0x12b/0x2b0
[    1.553398]   __tcp_transmit_skb+0x543/0xc30
[    1.553731]   tcp_rcv_state_process+0x362/0x1180
[    1.554088]   tcp_v6_do_rcv+0x10f/0x540
[    1.554387]   __release_sock+0x6a/0xe0
[    1.554679]   release_sock+0x2f/0xb0
[    1.554957]   __inet_stream_connect+0x1ac/0x3a0
[    1.555308]   inet_stream_connect+0x3b/0x60
[    1.555632]   __sys_connect+0xa3/0xd0
[    1.555915]   __x64_sys_connect+0x18/0x20
[    1.556222]   do_syscall_64+0x3c/0x90
[    1.556510]   entry_SYSCALL_64_after_hwframe+0x72/0xdc
[    1.556909]
[    1.556909] to a SOFTIRQ-irq-unsafe lock:
[    1.557326]  (&h->lhash2[i].lock){+.+.}-{2:2}
[    1.557329]
[    1.557329] ... which became SOFTIRQ-irq-unsafe at:
[    1.558148] ...
[    1.558149]   lock_acquire+0xcd/0x330
[    1.558579]   _raw_spin_lock+0x33/0x40
[    1.558874]   __inet_hash+0x4b/0x210
[    1.559154]   inet_csk_listen_start+0xe6/0x100
[    1.559503]   inet_listen+0x95/0x1d0
[    1.559782]   __sys_listen+0x69/0xb0
[    1.560063]   __x64_sys_listen+0x14/0x20
[    1.560365]   do_syscall_64+0x3c/0x90
[    1.560652]   entry_SYSCALL_64_after_hwframe+0x72/0xdc
[    1.561052] other info that might help us debug this:
[    1.561052]

[    1.561658]  Possible interrupt unsafe locking scenario:
[    1.561658]
[    1.562171]        CPU0                    CPU1
[    1.562521]        ----                    ----
[    1.562870]   lock(&h->lhash2[i].lock);
[    1.563167]                                local_irq_disable();
[    1.563618]                                lock(slock-AF_INET6);
[    1.564076]                                lock(&h->lhash2[i].lock);
[    1.564558]   <Interrupt>
[    1.564763]     lock(slock-AF_INET6);
[    1.565053]
[    1.565053]  *** DEADLOCK ***

> 
>> Here is a snippet of annotated stack trace that motivated this change:
>> ```
>> Possible interrupt unsafe locking scenario:
>>       CPU0                    CPU1
>>       ----                    ----
>>  lock(&h->lhash2[i].lock);
>>                               local_irq_disable();
>>                               lock(slock-AF_INET6);
>>                               lock(&h->lhash2[i].lock);
> 
>                                 local_bh_disable();
>                                 lock(&h->lhash2[i].lock);
> 
>>  <Interrupt>
>>    lock(slock-AF_INET6);
>> *** DEADLOCK ***
> Replace the above with below:
> 
> kernel imagined possible scenario:
>    local_bh_disable();  /* Possible softirq */
>    lock(&h->lhash2[i].lock);
> *** Potential Deadlock ***
>> process context:
>> lock_acquire+0xcd/0x330
>> _raw_spin_lock+0x33/0x40
>> ------> Acquire (bucket) lhash2.lock with BH enabled
>> __inet_hash+0x4b/0x210
>> inet_csk_listen_start+0xe6/0x100
>> inet_listen+0x95/0x1d0
>> __sys_listen+0x69/0xb0
>> __x64_sys_listen+0x14/0x20
>> do_syscall_64+0x3c/0x90
>> entry_SYSCALL_64_after_hwframe+0x72/0xdc
>> bpf_sock_destroy run from iterator in interrupt context:
>> lock_acquire+0xcd/0x330
>> _raw_spin_lock+0x33/0x40
>> ------> Acquire (bucket) lhash2.lock with BH disabled
>> inet_unhash+0x9a/0x110
>> tcp_set_state+0x6a/0x210
>> tcp_abort+0x10d/0x200
>> bpf_prog_6793c5ca50c43c0d_iter_tcp6_server+0xa4/0xa9
>> bpf_iter_run_prog+0x1ff/0x340
>> ------> lock_sock_fast that acquires sock lock with BH disabled
>> bpf_iter_tcp_seq_show+0xca/0x190
>> bpf_seq_read+0x177/0x450
>> ```
>> Acked-by: Yonghong Song <yhs@xxxxxxxx>
>> Acked-by: Stanislav Fomichev <sdf@xxxxxxxxxx>
>> Signed-off-by: Aditi Ghag <aditi.ghag@xxxxxxxxxxxxx>
>> ---
>>  net/ipv4/tcp_ipv4.c | 5 ++---
>>  1 file changed, 2 insertions(+), 3 deletions(-)
>> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
>> index ea370afa70ed..f2d370a9450f 100644
>> --- a/net/ipv4/tcp_ipv4.c
>> +++ b/net/ipv4/tcp_ipv4.c
>> @@ -2962,7 +2962,6 @@ static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v)
>>  	struct bpf_iter_meta meta;
>>  	struct bpf_prog *prog;
>>  	struct sock *sk = v;
>> -	bool slow;
>>  	uid_t uid;
>>  	int ret;
>>  @@ -2970,7 +2969,7 @@ static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v)
>>  		return 0;
>>    	if (sk_fullsock(sk))
>> -		slow = lock_sock_fast(sk);
>> +		lock_sock(sk);
>>    	if (unlikely(sk_unhashed(sk))) {
>>  		ret = SEQ_SKIP;
>> @@ -2994,7 +2993,7 @@ static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v)
>>    unlock:
>>  	if (sk_fullsock(sk))
>> -		unlock_sock_fast(sk, slow);
>> +		release_sock(sk);
>>  	return ret;
>>    }