Re: [PATCH 1/2] bpf: Add socket destroy capability

Aditi Ghag <aditi.ghag@xxxxxxxxxxxxx> · Thu, 23 Feb 2023 14:05:29 -0800

> On Dec 21, 2022, at 9:08 PM, Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote:
> 
> On 12/16/22 5:57 PM, Aditi Ghag wrote:
>> The socket destroy helper is used to
>> forcefully terminate sockets from certain
>> BPF contexts. We plan to use the capability
>> in Cilium to force client sockets to reconnect
>> when their remote load-balancing backends are
>> deleted. The other use case is on-the-fly
>> policy enforcement where existing socket
>> connections prevented by policies need to
>> be terminated.
>> The helper is currently exposed to iterator
>> type BPF programs where users can filter,
>> and terminate a set of sockets.
>> Sockets are destroyed asynchronously using
>> the work queue infrastructure. This allows
>> for current the locking semantics within
>> socket destroy handlers, as BPF iterators
>> invoking the helper acquire *sock* locks.
>> This also allows the helper to be invoked
>> from non-sleepable contexts.
>> The other approach to skip acquiring locks
>> by passing an argument to the `diag_destroy`
>> handler didn't work out well for UDP, as
>> the UDP abort function internally invokes
>> another function that ends up acquiring
>> *sock* lock.
>> While there are sleepable BPF iterators,
>> these are limited to only certain map types.
> 
> bpf-iter program can be sleepable and non sleepable. Both sleepable and non sleepable tcp/unix bpf-iter programs have been able to call bpf_setsockopt() synchronously. bpf_setsockopt() also requires the sock lock to be held first. The situation on calling '.diag_destroy' from bpf-iter should not be much different from calling bpf_setsockopt(). From a quick look at tcp_abort and udp_abort, I don't see they might sleep also and you may want to double check. Even '.diag_destroy' was only usable in sleepable 'bpf-iter' because it might sleep, the common bpf map types are already available to the sleepable programs.
> 
> At the kernel side, the tcp and unix iter acquire the lock_sock() first (eg. in bpf_iter_tcp_seq_show()) before calling the bpf-iter prog . At the kernel setsockopt code (eg. do_ip_setsockopt()), it uses sockopt_lock_sock() and avoids doing the lock if it has already been guaranteed by the bpf running context.
> 
> For udp, I don't see how the udp_abort acquires the sock lock differently from tcp_abort.  I assume the actual problem seen in udp_abort is related to the '->unhash()' part which acquires the udp_table's bucket lock.  This is a problem for udp bpf-iter only because the udp bpf-iter did not release the udp_table's bucket lock before calling the bpf prog.  The tcp (and unix) bpf-iter releases the bucket lock first before calling the bpf prog. This was done explicitly to allow acquiring the sock lock before calling the bpf prog because otherwise it will have a lock ordering issue. Hence, for this reason, bpf_setsockopt() is only available to tcp and unix bpf-iter but not udp bpf-iter. The udp-iter needs to do similar change like the tcp and unix iter (https://lore.kernel.org/bpf/20210701200535.1033513-1-kafai@xxxxxx/): batch, release the bucket's lock, lock the sock, and then call bpf prog.  This will allow udp-iter to call bpf_setsockopt() like its tcp and unix counterpart.  That will also allow udp bpf-iter prog to directly do '.diag_destroy'.

Pushed v2 patch series that implements the missing batching support for UDP iterators. Sorry for the delay. 
Thanks for the well-documented batching code in TCP!