Re: [PATCH 1/2] bpf: Add socket destroy capability

Martin KaFai Lau <martin.lau@xxxxxxxxx> · Wed, 21 Dec 2022 21:08:53 -0800

On 12/16/22 5:57 PM, Aditi Ghag wrote:
The socket destroy helper is used to
forcefully terminate sockets from certain
BPF contexts. We plan to use the capability
in Cilium to force client sockets to reconnect
when their remote load-balancing backends are
deleted. The other use case is on-the-fly
policy enforcement where existing socket
connections prevented by policies need to
be terminated.

The helper is currently exposed to iterator
type BPF programs where users can filter,
and terminate a set of sockets.

Sockets are destroyed asynchronously using
the work queue infrastructure. This allows
for current the locking semantics within
socket destroy handlers, as BPF iterators
invoking the helper acquire *sock* locks.
This also allows the helper to be invoked
from non-sleepable contexts.
The other approach to skip acquiring locks
by passing an argument to the `diag_destroy`
handler didn't work out well for UDP, as
the UDP abort function internally invokes
another function that ends up acquiring
*sock* lock.
While there are sleepable BPF iterators,
these are limited to only certain map types.

bpf-iter program can be sleepable and non sleepable. Both sleepable and non 
sleepable tcp/unix bpf-iter programs have been able to call bpf_setsockopt() 
synchronously. bpf_setsockopt() also requires the sock lock to be held first. 
The situation on calling '.diag_destroy' from bpf-iter should not be much 
different from calling bpf_setsockopt(). From a quick look at tcp_abort and 
udp_abort, I don't see they might sleep also and you may want to double check. 
Even '.diag_destroy' was only usable in sleepable 'bpf-iter' because it might 
sleep, the common bpf map types are already available to the sleepable programs.

At the kernel side, the tcp and unix iter acquire the lock_sock() first (eg. in 
bpf_iter_tcp_seq_show()) before calling the bpf-iter prog . At the kernel 
setsockopt code (eg. do_ip_setsockopt()), it uses sockopt_lock_sock() and avoids 
doing the lock if it has already been guaranteed by the bpf running context.

For udp, I don't see how the udp_abort acquires the sock lock differently from 
tcp_abort.  I assume the actual problem seen in udp_abort is related to the 
'->unhash()' part which acquires the udp_table's bucket lock.  This is a problem 
for udp bpf-iter only because the udp bpf-iter did not release the udp_table's 
bucket lock before calling the bpf prog.  The tcp (and unix) bpf-iter releases 
the bucket lock first before calling the bpf prog. This was done explicitly to 
allow acquiring the sock lock before calling the bpf prog because otherwise it 
will have a lock ordering issue. Hence, for this reason, bpf_setsockopt() is 
only available to tcp and unix bpf-iter but not udp bpf-iter. The udp-iter needs 
to do similar change like the tcp and unix iter 
(https://lore.kernel.org/bpf/20210701200535.1033513-1-kafai@xxxxxx/): batch, 
release the bucket's lock, lock the sock, and then call bpf prog.  This will 
allow udp-iter to call bpf_setsockopt() like its tcp and unix counterpart.  That 
will also allow udp bpf-iter prog to directly do '.diag_destroy'.