On 12/22/22 6:08 AM, Martin KaFai Lau wrote:
On 12/16/22 5:57 PM, Aditi Ghag wrote:
The socket destroy helper is used to
forcefully terminate sockets from certain
BPF contexts. We plan to use the capability
in Cilium to force client sockets to reconnect
when their remote load-balancing backends are
deleted. The other use case is on-the-fly
policy enforcement where existing socket
connections prevented by policies need to
be terminated.
The helper is currently exposed to iterator
type BPF programs where users can filter,
and terminate a set of sockets.
Sockets are destroyed asynchronously using
the work queue infrastructure. This allows
for current the locking semantics within
socket destroy handlers, as BPF iterators
invoking the helper acquire *sock* locks.
This also allows the helper to be invoked
from non-sleepable contexts.
The other approach to skip acquiring locks
by passing an argument to the `diag_destroy`
handler didn't work out well for UDP, as
the UDP abort function internally invokes
another function that ends up acquiring
*sock* lock.
While there are sleepable BPF iterators,
these are limited to only certain map types.
bpf-iter program can be sleepable and non sleepable. Both sleepable and non sleepable tcp/unix bpf-iter programs have been able to call bpf_setsockopt() synchronously. bpf_setsockopt() also requires the sock lock to be held first. The situation on calling '.diag_destroy' from bpf-iter should not be much different from calling bpf_setsockopt(). From a quick look at tcp_abort and udp_abort, I don't see they might sleep also and you may want to double check. Even '.diag_destroy' was only usable in sleepable 'bpf-iter' because it might sleep, the common bpf map types are already available to the sleepable programs.
At the kernel side, the tcp and unix iter acquire the lock_sock() first (eg. in bpf_iter_tcp_seq_show()) before calling the bpf-iter prog . At the kernel setsockopt code (eg. do_ip_setsockopt()), it uses sockopt_lock_sock() and avoids doing the lock if it has already been guaranteed by the bpf running context.
For udp, I don't see how the udp_abort acquires the sock lock differently from tcp_abort. I assume the actual problem seen in udp_abort is related to the '->unhash()' part which acquires the udp_table's bucket lock. This is a problem for udp bpf-iter only because the udp bpf-iter did not release the udp_table's bucket lock before calling the bpf prog. The tcp (and unix) bpf-iter releases the bucket lock first before calling the bpf prog. This was done explicitly to allow acquiring the sock lock before calling the bpf prog because otherwise it will have a lock ordering issue. Hence, for this reason, bpf_setsockopt() is only available to tcp and unix bpf-iter but not udp bpf-iter. The udp-iter needs to do similar change like the tcp and unix iter (https://lore.kernel.org/bpf/20210701200535.1033513-1-kafai@xxxxxx/): batch, release the bucket's lock, lock the sock, and then call bpf prog. This will allow udp-iter to call bpf_setsockopt() like its tcp and unix counterpart.
That will also allow udp bpf-iter prog to directly do '.diag_destroy'.
Agree, that sounds very reasonable way forward and would also enable bpf_setsockopt() support
for UDP sks. Wrt the workqueue I don't really like that there's a loophole with the -EBUSY if
the callback didn't trigger yet, while we need to be able to destroy multiple sks in one iteration
when they match the sk cookie previously recorded in the map.. so agree, this needs to happen in
a synchronous manner.
Thanks,
Daniel