Re: [RFC] Socket termination for policy enforcement and load-balancing

Martin KaFai Lau <martin.lau@xxxxxxxxx> · Wed, 7 Sep 2022 19:26:34 -0700

On 9/4/22 2:24 PM, Kumar Kartikeya Dwivedi wrote:
On Sun, 4 Sept 2022 at 20:55, Aditi Ghag <aditivghag@xxxxxxxxx> wrote:

On Wed, Aug 31, 2022 at 4:02 PM Martin KaFai Lau <kafai@xxxxxx> wrote:

On Wed, Aug 31, 2022 at 09:37:41AM -0700, Aditi Ghag wrote:
- Use BPF (sockets) iterator to identify sockets connected to a
deleted backend. The BPF (sockets) iterator is network namespace aware
so we'll either need to enter every possible container network
namespace to identify the affected connections, or adapt the iterator
to be without netns checks [3]. This was discussed with my colleague
Daniel Borkmann based on the feedback he shared from the LSFMMBPF
conference discussions.

Being able to iterate all sockets across different netns will
be useful.

It should be doable to ignore the netns check.  For udp, a quick
thought is to have another iter target. eg. "udp_all_netns".
From the sk, the bpf prog should be able to learn the netns and
the bpf prog can filter the netns by itself.

The TCP side is going to have an 'optional' per netns ehash table [0] soon,
not lhash2 (listening hash) though.  Ideally, the same bpf
all-netns iter interface should work similarly for both the udp and
tcp cases.  Thus, both should be considered and made to work at the same time.

For udp, something more useful than plain udp_abort() could potentially
be done.  eg. directly connect to another backend (by bpf kfunc?).
There may be some details in socket locking...etc but should
be doable and the bpf-iter program could be sleepable also.

This won't be effective for connected udp though, will it? Interesting thought
around using bpf kfunc

hmm... why won't the bpf-prog doing the udp re-connect() be effective?
I suspect we are talking about different things.

Regardless, for tcp, I think the user space needs to handle the tcp
aborted-error by redoing the connect().  Thus, let's stay with
{tcp,udp}_abort() for now.  Try to expose {tcp,udp}_abort() as a kfunc
instead of a new bpf_helper.
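
To make the user-space side of this concrete, here is a minimal sketch of
redoing the connect() after an abort.  It is only an illustration under
assumptions: needs_reconnect() and send_with_reconnect() are hypothetical
helper names, and treating ECONNABORTED, ECONNRESET and EPIPE as "the
connection was aborted" is an assumption about what the application will
observe, not something this thread specifies.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: these are the errno values an application sees once the
 * kernel has aborted its TCP socket. */
bool needs_reconnect(int err)
{
	return err == ECONNABORTED || err == ECONNRESET || err == EPIPE;
}

/* Hypothetical helper: send buf on *fd.  If the send fails because the
 * socket was aborted, open a new socket to the same peer address, swap it
 * into *fd, and retry the send once. */
ssize_t send_with_reconnect(int *fd, const struct sockaddr *peer,
			    socklen_t peer_len, const void *buf, size_t len)
{
	ssize_t n;
	int nfd, saved;

	assert(fd != NULL && peer != NULL);

	n = send(*fd, buf, len, MSG_NOSIGNAL);
	if (n >= 0 || !needs_reconnect(errno))
		return n;

	nfd = socket(peer->sa_family, SOCK_STREAM, 0);
	if (nfd < 0)
		return -1;

	if (connect(nfd, peer, peer_len) < 0) {
		saved = errno;	/* close() may overwrite errno */
		close(nfd);
		errno = saved;
		return -1;
	}

	close(*fd);
	*fd = nfd;
	return send(*fd, buf, len, MSG_NOSIGNAL);
}
```

In the load-balancing case, the idea is that the new connect() to the same
service address goes through the cgroup connect hook again, which is where
a different backend would get picked.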

fwiw, we are iterating the tcp socket to retire some older
bpf-tcp-cc (congestion control) on the long-lived connections
by bpf_setsockopt(TCP_CONGESTION).

Also, potentially, instead of iterating all,
a more selective case can be done by
bpf_prog_test_run()+bpf_sk_lookup_*()+udp_abort().

Can you elaborate more on the more selective iterator approach?

If the 4-tuple (src/dst ip/port) is known, bpf_sk_lookup_*() can look up
a sk from the tcp_hashinfo or udp_table.  bpf_sk_lookup_*() also takes a
netns_id argument.  However, yeah, it will still go back to the need to
get all netns, so may not work well in the RFC case here.
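
As a rough illustration of the "4-tuple is known" part, the sketch below
shows how user space could prepare an IPv4 tuple in network byte order
before handing it to a bpf prog that does the lookup.  struct v4_tuple is a
local stand-in that mirrors the ipv4 member of struct bpf_sock_tuple, and
fill_v4_tuple() is a hypothetical helper; a real bpf prog would use struct
bpf_sock_tuple itself.  How the tuple reaches the prog (e.g. as
bpf_prog_test_run() input) is left out here.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the ipv4 member of struct bpf_sock_tuple.
 * All fields are kept in network byte order. */
struct v4_tuple {
	uint32_t saddr;
	uint32_t daddr;
	uint16_t sport;
	uint16_t dport;
};

/* Hypothetical helper: parse dotted-quad addresses and host-order ports
 * into a lookup tuple.  Returns 0 on success, -1 on a bad address. */
int fill_v4_tuple(struct v4_tuple *t, const char *src, uint16_t sport,
		  const char *dst, uint16_t dport)
{
	memset(t, 0, sizeof(*t));

	if (inet_pton(AF_INET, src, &t->saddr) != 1)
		return -1;
	if (inet_pton(AF_INET, dst, &t->daddr) != 1)
		return -1;

	t->sport = htons(sport);
	t->dport = htons(dport);
	return 0;
}
```

The netns still has to be supplied separately through the netns_id
argument, which is the limitation mentioned above.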

On a similar note, are there better alternatives to the
sockets iterator approach?
Since we have BPF programs executed on cgroup BPF hooks (e.g.,
connect), we already know what client
sockets are connected to a backend. Can we somehow store these socket
pointers in a regular BPF map, and
when a backend is deleted, use a regular map iterator to invoke
sock_destroy() for these sockets? Does anyone have
experience using the "typed pointer support in BPF maps" APIs [0]?

I am not very familiar with how socket lifetime is managed. It may not
be possible if the lifetime is managed by RCU only,
or due to other limitations.
Martin will probably be able to comment more on that.

sk is the usual refcnt+rcu_reader pattern.  afaik, the requirement here is
that the sk should be removed from the map when there is a tcp_close() or
udp_lib_close().  There is sock_map and sock_hash to store sk as the
map-value.  iirc the sk will be automatically removed from the map
during tcp_close() and udp_lib_close().  The sock_map and sock_hash also
have a bpf iterator, meaning a bpf-iter-prog can iterate the sock_map and
sock_hash and then do abort on each sk, so it looks like most of the
pieces are in place.