Hey Florian,

Thanks for taking a look at it.

On Tue, Jun 18, 2019 at 03:52 PM CEST, Florian Westphal wrote:
> Jakub Sitnicki <jakub@xxxxxxxxxxxxxx> wrote:
>>  - XDP programs using bpf_sk_lookup helpers, like load balancers, can't
>>    find the listening socket to check for SYN cookies with TPROXY redirect.
>
> Sorry for the question, but where is the problem?
> (i.e., is it with TPROXY or bpf side)?

The way I see it is that the problem is that we have mappings for
steering traffic into sockets split between two places: (1) the socket
lookup tables, and (2) the TPROXY rules.

BPF programs that need to check if there is a socket the packet is
destined for have access to the socket lookup tables, via the mentioned
bpf_sk_lookup helper, but are unaware of TPROXY redirects.

For TCP we're able to look up from BPF if there are any established,
request, and "normal" listening sockets. The listening sockets that
receive connections via TPROXY are invisible to BPF progs.

Why are we interested in finding all listening sockets? To check if any
of them had SYN queue overflow recently and if we should honor SYN
cookies.

>>  - TPROXY takes a reference to the listening socket on dispatch, which
>>    raises lock contention concerns.
>
> FWIW this could be avoided in similar way as to how we handle noref dsts.
>
> The only reason we need to take the reference at the moment is because
> once skb leaves the TPROXY target hook, the skb could leave rcu
> protection as well at some point (nfqueue for example).
>
> Maybe its even enough to move reference taking to nfqueue and add
> 'noref' destructor, that would allow skb_steal_sock to propagate
> refcounted value in __inet_lookup_skb.
>
> So, at least for this part I don't see a technical reason why this
> has to grab a reference for listener socket.

That's helpful, thanks! We rely on TPROXY, so I would like to help with
that. Let me see if I can get time to work on it.

>
>>  - Traffic steering configuration is split over several iptables rules, at
>>    least one per service, which makes configuration changes error prone.
>
> Could you perhaps sketch an example ruleset (doesn't have to be complete
> nor parse-able by itpables-restore), I would just like to understand if
> there is any room for improvement on netfilter/iptables/nft side.

Happy to. Scenarios that are of interest to us:

1) Port sharing, while accepting on a set of subnets
   (same as the demo BPF prog from cover letter)

    ip route add local 192.0.2.0/24 dev lo
    ip route add local 198.51.100.0/24 dev lo
    ip route add local 203.0.113.0/24 dev lo

    ipset create net1 hash:net
    ipset create net2 hash:net
    ipset create net3 hash:net

    ipset add net1 192.0.2.0/24
    ipset add net2 198.51.100.0/24
    ipset add net3 203.0.113.0/24

    iptables -t mangle -A PREROUTING -p tcp --dport 80 \
             -m set --match-set net1 dst \
             -j TPROXY --on-ip=127.0.0.1 --on-port=81

    iptables -t mangle -A PREROUTING -p tcp --dport 80 \
             -m set --match-set net2 dst \
             -j TPROXY --on-ip=127.0.0.1 --on-port=82

2) Receiving on all ports, except some

    iptables -t mangle -A PREROUTING -p tcp --dport 80 \
             -m set --match-set net3 dst \
             -j TPROXY --on-ip=127.0.0.1 --on-port=81

    iptables -t mangle -A PREROUTING -p tcp \
             -m set --match-set net3 dst \
             -j TPROXY --on-ip=127.0.0.1 --on-port=1

3) Steering part of the traffic to a different socket (A/B testing)

    iptables -t mangle -A PREROUTING -p tcp \
             -m set --match-set net3 dst \
             -m statistic --mode random --probability 0.01 \
             -j TPROXY --on-ip=127.0.0.1 --on-port=2

    iptables -t mangle -A PREROUTING -p tcp \
             -m set --match-set net3 dst \
             -j TPROXY --on-ip=127.0.0.1 --on-port=1

One thing I haven't touched on in the cover letter is that to use TPROXY
you need to set IP_TRANSPARENT on the listening socket. This requires
that your process runs with CAP_NET_RAW or CAP_NET_ADMIN, or that you get
the socket from systemd.

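To make the privileged step concrete, here is a minimal sketch of opening
such a listener, assuming Linux and Python. The function name
open_transparent_listener and its defaults are hypothetical, made up for
illustration, and are not part of the patch set. The numeric fallback 19
is the Linux value of IP_TRANSPARENT, used only if the Python build does
not export the constant.

```python
import socket

# Linux value of IP_TRANSPARENT; fall back to the raw number in case this
# Python build does not export the constant (assumption: Linux only).
IP_TRANSPARENT = getattr(socket, "IP_TRANSPARENT", 19)


def open_transparent_listener(addr="127.0.0.1", port=0, backlog=128):
    """Return a listening TCP socket with IP_TRANSPARENT set.

    Hypothetical helper for illustration. Raises PermissionError when the
    process has neither CAP_NET_RAW nor CAP_NET_ADMIN.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        # The privileged step: this call fails with EPERM for an
        # unprivileged process.
        sock.setsockopt(socket.IPPROTO_IP, IP_TRANSPARENT, 1)
        sock.bind((addr, port))
        sock.listen(backlog)
    except OSError:
        # Don't leak the descriptor if any step fails.
        sock.close()
        raise
    return sock
```

Everything after the second setsockopt() call is ordinary listener setup.
Only that one call needs the capability, which is why the service would
have to be locked down around it.
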
I haven't been able to explain why the process needs to be privileged to
receive traffic steered with TPROXY, but it turns out to be a pain point
too. We end up having to lock down the service to ensure it doesn't use
the elevated privileges for anything other than setting IP_TRANSPARENT.

Thanks,
Jakub