On Sun, 26 Apr 2020 at 18:33, Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote: > > On Fri, Apr 24, 2020 at 07:55:55PM +0100, Lorenz Bauer wrote: > > cls_redirect is a TC clsact based replacement for the glb-redirect iptables > > module available at [1]. It enables what GitHub calls "second chance" > > flows [2], similarly proposed by the Beamer paper [3]. In contrast to > > glb-redirect, it also supports migrating UDP flows as long as connected > > sockets are used. cls_redirect is in production at Cloudflare, as part of > > our own L4 load balancer. > > This is awesome. Thank you for contributing! > Applied. > > > There are two major distinctions from glb-redirect: first, cls_redirect > > relies on receiving encapsulated packets directly from a router. This is > > because we don't have access to the neighbour tables from BPF, yet. See > > Let's make it available then :) Yes, I've been dragging my feet on this. It seems like the fib_lookup does almost what we want, but I need to experiment with it. In the best case, we'd have a helper that does the following: * Try and find a neighbour * Return it if one exists * Otherwise, asynchronously trigger neighbour discovery (if it makes sense) * And return the default gateway I should probably start a new thread about this on the list. > > > The code base started it's life on v4.19, so there are most likely still > > hold overs from old workarounds. In no particular order: > > > > - The function buf_off is required to defeat a clang optimization > > that leads to the verifier rejecting the program due to pointer > > arithmetic in the wrong order. > > > > - The function pkt_parse_ipv6 is force inlined, because it would > > otherwise be rejected due to returning a pointer to stack memory. > > > > - The functions fill_tuple and classify_tcp contain kludges, because > > we've run out of function arguments. > > > > - The logic in general is rather nested, due to verifier restrictions. > > I think this is either because the verifier loses track of constants > > on the stack, or because it can't track enum like variables. > > Have you tried undoing these workarounds to check the latest verifier? > If they're still needed we certainly have to improve the verifier. > > > +#include "test_cls_redirect.skel.h" > > Do you use skeleton internally as well or was it just for selftests? ;) Only for selftests :) Our internal tooling is all Go based. skeleton is really nice though, so I'll make sure to steal some ideas! > > > +_Static_assert( > > + sizeof(flow_ports_t) != > > + offsetofend(struct bpf_sock_tuple, ipv4.dport) - > > + offsetof(struct bpf_sock_tuple, ipv4.sport) - 1, > > + "flow_ports_t must match sport and dport in struct bpf_sock_tuple"); > > Nice. I didn't realize clang supports it. Of course it should. > > > +/* Linux packet pointers are either aligned to NET_IP_ALIGN (aka 2 bytes), > > + * or not aligned if the arch supports efficient unaligned access. > > + * > > + * Since the verifier ensures that eBPF packet accesses follow these rules, > > + * we can tell LLVM to emit code as if we always had a larger alignment. > > + * It will yell at us if we end up on a platform where this is not valid. > > + */ > > +typedef uint8_t *net_ptr __attribute__((align_value(8))); > > Wow. I didn't know about this attribute. > I wonder whether it can help Daniel's memcpy hack. Yes, I think so. > > + > > +typedef struct buf { > > + struct __sk_buff *skb; > > + net_ptr head; > > + /* NB: tail musn't have alignment other than 1, otherwise > > + * LLVM will go and eliminate code, e.g. when checking packet lengths. > > + */ > > + uint8_t *const tail; > > +} buf_t; > > + > > +static size_t buf_off(const buf_t *buf) > > +{ > > + /* Clang seems to optimize constructs like > > + * a - b + c > > + * if c is known: > > + * r? = c > > + * r? -= b > > + * r? += a > > + * > > + * This is a problem if a and b are packet pointers, > > + * since the verifier allows subtracting two pointers to > > + * get a scalar, but not a scalar and a pointer. > > + * > > + * Use inline asm to break this optimization. > > + */ > > + size_t off = (size_t)buf->head; > > + asm("%0 -= %1" : "+r"(off) : "r"(buf->skb->data)); > > + return off; > > +} > > We need to look into this one. > This part is not gated by allow_ptr_leaks. > if (dst_reg == off_reg) { > /* scalar -= pointer. Creates an unknown scalar */ > verbose(env, "R%d tried to subtract pointer from scalar\n", > dst); > return -EACCES; > } > Hopefully not too hard to fix. > > > + > > +static bool pkt_skip_ipv6_extension_headers(buf_t *pkt, > > + const struct ipv6hdr *ipv6, > > + uint8_t *upper_proto, > > + bool *is_fragment) > > +{ > > + /* We understand five extension headers. > > + * https://tools.ietf.org/html/rfc8200#section-4.1 states that all > > + * headers should occur once, except Destination Options, which may > > + * occur twice. Hence we give up after 6 headers. > > + */ > > + struct { > > + uint8_t next; > > + uint8_t len; > > + } exthdr = { > > + .next = ipv6->nexthdr, > > + }; > > + *is_fragment = false; > > + > > +#pragma clang loop unroll(full) > > + for (int i = 0; i < 6; i++) { > > unroll is still needed even with bounded loop support in the verifier? I've just tried removing this on bpf-next. pkt_ipv4_checksum works without the pragma, pkt_skip_ipv6_extension_headers fails with the following message: libbpf: load bpf program failed: Invalid argument libbpf: -- BEGIN DUMP LOG --- libbpf: 476: for (int i = 0; i < 6; i++) { 477: switch (exthdr.next) { back-edge from insn 476 to 477 processed 0 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0 libbpf: -- END LOG -- libbpf: failed to load program 'classifier/cls_redirect' libbpf: failed to load object 'test_cls_redirect' libbpf: failed to load BPF skeleton 'test_cls_redirect': -4007 test_cls_redirect:FAIL:385 #10 cls_redirect:FAIL Summary: 0/0 PASSED, 0 SKIPPED, 1 FAILED > > > > +/* This function has to be inlined, because the verifier otherwise rejects it > > + * due to returning a pointer to the stack. This is technically correct, since > > + * scratch is allocated on the stack. However, this usage should be safe since > > + * it's the callers stack after all. > > + */ > > Interesting. The verifier can recognize that ptr to stack can be safe in some cases. > When I added that check I didn't think that the code can be as tricky as this :) > > > +static verdict_t process_ipv4(buf_t *pkt, metrics_t *metrics) > > +{ > > + switch (ipv4->protocol) { > > + case IPPROTO_ICMP: > > + return process_icmpv4(pkt, metrics); > > + > > + case IPPROTO_TCP: > > + return process_tcp(pkt, ipv4, sizeof(*ipv4), metrics); > > + > > + case IPPROTO_UDP: > > + return process_udp(pkt, ipv4, sizeof(*ipv4), metrics); > > + > > + default: > > + metrics->errors_total_unknown_l4_proto++; > > + return INVALID; > > + } > > +} > > + > > +static verdict_t process_ipv6(buf_t *pkt, metrics_t *metrics) > > + if (is_fragment) { > > + metrics->errors_total_fragmented_ip++; > > + return INVALID; > > + } > > + > > + switch (l4_proto) { > > + case IPPROTO_ICMPV6: > > + return process_icmpv6(pkt, metrics); > > + > > + case IPPROTO_TCP: > > + return process_tcp(pkt, ipv6, sizeof(*ipv6), metrics); > > + > > + case IPPROTO_UDP: > > + return process_udp(pkt, ipv6, sizeof(*ipv6), metrics); > > + > > + default: > > + metrics->errors_total_unknown_l4_proto++; > > + return INVALID; > > + } > > +} > > + > > +SEC("classifier/cls_redirect") > > +int cls_redirect(struct __sk_buff *skb) > > +{ > ... > > + verdict_t verdict; > > + switch (encap->gue.proto_ctype) { > > + case IPPROTO_IPIP: > > + verdict = process_ipv4(&pkt, metrics); > > + break; > > + > > + case IPPROTO_IPV6: > > + verdict = process_ipv6(&pkt, metrics); > > + break; > > The code flow looks pretty clean. > I didn't find the copy-paste of ipv[46] -> tcp/udp > you were mentioning. > So that old issue is now gone? The biggest offenders are fill_tuple (which is purely a hack) as well as classify_tcp. I'd really like to call classify_tcp from cls_redirect(), using saved packet pointers and lengths. Right now the function is called starting from process_ipv4 and process_ipv6, which means all of those arguments have to be passed down somehow. -- Lorenz Bauer | Systems Engineer 6th Floor, County Hall/The Riverside Building, SE1 7PB, UK www.cloudflare.com