On Tue, 21 Sept 2021 at 17:06, Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
>
> Hi Lorenz (Cc. the other people who participated in today's discussion)
>
> Following our discussion at the LPC session today, I dug up my previous
> summary of the issue and some possible solutions[0]. Seems no one
> actually replied last time, which is why we went with the "do nothing"
> approach, I suppose. I'm including the full text of the original email
> below; please take a look, and let's see if we can converge on a
> consensus here.

Hi Toke,

Thanks for looping me in again. A bit of context on what XDP at
Cloudflare looks like:

* We have a chain of XDP programs attached to a real network device,
  implementing DDoS protection and L4 load balancing. This is
  maintained by the team I am on.

* We have hundreds of network namespaces with veths that have XDP
  attached to them. Traffic is routed from the root namespace into
  these. This is maintained by the Magic Transit team; see this talk
  from last year's LPC [1].

I'll try to summarise what I've picked up from the thread and add my
own 2c. Options being considered:

1. Make sure mb-aware and mb-unaware programs don't mix.

This could either be in the form of a sysctl or a dynamic property
similar to a refcount. We'd need to discern mb-aware from mb-unaware
somehow, most easily via a new program type (I've sketched one
possible version at the end of this mail). This means recompiling
existing programs, but then we expect that to be necessary anyway.
We'd also have to be able to indicate "mb-awareness" for freplace
programs.

The implementation complexity seems OK, but the operator UX is not
good: it's not possible to slowly migrate a system to mb-awareness,
it has to happen in one fell swoop. This would be really problematic
for us, since we already have multiple teams writing and deploying
XDP independently of each other, and that number is only going to
grow.

It seems there will also be trickiness around redirecting into
different devices? Not something we do today, but it's kind of an
obvious optimization to start redirecting into network namespaces
from XDP instead of relying on routing.

2. Add a compatibility shim for mb-unaware programs receiving an mb
frame.

We'd still need a way to indicate "mb-OK", but it could be a piece of
metadata on a bpf_prog. Whatever code dispatches to an XDP program
would have to include a prologue that linearises the xdp_buff if
necessary, which implies allocating memory. I don't know how hard
that is to implement (there's a rough sketch of what I imagine
below). There is also the question of freplace: do we extend
linearising to them, or do they have to support mb?

You raised an interesting point: couldn't we hit programs that can't
handle data_end - data being above a certain length? I think we
(= Cloudflare) actually have one of those, since in some cases we
need to traverse the entire buffer to calculate a checksum (we
encapsulate UDPv4 in IPv6, don't ask). It turns out to be really hard
to calculate the checksum of a variable-length packet in BPF, so
we've had to introduce limits (there's a sketch of the kind of loop
this forces on us below as well). However, this case isn't too
important: we made the choice consciously, knowing that MTU changes
would break it.

Other than that I like this option a lot: mb-aware and mb-unaware
programs can co-exist, at the cost of performance. This would allow
us to gradually migrate our stack towards handling jumbo frames.

3. Make non-linearity invisible to the BPF program.

Something I've often wished for, based on my experience with
cls_redirect [2], is that I didn't have to deal with non-linearity at
all. It's really hard to write a BPF program that handles non-linear
skbs, especially when you have to call adjust_head, etc., which
invalidates packet buffers. This is probably impossible, but maybe
someone has a crazy idea? :)
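To make some of the above more concrete, here are a few rough
sketches. None of this is real UAPI: everything marked as made up is
something I invented for illustration.

For option 1, the least invasive variant I can think of is a
load-time flag rather than a whole new program type. BPF_F_XDP_MB_AWARE
is made up; the libbpf calls are real:

#include <bpf/libbpf.h>

/* Made-up flag: declares that the programs in this object cope with
 * multi-buffer frames.
 */
#define BPF_F_XDP_MB_AWARE (1U << 8)

static int load_mb_aware(struct bpf_object *obj)
{
        struct bpf_program *prog;
        int err;

        bpf_object__for_each_program(prog, obj) {
                err = bpf_program__set_flags(prog, BPF_F_XDP_MB_AWARE);
                if (err)
                        return err;
        }

        return bpf_object__load(obj);
}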
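For option 2, this is roughly the prologue I imagine around program
dispatch. bpf_prog_run_xdp() is real; xdp_mb_aware, the frags check
and xdp_linearize() are invented, and error handling is elided:

/* Illustration only: linearise before running an mb-unaware program. */
static u32 bpf_prog_run_xdp_compat(const struct bpf_prog *prog,
                                   struct xdp_buff *xdp)
{
        if (!prog->aux->xdp_mb_aware && xdp_buff_has_frags(xdp)) {
                /* Copy the fragments into one contiguous buffer so
                 * that data..data_end covers the whole packet. This
                 * is where the allocation (and the cost) comes in.
                 */
                if (xdp_linearize(xdp) < 0)
                        return XDP_DROP;
        }

        return bpf_prog_run_xdp(prog, xdp);
}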
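Still on option 2, this is the shape of the bounded checksum loop I
mentioned. It's heavily simplified (odd trailing bytes are ignored,
MAX_CSUM_WORDS is our self-imposed limit, not a kernel constant), but
it shows why a fixed bound is forced on us: the verifier needs a
constant loop bound and a data_end check on every access.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define MAX_CSUM_WORDS 64 /* self-imposed limit */

/* Sum up to MAX_CSUM_WORDS 16-bit words starting at off. Returns
 * -1 if the packet has more data than the limit covers.
 */
static __always_inline int sum_words(struct xdp_md *ctx, __u32 off,
                                     __u32 *csum)
{
        void *data = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        __u32 i;

#pragma unroll
        for (i = 0; i < MAX_CSUM_WORDS; i++) {
                __u16 *word = data + off + i * sizeof(__u16);

                if ((void *)(word + 1) > data_end)
                        return 0; /* reached the end of the packet */

                *csum += *word;
        }

        if (data + off + MAX_CSUM_WORDS * sizeof(__u16) < data_end)
                return -1; /* packet is longer than our limit */

        return 0;
}

Linearising mb frames makes data_end - data bigger, so a program like
this silently goes from "works" to "rejects packets".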
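And to give option 3 some shape: the closest existing thing is
probably bpf_skb_load_bytes(), which already hides skb non-linearity
behind a copy. An XDP equivalent might look like the below, where
xdp_load_bytes() is a made-up helper (declared with a bogus ID, in
the usual helper style):

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ipv6.h>
#include <linux/udp.h>
#include <bpf/bpf_helpers.h>

/* Made-up helper declaration: no such helper (or ID) exists today. */
static long (*xdp_load_bytes)(struct xdp_md *ctx, __u32 offset,
                              void *buf, __u32 len) = (void *)999;

SEC("xdp")
int read_udp(struct xdp_md *ctx)
{
        struct udphdr udph;

        /* Copies across fragments: the program never touches
         * data/data_end, and udph stays valid even after helpers
         * that would otherwise invalidate packet pointers.
         */
        if (xdp_load_bytes(ctx, ETH_HLEN + sizeof(struct ipv6hdr),
                           &udph, sizeof(udph)) < 0)
                return XDP_DROP;

        return XDP_PASS;
}

That still doesn't help with writes or with what happens after
adjust_head though, which I suspect is the really hard part.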
Lorenz

1: https://youtu.be/UkvxPyIJAko?t=10057
2: https://elixir.bootlin.com/linux/latest/source/tools/testing/selftests/bpf/progs/test_cls_redirect.c

--
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK
www.cloudflare.com