On Mon, Mar 4, 2024 at 1:19 PM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
>
> On 03/03, Tom Herbert wrote:
> > On Sat, Mar 2, 2024 at 7:15 PM Jakub Kicinski <kuba@xxxxxxxxxx> wrote:
> > >
> > > On Fri, 1 Mar 2024 18:20:36 -0800 Tom Herbert wrote:
> > > > This is configurability versus programmability. The table driven
> > > > approach as input (configurability) might work fine for generic
> > > > match-action tables up to the point that tables are expressive enough
> > > > to satisfy the requirements. But parsing doesn't fall into the table
> > > > driven paradigm: parsers want to be *programmed*. This is why we
> > > > removed kParser from this patch set and fell back to eBPF for parsing.
> > > > But the problem we quickly hit that eBPF is not offloadable to network
> > > > devices, for example when we compile P4 in an eBPF parser we've lost
> > > > the declarative representation that parsers in the devices could
> > > > consume (they're not CPUs running eBPF).
> > > >
> > > > I think the key here is what we mean by kernel offload. When we do
> > > > kernel offload, is it the kernel implementation or the kernel
> > > > functionality that's being offloaded? If it's the latter then we have
> > > > a lot more flexibility. What we'd need is a safe and secure way to
> > > > synchronize with that offload device that precisely supports the
> > > > kernel functionality we'd like to offload. This can be done if both
> > > > the kernel bits and programmed offload are derived from the same
> > > > source (i.e. tag source code with a sha-1). For example, if someone
> > > > writes a parser in P4, we can compile that into both eBPF and a P4
> > > > backend using independent tool chains and program download. At
> > > > runtime, the kernel can safely offload the functionality of the eBPF
> > > > parser to the device if it matches the hash to that reported by the
> > > > device
> > >
> > > Good points.
> > > If I understand you correctly you're saying that parsers
> > > are more complex than just a basic parsing tree a'la u32.
> >
> > Yes. Parsing things like TLVs, GRE flag field, or nested protobufs
> > isn't conducive to u32. We also want the advantages of compiler
> > optimizations to unroll loops, squash nodes in the parse graph, etc.
> >
> > > Then we can take this argument further. P4 has grown to encompass a lot
> > > of functionality of quite complex devices. How do we square that with
> > > the kernel functionality offload model. If the entire device is modeled,
> > > including f.e. TSO, an offload would mean that the user has to write
> > > a TSO implementation which they then load into TC? That seems odd.
> > >
> > > IOW I don't quite know how to square in my head the "total
> > > functionality" with being a TC-based "plugin".
> >
> > Hi Jakub,
> >
> > I believe the solution is to replace kernel code with eBPF in cases
> > where we need programmability. This effectively means that we would
> > ship eBPF code as part of the kernel. So in the case of TSO, the
> > kernel would include a standard implementation in eBPF that could be
> > compiled into the kernel by default. The restricted C source code is
> > tagged with a hash, so if someone wants to offload TSO they could
> > compile the source into their target and retain the hash. At runtime
> > it's a matter of querying the driver to see if the device supports the
> > TSO program the kernel is running by comparing hash values. Scaling
> > this, a device could support a catalogue of programs: TSO, LRO,
> > parser, IPtables, etc., If the kernel can match the hash of its eBPF
> > code to one reported by the driver then it can assume functionality is
> > offloadable.
> > This is an elaboration of "device features", but instead
> > of the device telling us they think they support an adequate GRO
> > implementation by reporting NETIF_F_GRO, the device would tell the
> > kernel that they not only support GRO but they provide identical
> > functionality of the kernel GRO (which IMO is the first requirement of
> > kernel offload).
> >
> > Even before considering hardware offload, I think this approach
> > addresses a more fundamental problem to make the kernel programmable.
> > Since the code is in eBPF, the kernel can be reprogrammed at runtime
> > which could be controlled by TC. This allows local customization of
> > kernel features, but also is the simplest way to "patch" the kernel
> > with security and bug fixes (nobody is ever excited to do a kernel
>
> [..]
>
> > rebase in their datacenter!). Flow dissector is a prime candidate for
> > this, and I am still planning to replace it with an all eBPF program
> > (https://netdevconf.info/0x15/slides/16/Flow%20dissector_PANDA%20parser.pdf).
>
> So you're suggesting to bundle (and extend)
> tools/testing/selftests/bpf/progs/bpf_flow.c? We were thinking along
> similar lines here. We load this program manually right now, shipping
> and autoloading with the kernel will be easier.

Hi Stanislav,

Yes, I envision that we would have a standard implementation of
flow-dissector in eBPF that is shipped with the kernel and autoloaded.
However, for the front end source I want to move away from imperative
code. As I mentioned in the presentation, flow_dissector.c is spaghetti
code and has been prone to bugs over the years, especially whenever
someone adds support for a new fringe protocol (I take the liberty to
call it spaghetti code since I'm partially responsible for creating
this mess ;-) ).

The problem is that parsers are much better represented by a
declarative rather than an imperative representation.
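To make the distinction concrete, here is a minimal sketch in plain C of
a parse graph expressed as data, with one generic walker that owns all
the length checks. This is not the PANDA API: every name below
(struct parse_node, struct proto_entry, parse(), eth_node, ...) is
hypothetical and chosen only for illustration, and it assumes a linear
buffer and handles just Ethernet -> IPv4 -> UDP.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical metadata the parser extracts; not a kernel structure. */
struct metadata {
	uint16_t eth_type;
	uint8_t ip_proto;
	uint16_t dst_port;
};

struct parse_node;

/* One edge of the parse graph: next-protocol value -> next node. */
struct proto_entry {
	uint32_t proto;
	const struct parse_node *next;
};

/* One node of the parse graph, described as data plus tiny accessors. */
struct parse_node {
	size_t min_len;                              /* bytes that must be present */
	size_t (*hdr_len)(const uint8_t *hdr);       /* NULL: length is min_len */
	uint32_t (*next_proto)(const uint8_t *hdr);  /* unused on leaf nodes */
	void (*extract)(const uint8_t *hdr, struct metadata *md);
	const struct proto_entry *table;             /* outgoing edges */
	size_t table_len;                            /* 0 marks a leaf */
};

enum { PARSE_OK = 0, PARSE_TRUNCATED = -1, PARSE_BAD_HDR = -2,
       PARSE_UNKNOWN_PROTO = -3 };

static uint16_t be16(const uint8_t *p) { return (uint16_t)((p[0] << 8) | p[1]); }

/* The protocol descriptions: no bounds checks appear in any of them. */
static void udp_extract(const uint8_t *h, struct metadata *md) { md->dst_port = be16(h + 2); }
static const struct parse_node udp_node = { .min_len = 8, .extract = udp_extract };

static size_t ipv4_hdr_len(const uint8_t *h) { return (size_t)(h[0] & 0x0f) * 4; }
static uint32_t ipv4_next(const uint8_t *h) { return h[9]; }
static void ipv4_extract(const uint8_t *h, struct metadata *md) { md->ip_proto = h[9]; }
static const struct proto_entry ipv4_table[] = { { 17, &udp_node } };
static const struct parse_node ipv4_node = {
	.min_len = 20, .hdr_len = ipv4_hdr_len, .next_proto = ipv4_next,
	.extract = ipv4_extract, .table = ipv4_table, .table_len = 1,
};

static uint32_t eth_next(const uint8_t *h) { return be16(h + 12); }
static void eth_extract(const uint8_t *h, struct metadata *md) { md->eth_type = be16(h + 12); }
static const struct proto_entry eth_table[] = { { 0x0800, &ipv4_node } };
static const struct parse_node eth_node = {
	.min_len = 14, .next_proto = eth_next,
	.extract = eth_extract, .table = eth_table, .table_len = 1,
};

/* Generic walker: the only place where lengths are validated. */
int parse(const struct parse_node *node, const uint8_t *pkt, size_t len,
	  struct metadata *md)
{
	size_t off = 0;

	assert(node && pkt && md);
	for (;;) {
		const uint8_t *hdr = pkt + off;
		size_t avail = len - off;   /* off never exceeds len */
		const struct parse_node *next = NULL;
		uint32_t proto;
		size_t hlen, i;

		if (avail < node->min_len)
			return PARSE_TRUNCATED;
		hlen = node->hdr_len ? node->hdr_len(hdr) : node->min_len;
		if (hlen < node->min_len)
			return PARSE_BAD_HDR;
		if (hlen > avail)
			return PARSE_TRUNCATED;
		if (node->extract)
			node->extract(hdr, md);
		if (node->table_len == 0)
			return PARSE_OK;    /* leaf node: done */

		proto = node->next_proto(hdr);
		for (i = 0; i < node->table_len; i++)
			if (node->table[i].proto == proto)
				next = node->table[i].next;
		if (!next)
			return PARSE_UNKNOWN_PROTO;
		node = next;
		off += hlen;
	}
}
```

In this style, supporting another protocol means adding a node and a
table entry; the bounds checks in parse() are written once rather than
repeated per protocol, and a compiler working from the tables is free
to unroll or merge nodes.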
To that end, we defined PANDA, which allows constructing a parser
(parse graph) in data structures in C. We use the "PANDA parser" to
compile that C to restricted C code, which is imperative and looks more
like eBPF. With this method we abstract out all the bookkeeping that
was often the source of bugs (like pulling up skbufs, checking length
limits, etc.). The other advantage is that we're able to find a lot
more optimizations if we start with the right representation of the
problem.

If you're interested, the video presentation on this is at
https://www.youtube.com/watch?v=zVnmVDSEoXc.

Tom