> None of the above requires P4TC. For different architectures you
> build optimal backend compilers. You have a Xilinx backend, an
> Intel backend, and a Linux CPU based backend. I see no reason to
> constrain the software case to map to a pipeline model, for
> example. Software running on a CPU has very different
> characteristics from something running on a TOR or an FPGA.
> Trying to push all of these into one backend "model" will result
> in a suboptimal result for every target. At the end of the day,
> my $.02: P4 is a DSL; it needs a target-dependent compiler in
> front of it. I want to optimize my software pipeline: the
> compiler should compress tables as much as possible and search
> for an O(1) lookup, even if getting that key is somewhat
> expensive. Conversely, a TCAM changes the game. An FPGA is going
> to be flexible and make lots of tradeoffs here, which I'm not an
> expert on. Also, by avoiding loading the DSL into the kernel you
> leave room for others to build new/better/worse DSLs as they
> please.

I think the general ask here is to define an Intermediate Representation that describes a programmed data path as a combination of declarative and imperative elements (parsers and table descriptions are better expressed declaratively; functional logic seems more imperative). We also want references to accelerators with dynamic runtime binding to hardware (there are some interesting tricks we can do in the loader for a CPU target-- we'll talk about those at Netdev). With a good IR we can decouple the frontend from the backend target, which enables mixing and matching programming languages with arbitrary HW or SW targets. So a good IR potentially enables a lot of flexibility and freedom on both sides of the equation.

An IR also facilitates reasonable kernel offload via signing images with a hash of the IR. So, for instance, a frontend compiler could compile a P4 program into the IR. That code could then be compiled into a SW target, say eBPF, and maybe P4 hardware. Each image carries the hash of the IR. At runtime, the eBPF code could be loaded into the kernel. The hardware image can be loaded into the device using a side-band mechanism. To offload, we would query the device-- if the hash reported by the device matches the hash in the eBPF image, then we know the offload is viable. No JITs, no pushing firmware bits through the kernel, no need for device capability flags, and it avoids the pitfalls of TC flower.

There is one challenge here: how to deal with offloads that are already integrated into the kernel. I think GRO is a great example. GRO has been especially elusive as an offload since it requires a device to autonomously parse packets on input. We really want a GRO offload that parses exactly the same protocols the kernel does (including encapsulations), but also implements exactly the same logic for timers and for pushing reassembled segments. So this needs to be programmable. The problem with the technique I described is that GRO is integrated into the kernel, so we have no basis for a hash. I think the answer here is to start replacing fixed kernel C code with eBPF, even in the critical path (we already talked about replacing the flow dissector with eBPF).

Anyway, we have been working on this. There's the Common Parser Representation in JSON (formerly known as CPL, which we talked about at Netdev). For execution logic, LLVM IR seems fine (by the way, MLIR is really useful!). We're just starting to look at tables (probably also JSON). If there's interest I could share more...

Tom
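[Editor's illustration of the IR-hash offload check described above. Every name here is hypothetical and not an existing kernel or driver API; it is only a minimal sketch of the control flow: the offload is enabled only when the IR hash the device reports matches the IR hash the frontend compiler embedded in the eBPF image, otherwise we fall back to the software path.]

```c
/* Hypothetical sketch, not a real API: enable offload only when the
 * device and the eBPF image were derived from the same IR.
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define IR_HASH_LEN 32			/* e.g. SHA-256 of the IR */

struct sw_image {
	unsigned char ir_hash[IR_HASH_LEN];	/* hash recorded by the frontend compiler */
	/* ... eBPF program bytes, maps, etc. ... */
};

/* Placeholder for querying the device for the IR hash of the image it is
 * running (that image was loaded out of band via a side-band mechanism).
 */
static int device_query_ir_hash(int devfd, unsigned char *hash, size_t len)
{
	(void)devfd;
	memset(hash, 0, len);	/* stub: a real driver would fill this in */
	return 0;
}

static bool offload_viable(int devfd, const struct sw_image *img)
{
	unsigned char dev_hash[IR_HASH_LEN];

	if (device_query_ir_hash(devfd, dev_hash, sizeof(dev_hash)) < 0)
		return false;	/* cannot verify: stay on the software path */

	/* Offload only if both images carry the same IR hash. */
	return memcmp(dev_hash, img->ir_hash, sizeof(dev_hash)) == 0;
}

int main(void)
{
	struct sw_image img = { .ir_hash = { 0 } };

	printf("offload %s\n", offload_viable(-1, &img) ? "viable" : "not viable");
	return 0;
}
```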