Jamal Hadi Salim wrote:
> This is the first patchset of two. In this patchset we are submitting 15
> patches which cover the minimal viable P4 PNA architecture.
>
> __Description of these Patches__
>
> Patch #1 adds infrastructure for per-netns P4 actions that can be created
> on an as-needed basis for the P4 program's requirements. This patch makes
> a small incision into act_api. Patches 2-4 are minimalist enablers for
> P4TC and have no effect on the classical tc actions (for example, patch #2
> just increases the size of action names from 16->64B).
> Patch 5 adds infrastructure support for preallocation of dynamic actions.
>
> The core P4TC code implements several P4 objects.
> 1) Patch #6 introduces P4 data types which are consumed by the rest of
> the code
> 2) Patch #7 introduces the templating API, i.e. CRUD commands for templates
> 3) Patch #8 introduces the concept of templating Pipelines, i.e. CRUD
> commands for P4 pipelines.
> 4) Patch #9 introduces the action templates and associated CRUD commands.
> 5) Patch #10 introduces the action runtime infrastructure.
> 6) Patch #11 introduces the concept of P4 table templates and associated
> CRUD commands for tables.
> 7) Patch #12 introduces runtime table entry infra and associated CU
> commands.
> 8) Patch #13 introduces runtime table entry infra and associated RD
> commands.
> 9) Patch #14 introduces interaction of eBPF with P4TC tables via kfuncs.
> 10) Patch #15 introduces the TC classifier P4 used at runtime.
>
> Daniel, please look again at patch #15.
>
> There are a few more patches (5) not in this patchset that deal with test
> cases, etc.
>
> What is P4?
> -----------
>
> Programming Protocol-independent Packet Processors (P4) is an open source,
> domain-specific programming language for specifying data plane behavior.
>
> The current P4 landscape includes an extensive range of deployments,
> products, projects, services, etc.[9][12]. Two major NIC vendors,
> Intel[10] and AMD[11], currently offer P4-native NICs. P4 is currently
> curated by the Linux Foundation[9].
>
> On why P4, see the small treatise here: [4].
>
> What is P4TC?
> -------------
>
> P4TC is a net-namespace aware P4 implementation over TC; meaning, a P4
> program and its associated objects and state are attached to a kernel
> _netns_ structure. IOW, if we had two programs, whether across netns' or
> within the same netns, they would have no visibility into each other's
> objects (unlike, for example, TC actions, whose kinds are "global" in
> nature, or eBPF maps vis-a-vis bpftool).

[...]

Although I appreciate that a good amount of work went into building the
above, I'll add my concerns here so they are not lost. These are
architecture concerns, not "this line of code needs some tweak" comments.

- It encodes a DSL into the kernel. It's unclear how we pick which DSLs get
pushed into the kernel and which do not. Do we take any DSL folks can code
up? I would prefer a lower-level intermediate language. My view is that
this is a lesson we should have learned from OVS. OVS had wider adoption
and still struggled in some ways; my belief is this is very similar to OVS.
(Also, OVS was novel/great at a lot of things, fwiw.)

- We have a general-purpose language in BPF that can implement the P4 DSL
already. I don't see any need for another set of code when the end goal,
running P4 in the Linux network stack, is already doable. Typically we
reject duplicate things when they don't have concrete benefits.

- P4 as a DSL is not optimized for general-purpose CPUs, but rather for
hardware pipelines. Although it can be optimized for CPUs, it's a harder
problem.
A review of some of the VPP/DPDK work here is useful.

- The P4 infrastructure already has a p4c backend; this is adding another
P4 backend. Instead of getting the rather small group of people to work on
a single backend, we are now creating another one.

- Common reasons I think would justify a new P4 backend and implementation
would be speed, efficiency, or expressiveness. I think this implementation
is neither more efficient nor more expressive. Concrete examples on
expressiveness would be interesting, but I don't see any. Loops were
mentioned once, but the latest kernels have loop support (see the bpf_loop
sketch appended at the end of this mail).

- The main talking point in many slide decks about p4tc is hardware
offload. This seems like the main benefit of pushing the P4 DSL into the
kernel. But we have no hw implementation, not even a vendor stepping up to
comment on this implementation and how it will work for them. HW introduces
all sorts of interesting problems that I don't see how we solve in this
framework. For example, a few off the top of my head: syncing current state
into tc, how the operator programs tc inside the hardware's constraints,
who writes the P4 models for these hardware devices, whether they fit into
this 'tc' infrastructure, partial updates into hardware seeming unlikely to
work for most hardware, ...

- The kfuncs are mostly duplicates of map ops we already have in the BPF
API (see the map-ops sketch appended at the end of this mail). The
motivation, by my read, is to use netlink instead of bpf commands. I don't
agree with this; optimizing for some low-level debug flow a developer uses
is the wrong design space. Actual users should not be deploying this via
ssh into boxes. That workflow will not scale, and really we need tooling
and infra to land P4 programs across the network. This is orders of
magnitude more pain if it is an endpoint solution and not a
middlebox/switch solution. As a switch solution I don't see how p4tc sw
scales to even TOR packet rates. So you need tooling on top, and users
interact with the tooling, not the Linux widget/debugger at the bottom.

- There is no performance analysis. The comment was "functionality before
performance", which I disagree with. If this were a first implementation
and we didn't have a way to do the P4 DSL already, then I might agree, but
here we have an existing solution, so this one should be at least as good
as, and really better than, the existing backend. Adoption of a software
datapath is going to be critically based on performance. I don't see taking
even a 5% hit when porting over to P4 from an existing datapath.

Commentary: I think it's 100% correct to debate how the P4 DSL is
implemented in the kernel. I can't see why this is somehow off limits; this
patch set proposes one approach, and there could be many approaches. BPF
comes up not because I'm some BPF zealot that needs the P4 DSL in BPF, but
because it exists today and there is even a P4 backend for it.
Fundamentally, I don't see the value add we get by creating two P4
pipelines; this is going to create duplication all the way up from the P4
tooling/infra through to the kernel.

From your side you keep saying I'm bike-shedding and demanding BPF, but
from my perspective you're introducing another entire toolchain simply
because you want some low-level debug commands that 99% of P4 users should
not be using or caring about.

To try and be constructive, some things that would change my mind: a vendor
showing how hardware can be used with this, which would be compelling;
performance results showing it is somehow a more performant implementation;
or, lastly, the current p4c implementation being fundamentally broken
somehow.

Thanks,
John
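
Sketch referenced from the loops point above. This is illustrative only: it
assumes a recent clang/libbpf toolchain and a kernel new enough to provide
the bpf_loop() helper (~5.17); the struct and function names and the count
of 8 are invented for the example and are not taken from any P4 program or
from the p4tc patches.

/* Two looping forms the verifier accepts on recent kernels. */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

struct loop_ctx {
	__u32 acc;
};

/* Callback driven by the bpf_loop() helper. */
static long per_iter(__u64 index, void *ctx)
{
	struct loop_ctx *lc = ctx;

	lc->acc += index;	/* stand-in for per-header/per-field work */
	return 0;		/* 0 = keep iterating, 1 = break out */
}

SEC("tc")
int loop_demo(struct __sk_buff *skb)
{
	struct loop_ctx lc = { 0 };
	int i;

	/* Plain bounded loops have been verifier-legal since roughly 5.3. */
	for (i = 0; i < 8; i++)
		lc.acc += i;

	/* bpf_loop() keeps the iteration count out of the verifier's
	 * per-program instruction budget, so large bounds verify cheaply.
	 */
	bpf_loop(8, per_iter, &lc, 0);

	/* Use the result so the compiler keeps both loops around. */
	return (lc.acc & 1) ? TC_ACT_SHOT : TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";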
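
Sketch referenced from the kfuncs point above: a P4-style exact-match table
lookup done with the long-standing BPF map API from a tc program, again
assuming a recent clang/libbpf toolchain. The map name, key/value layouts
and action encoding are invented for illustration and are not the p4tc
kfunc interface; bpf_map_lookup_elem() and friends are the existing
helpers, and the control plane would drive the same map through the
ordinary bpf_map_update_elem()/bpf_map_delete_elem() syscall wrappers
rather than netlink.

/* P4-style exact-match table on top of an ordinary BPF hash map. */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>

struct tbl_key {
	__u32 dst_ip;		/* exact-match field */
};

struct tbl_val {
	__u32 action;		/* 1 = redirect, anything else = drop */
	__u32 port;		/* ifindex used for the redirect */
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, struct tbl_key);
	__type(value, struct tbl_val);
} fwd_table SEC(".maps");

SEC("tc")
int p4_style_lookup(struct __sk_buff *skb)
{
	void *data = (void *)(long)skb->data;
	void *data_end = (void *)(long)skb->data_end;
	struct iphdr *iph;
	struct tbl_key key;
	struct tbl_val *val;

	/* EtherType check omitted for brevity; bounds check is required. */
	if (data + sizeof(struct ethhdr) + sizeof(struct iphdr) > data_end)
		return TC_ACT_OK;
	iph = data + sizeof(struct ethhdr);

	key.dst_ip = iph->daddr;
	val = bpf_map_lookup_elem(&fwd_table, &key);
	if (!val)
		return TC_ACT_OK;	/* table miss -> default action */

	/* Table hit: apply whatever the control plane installed. */
	if (val->action == 1)
		return bpf_redirect(val->port, 0);
	return TC_ACT_SHOT;
}

char _license[] SEC("license") = "GPL";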