Re: On the NACKs on P4TC patches

Tom Herbert <tom@xxxxxxxxxx> · Tue, 28 May 2024 16:01:40 -0700

On Tue, May 28, 2024 at 3:17 PM Singhai, Anjali
<anjali.singhai@xxxxxxxxx> wrote:
>
> >From: John Fastabend <john.fastabend@xxxxxxxxx>
> >Sent: Tuesday, May 28, 2024 1:17 PM
>
> >Jain, Vipin wrote:
> >> [AMD Official Use Only - AMD Internal Distribution Only]
> >>
> >> My apologies, earlier email used html and was blocked by the list...
> >> My response at the bottom as "VJ>"
> >>
> >> ________________________________________
>
> >Anjali and Vipin is your support for HW support of P4 or a Linux SW implementation of P4. If its for HW support what drivers would we want to support? Can you describe how to program these devices?
>
> >At the moment there hasn't been any movement on Linux hardware P4 support side as far as I can tell. Yes there are some SDKs and build kits floating around for FPGAs. For example maybe start with what drivers in kernel tree run the DPUs that have this support? I think this would be a productive direction to go if we in fact have hardware support in the works.
>
> >If you want a SW implementation in Linux my opinion is still pushing a DSL into the kernel datapath via qdisc/tc is the wrong direction. Mapping P4 onto hardware blocks is fundamentally different architecture from mapping
> >P4 onto general purpose CPU and registers. My opinion -- to handle this you need a per architecture backend/JIT to compile the P4 to native instructions.
> >This will give you the most flexibility to define new constructs, best performance, and lowest overhead runtime. We have a P4 BPF backend already and JITs for most architectures I don't see the need for P4TC in this context.
>
> >If the end goal is a hardware offload control plane I'm skeptical we even need something specific just for SW datapath. I would propose a devlink or new infra to program the device directly vs overhead and complexity of abstracting through 'tc'. If you want to emulate your device use BPF or user space datapath.
>
> >.John
>
>
> John,
> Let me start by saying production hardware exists i think Jamal posted some links but i can point you to our hardware.
> The hardware devices under discussion are capable of being abstracted using the P4 match-action paradigm so that's why we chose TC.
> These devices are programmed using the TC/netlink interface i.e the standard TC control-driver ops apply. While it is clear to us that the P4TC abstraction suffices, we are currently discussing details that will cater for all vendors in our biweekly meetings.
> One big requirement is we want to avoid the flower trap - we dont want to be changing kernel/user/driver code every time we add new datapaths.
> We feel P4TC approach is the path to add Linux kernel support.
>
> The s/w path is needed as well for several reasons.
> We need the same P4 program to run either in software or hardware or in both using skip_sw/skip_hw. It could be either in split mode or as an exception path as it is done today in flower or u32. Also it is common now in the P4 community that people define their datapath using their program and will write a control application that works for both hardware and software datapaths. They could be using the software datapath for testing as you said but also for the split/exception path. Chris can probably add more comments on the software datapath.

Hi Anjali,

Are there any use cases of P4-TC that don't involve P4 hardware? If
someone wanted to write one-off datapath code for their deployment and
they didn't have P4 hardware, would you suggest that they write their
code in P4-TC? The reason I ask is that I'm concerned about the
performance of P4-TC. Like John said, this is mapping code that is
intended to run in specialized hardware onto a CPU, and it's also
interpreted execution in TC. The performance numbers in
https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
seem to show that P4-TC has about half the performance of XDP. Even
with a lot of work, it's going to be difficult to substantially close
that gap.

The risk if we allow this into the kernel is that a vendor might be
tempted to point to P4-TC performance as a baseline to justify to
customers that they need to buy specialized hardware to get
performance, whereas with XDP they might not need the hardware, or
its cost, at all. Note, this scenario already happened once before:
when DPDK joined the LF, they made bogus claims that they got 100x
the performance of the kernel. Had they put even the slightest effort
into tuning the kernel, that would have dropped the delta by an order
of magnitude, and since then we've pretty much closed the gap
(actually, this is precisely what motivated the creation of XDP, so I
guess that story had a happy ending!). There are circumstances where
hardware offload may be warranted, but it needs to be honestly
justified by comparing it to an optimized software solution. So in
the case of P4, it should be compared to well-written XDP code, for
instance, not P4-TC.

Tom

>
>
> Anjali
>