On Tue, May 28, 2024 at 3:17 PM Singhai, Anjali <anjali.singhai@xxxxxxxxx> wrote:
>
> >From: John Fastabend <john.fastabend@xxxxxxxxx>
> >Sent: Tuesday, May 28, 2024 1:17 PM
>
> >Jain, Vipin wrote:
> >> [AMD Official Use Only - AMD Internal Distribution Only]
> >>
> >> My apologies, earlier email used html and was blocked by the list...
> >> My response at the bottom as "VJ>"
> >>
> >> ________________________________________
>
> >Anjali and Vipin, is your support for HW support of P4 or a Linux SW
> >implementation of P4? If it's for HW support, what drivers would we
> >want to support? Can you describe how to program these devices?
>
> >At the moment there hasn't been any movement on the Linux hardware P4
> >support side as far as I can tell. Yes, there are some SDKs and build
> >kits floating around for FPGAs. For example, maybe start with what
> >drivers in the kernel tree run the DPUs that have this support? I think
> >this would be a productive direction to go if we in fact have hardware
> >support in the works.
>
> >If you want a SW implementation in Linux, my opinion is still that
> >pushing a DSL into the kernel datapath via qdisc/tc is the wrong
> >direction. Mapping P4 onto hardware blocks is a fundamentally different
> >architecture from mapping P4 onto general purpose CPU and registers.
> >My opinion -- to handle this you need a per architecture backend/JIT
> >to compile the P4 to native instructions. This will give you the most
> >flexibility to define new constructs, best performance, and lowest
> >overhead runtime. We have a P4 BPF backend already and JITs for most
> >architectures; I don't see the need for P4TC in this context.
>
> >If the end goal is a hardware offload control plane, I'm skeptical we
> >even need something specific just for SW datapath. I would propose a
> >devlink or new infra to program the device directly vs overhead and
> >complexity of abstracting through 'tc'. If you want to emulate your
> >device, use BPF or user space datapath.
>
> >.John
>
>
> John,
> Let me start by saying production hardware exists; I think Jamal
> posted some links, but I can point you to our hardware.
> The hardware devices under discussion are capable of being abstracted
> using the P4 match-action paradigm, so that's why we chose TC.
> These devices are programmed using the TC/netlink interface, i.e., the
> standard TC control-driver ops apply. While it is clear to us that the
> P4TC abstraction suffices, we are currently discussing details that
> will cater for all vendors in our biweekly meetings.
> One big requirement is we want to avoid the flower trap - we don't
> want to be changing kernel/user/driver code every time we add new
> datapaths.
> We feel the P4TC approach is the path to add Linux kernel support.
>
> The s/w path is needed as well for several reasons.
> We need the same P4 program to run either in software or hardware or
> in both using skip_sw/skip_hw. It could be either in split mode or as
> an exception path as it is done today in flower or u32. Also, it is
> common now in the P4 community that people define their datapath using
> their program and will write a control application that works for both
> hardware and software datapaths. They could be using the software
> datapath for testing as you said, but also for the split/exception
> path. Chris can probably add more comments on the software datapath.

Hi Anjali,

Are there any use cases of P4-TC that don't involve P4 hardware? If
someone wanted to write one-off datapath code for their deployment and
they didn't have P4 hardware, would you suggest that they write their
code in P4-TC?

The reason I ask is that I'm concerned about the performance of P4-TC.
Like John said, this is mapping code that is intended to run in
specialized hardware onto a CPU, and it's also interpreted execution in
TC. The performance numbers in
https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
seem to show that P4-TC has about half the performance of XDP.
Even with a lot of work, it's going to be difficult to substantially
close that gap. The risk if we allow this into the kernel is that a
vendor might be tempted to point to P4-TC performance as a baseline to
justify to customers that they need to buy specialized hardware to get
performance, whereas if XDP were used, maybe they wouldn't need the
performance and cost of hardware.

Note, this scenario already happened once before: when DPDK joined the
LF, they made bogus claims that they got 100x the performance of the
kernel. Had they put even the slightest effort into tuning the kernel,
that would have dropped the delta by an order of magnitude, and since
then we've pretty much closed the gap (actually, this is precisely what
motivated the creation of XDP, so I guess that story had a happy
ending!).

There are circumstances where hardware offload may be warranted, but it
needs to be honestly justified by comparing it to an optimized software
solution. So in the case of P4, it should be compared to well-written
XDP code, for instance, not P4-TC.

Tom

>
>
> Anjali
>