Re: On the NACKs on P4TC patches

Jamal Hadi Salim <jhs@xxxxxxxxxxxx> · Thu, 30 May 2024 12:59:30 -0400

On Wed, May 29, 2024 at 10:46 AM Tom Herbert <tom@xxxxxxxxxx> wrote:
>
> On Wed, May 29, 2024 at 4:01 AM Jamal Hadi Salim <jhs@xxxxxxxxxxxx> wrote:
> >
> >
> >
> > On Tue, May 28, 2024 at 7:43 PM Chris Sommers <chris.sommers@xxxxxxxxxxxx> wrote:
> >>
> >> > On Tue, May 28, 2024 at 3:17 PM Singhai, Anjali
> >> > <anjali.singhai@xxxxxxxxx> wrote:
> >> > >
> >> > > >From: John Fastabend <john.fastabend@xxxxxxxxx>
> >> > > >Sent: Tuesday, May 28, 2024 1:17 PM
> >> > >
> >> > > >Jain, Vipin wrote:
> >> > > >> [AMD Official Use Only - AMD Internal Distribution Only]
> >> > > >>
> >> > > >> My apologies, earlier email used html and was blocked by the list...
> >> > > >> My response at the bottom as "VJ>"
> >> > > >>
> >> > > >> ________________________________________
> >> > >
> >> > > >Anjali and Vipin, is your support for HW support of P4 or for a Linux SW implementation of P4? If it's for HW support, what drivers would we want to support? Can you describe how to program these devices?
> >> > >
> >> > > >At the moment there hasn't been any movement on the Linux hardware P4 support side as far as I can tell. Yes, there are some SDKs and build kits floating around for FPGAs. For example, maybe start with what drivers in the kernel tree run the DPUs that have this support? I think this would be a productive direction to go if we in fact have hardware support in the works.
> >> > >
> >> > > >If you want a SW implementation in Linux, my opinion is still that pushing a DSL into the kernel datapath via qdisc/tc is the wrong direction. Mapping P4 onto hardware blocks is a fundamentally different architecture from mapping
> >> > > >P4 onto general purpose CPU and registers. My opinion -- to handle this you need a per architecture backend/JIT to compile the P4 to native instructions.
> >> > > >This will give you the most flexibility to define new constructs, best performance, and lowest overhead runtime. We have a P4 BPF backend already and JITs for most architectures; I don't see the need for P4TC in this context.
> >> > >
> >> > > >If the end goal is a hardware offload control plane, I'm skeptical we even need something specific just for the SW datapath. I would propose a devlink or new infra to program the device directly vs. the overhead and complexity of abstracting through 'tc'. If you want to emulate your device, use BPF or a user space datapath.
> >> > >
> >> > > >.John
> >> > >
> >> > >
> >> > > John,
> >> > > Let me start by saying production hardware exists. I think Jamal posted some links, but I can point you to our hardware.
> >> > > The hardware devices under discussion are capable of being abstracted using the P4 match-action paradigm, so that's why we chose TC.
> >> > > These devices are programmed using the TC/netlink interface, i.e. the standard TC control-driver ops apply. While it is clear to us that the P4TC abstraction suffices, we are currently discussing details that will cater for all vendors in our biweekly meetings.
> >> > > One big requirement is that we want to avoid the flower trap - we don't want to be changing kernel/user/driver code every time we add new datapaths.
> >> > > We feel P4TC approach is the path to add Linux kernel support.
> >> > >
> >> > > The s/w path is needed as well for several reasons.
> >> > > We need the same P4 program to run either in software or hardware or in both using skip_sw/skip_hw. It could be either in split mode or as an exception path as it is done today in flower or u32. Also it is common now in the P4 community that people define their datapath using their program and will write a control application that works for both hardware and software datapaths. They could be using the software datapath for testing as you said but also for the split/exception path. Chris can probably add more comments on the software datapath.
> >>
> >> Anjali, thanks for asking. Agreed, I like the flexibility of accommodating a variety of platforms depending upon performance requirements and intended target system. For me, flexibility is important. Some solutions need an inline filter and P4-TC makes it so easy. The fact I will be able to get HW offload means I'm not performance bound. Some other solutions might need DPDK implementation, so P4-DPDK is a choice there as well, and there are acceleration options. Keeping much of the dataplane design in one language (P4) makes it easier for more developers to create products without having to be platform-level experts. As someone who's worked with P4 Tofino, P4-TC, bmv2, etc. I can authoritatively state that all have their proper place.
> >> >
> >> > Hi Anjali,
> >> >
> >> > Are there any use cases of P4-TC that don't involve P4 hardware? If
> >> > someone wanted to write one off datapath code for their deployment and
> >> > they didn't have P4 hardware would you suggest that they write their
> >> > code in P4-TC? The reason I ask is because I'm concerned about the
> >> > performance of P4-TC. Like John said, this is mapping code that is
> >> > intended to run in specialized hardware into a CPU, and it's also
> >> > interpreted execution in TC. The performance numbers in
> >> > https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
> >> > seem to show that P4-TC has about half the performance of XDP. Even
> >> > with a lot of work, it's going to be difficult to substantially close
> >> > that gap.
> >>
> >> AFAIK P4-TC can emit XDP or eBPF code depending upon the situation, someone more knowledgeable should chime in.
> >> However, I don't agree that comparing the speeds of XDP vs. P4-TC should even be a deciding factor.
> >> If P4-TC is good enough for a lot of applications, that is fine by me and over time it'll only get better.
> >> If we held back every innovation because it was slower than something else, progress would suffer.
> >> >
> >
> >
> > Yes, XDP can be emitted based on compiler options (and was a motivating factor in considering use of eBPF). Tom's comment above seems to treat the fact that XDP tends to be faster than eBPF at TC as the fault of P4TC.
> > In any case this statement falls under: https://github.com/p4tc-dev/pushback-patches?tab=readme-ov-file#2b-comment-but--it-is-not-performant
>
> Jamal,
>
> From that: "My response has always consistently been: performance is a
> lower priority to P4 correctness and expressibility." That might be
> true for P4, but not for the kernel. CPU performance is important, and
> your statement below that justifies offloads on the basis that "no
> general purpose CPU will save you" confirms that. Please be more
> upfront about what the performance is like including performance
> numbers in the cover letter for the next patch set. This is the best
> way to avoid confusion and rampant speculation, and if performance
> isn't stellar being open about it in the community is the best way to
> figure out how to improve it.

I believe you are misreading those graphs, or maybe you are mixing them
up with the original u32/pedit script approach? The tests are run at
the TC and XDP layers. Pay particular attention to the results of the
handcoded/tuned eBPF datapath at TC and at XDP compared to the analogous
ones generated by the compiler. You will notice +/-5% or so
differences. That is with the current compiler generated code. We are
looking to improve that - but do note that it is generated code, nothing
to do with the kernel. As the P4 program becomes more complex (many
tables, longer keys, more entries, more complex actions) we
become compute bound, so no difference really.

Now having said that: yes - s/w performance is certainly _not our
highest priority feature_. That is not saying we don't care, but as
the text said: if I am getting 2Mpps using handcoding vs 1.84Mpps using
generated code (per those graphs) and I can generate code and execute
it in 5 minutes (Chris, who is knowledgeable in P4, was able to do it in
less time), then _I pick the code generation any day of the week_.
Tooling, tooling, tooling.
To re-iterate, the most important requirement is the abstraction, meaning:
I can take the same P4 program I am running in s/w and generate using
a different backend for AMD or Intel offload equivalent and get
several orders of magnitude improvement in performance because it is
now running in h/w. I still get to use the same application controlling
s/w and/or hardware, etc.

TBH, I am indifferent and could add some numbers, but that would miss
the emphasis of what we are trying to achieve. The cover letter is
already half a novel; with the short attention span most people have,
it will just muddy the waters.

> >
> > On Tom's theory that the vendors are going to push inferior s/w for the sake of selling h/w: we are not in the 90s anymore and there's no vendor conspiracy theory here: a single port can do 100s of Gbps, and of course if you want to do high speed you need to offload, no general purpose CPU will save you.
>
> Let's not pretend that offloads are a magic bullet that just makes
> everything better, if that were true then we'd all be using TOE by
> now! There are a myriad of factors to consider whether offloading is
> worth it. What is "high speed", is this small packets or big packets,
> are we terminating TCP, are we doing some sort of fast/slow path split
> which might work great in the lab but on the Internet can become a DOS
> vector? What's the application? Are we just trying to offload parts of
> the datapath, TCP, RDMA, memcached, ML reduce operations? Are we
> trying to do line rate encryption, compression, trying to do a billion
> PCB lookups a second? Are we taking into account continuing
> advancements in the CPU that have in the past made offloads obsolete
> (for instance, AES instructions pretty much obsoleted initial attempts
> to offload IPsec)? How simple is the programming model, how
> debuggable is it, what's the TCO?
>
> I do believe offload is part of the solution. And the good news is
> that programmable devices facilitate that. IMO, our challenge is to
> create a facility in the kernel to handle offloads in a much better
> way (I don't believe there's disagreement with these points).
>

This is about a MAT (match-action table) model whose offloads are
covered via TC; it is well understood and very specific.
We are not trying to solve "the world of offloads", which includes
TOEs. P4-aware NICs are on the market and afaik those ASICs are not
solving TOE. I thought you understood the scope, but if not, start by
reading this: https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md

cheers,
jamal

> Tom
>
>
>
>
>
> >
> > cheers,
> > jamal
> >
> >>
> >> > The risk if we allow this into the kernel is that a vendor might be
> >> > tempted to point to P4-TC performance as a baseline to justify to
> >> > customers that they need to buy specialized hardware to get
> >> > performance, whereas if XDP was used maybe they don't need the
> >> > performance and cost of hardware.
> >>
> >> I really don't buy this argument, it's FUD. Let's judge P4-TC on its merits, not prejudge it as a ploy to sell vendor hardware.
> >>
> >> > Note, this scenario already happened
> >> > once before, when the DPDK joined LF they made bogus claims that they
> >> > got a 100x performance over the kernel-- had they put at least the
> >> > slightest effort into tuning the kernel that would have dropped the
> >> > delta by an order of magnitude, and since then we've pretty much
> >> > closed the gap (actually, this is precisely what motivated the
> >> > creation of XDP so I guess that story had a happy ending!) . There are
> >> > circumstances where hardware offload may be warranted, but it needs to
> >> > be honestly justified by comparing it to an optimized software
> >> > solution-- so in the case of P4, it should be compared to well written
> >> > XDP code for instance, not P4-TC.
> >>
> >> I strongly disagree that "it needs to be honestly justified by comparing it to an optimized software solution."
> >> Says who? This is no more factual than saying "C or golang need to be judged by comparing it to assembly language."
> >> Today the gap between C and assembly is small, but way back in my career, C was way slower.
> >> Over time optimizing compilers have closed the gap. Who's to say P4 technologies won't do the same?
> >> P4-TC can be judged on its own merits for its utility and productivity. I can't stress enough that P4 is very productive when applied to certain problems.
> >>
> >> Note, P4-BMv2 has been used by thousands of developers, researchers and students and it is relatively slow. Yet that doesn't deter users.
> >> There is a Google Summer of Code project to add PNA support, rather ambitious. However, P4-TC already partially supports PNA and the gap is closing.
> >> I feel like P4-TC could replace the use of BMv2 in a lot of applications and if it were upstreamed, it'd eventually be available on all Linux machines. The ability to write custom externs
> >> is very compelling. Eventual HW offload using the same code will be game-changing. BMv2 is a big C++ program and somewhat intimidating to dig into to make enhancements, especially at the architectural level.
> >> There is no HW offload path, and it's not really fast, so it remains mainly a researchy-thing and will stay that way. P4-TC could span the needs from research to production in SW, and performant production with HW offload.
> >> >
> >> > Tom
> >> >
> >> > >
> >> > >
> >> > > Anjali
> >> >