Re: On the NACKs on P4TC patches

Tom Herbert <tom@xxxxxxxxxx> · Wed, 29 May 2024 07:45:55 -0700

On Wed, May 29, 2024 at 4:01 AM Jamal Hadi Salim <jhs@xxxxxxxxxxxx> wrote:
>
>
>
> On Tue, May 28, 2024 at 7:43 PM Chris Sommers <chris.sommers@xxxxxxxxxxxx> wrote:
>>
>> > On Tue, May 28, 2024 at 3:17 PM Singhai, Anjali
>> > <anjali.singhai@xxxxxxxxx> wrote:
>> > >
>> > > >From: John Fastabend <john.fastabend@xxxxxxxxx>
>> > > >Sent: Tuesday, May 28, 2024 1:17 PM
>> > >
>> > > >Jain, Vipin wrote:
>> > > >> [AMD Official Use Only - AMD Internal Distribution Only]
>> > > >>
>> > > >> My apologies, earlier email used html and was blocked by the list...
>> > > >> My response at the bottom as "VJ>"
>> > > >>
>> > > >> ________________________________________
>> > >
>> > > >Anjali and Vipin, is your support for HW support of P4 or a Linux SW implementation of P4? If it's for HW support, what drivers would we want to support? Can you describe how to program these devices?
>> > >
>> > > >At the moment there hasn't been any movement on the Linux hardware P4 support side as far as I can tell. Yes, there are some SDKs and build kits floating around for FPGAs. For example, maybe start with which drivers in the kernel tree run the DPUs that have this support? I think this would be a productive direction to go if we in fact have hardware support in the works.
>> > >
>> > > >If you want a SW implementation in Linux, my opinion is still that pushing a DSL into the kernel datapath via qdisc/tc is the wrong direction. Mapping P4 onto hardware blocks is a fundamentally different architecture from mapping P4 onto a general purpose CPU and registers. My opinion: to handle this you need a per-architecture backend/JIT to compile the P4 to native instructions. This will give you the most flexibility to define new constructs, best performance, and lowest overhead runtime. We have a P4 BPF backend already and JITs for most architectures. I don't see the need for P4TC in this context.
>> > >
>> > > >If the end goal is a hardware offload control plane, I'm skeptical we even need something specific just for the SW datapath. I would propose devlink or new infra to program the device directly vs. the overhead and complexity of abstracting through 'tc'. If you want to emulate your device, use BPF or a user space datapath.
>> > >
>> > > >.John
>> > >
>> > >
>> > > John,
>> > > Let me start by saying production hardware exists. I think Jamal posted some links, but I can point you to our hardware.
>> > > The hardware devices under discussion are capable of being abstracted using the P4 match-action paradigm, so that's why we chose TC.
>> > > These devices are programmed using the TC/netlink interface, i.e. the standard TC control-driver ops apply. While it is clear to us that the P4TC abstraction suffices, we are currently discussing details that will cater for all vendors in our biweekly meetings.
>> > > One big requirement is that we want to avoid the flower trap: we don't want to be changing kernel/user/driver code every time we add new datapaths.
>> > > We feel the P4TC approach is the path to add Linux kernel support.
>> > >
>> > > The s/w path is needed as well for several reasons.
>> > > We need the same P4 program to run either in software or hardware or in both using skip_sw/skip_hw. It could be either in split mode or as an exception path as it is done today in flower or u32. Also it is common now in the P4 community that people define their datapath using their program and will write a control application that works for both hardware and software datapaths. They could be using the software datapath for testing as you said but also for the split/exception path. Chris can probably add more comments on the software datapath.
>>
>> Anjali, thanks for asking. Agreed, I like the flexibility of accommodating a variety of platforms depending upon performance requirements and intended target system. For me, flexibility is important. Some solutions need an inline filter and P4-TC makes it so easy. The fact I will be able to get HW offload means I'm not performance bound. Some other solutions might need DPDK implementation, so P4-DPDK is a choice there as well, and there are acceleration options. Keeping much of the dataplane design in one language (P4) makes it easier for more developers to create products without having to be platform-level experts. As someone who's worked with P4 Tofino, P4-TC, bmv2, etc. I can authoritatively state that all have their proper place.
>> >
>> > Hi Anjali,
>> >
>> > Are there any use cases of P4-TC that don't involve P4 hardware? If
>> > someone wanted to write one-off datapath code for their deployment and
>> > they didn't have P4 hardware would you suggest that they write their
>> > code in P4-TC? The reason I ask is because I'm concerned about the
>> > performance of P4-TC. Like John said, this is mapping code that is
>> > intended to run in specialized hardware into a CPU, and it's also
>> > interpreted execution in TC. The performance numbers in
>> > https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
>> > seem to show that P4-TC has about half the performance of XDP. Even
>> > with a lot of work, it's going to be difficult to substantially close
>> > that gap.
>>
>> AFAIK P4-TC can emit XDP or eBPF code depending upon the situation, someone more knowledgeable should chime in.
>> However, I don't agree that comparing the speeds of XDP vs. P4-TC should even be a deciding factor.
>> If P4-TC is good enough for a lot of applications, that is fine by me and over time it'll only get better.
>> If we held back every innovation because it was slower than something else, progress would suffer.
>> >
>
>
> Yes, XDP can be emitted based on compiler options (and was a motivating factor in considering use of eBPF). Tom's comment above seems to treat the fact that XDP tends to be faster than TC with eBPF as the fault of P4TC.
> In any case this statement falls under: https://github.com/p4tc-dev/pushback-patches?tab=readme-ov-file#2b-comment-but--it-is-not-performant

Jamal,

From that: "My response has always consistently been: performance is a
lower priority to P4 correctness and expressibility." That might be
true for P4, but not for the kernel. CPU performance is important, and
your statement below that justifies offloads on the basis that "no
general purpose CPU will save you" confirms that. Please be more
upfront about what the performance is like, including performance
numbers in the cover letter for the next patch set. This is the best
way to avoid confusion and rampant speculation, and if performance
isn't stellar, being open about it in the community is the best way to
figure out how to improve it.
>
> On Tom's theory that the vendors are going to push inferior s/w for the sake of selling h/w: we are not in the 90s anymore and there's no vendor conspiracy theory here: a single port can do 100s of Gbps, and of course if you want to do high speed you need to offload, no general purpose CPU will save you.

Let's not pretend that offloads are a magic bullet that just makes
everything better. If that were true then we'd all be using TOE by
now! There are a myriad of factors to consider in deciding whether
offloading is worth it. What is "high speed"? Is this small packets or
big packets? Are we terminating TCP? Are we doing some sort of
fast/slow path split which might work great in the lab but on the
Internet can become a DoS vector? What's the application? Are we just
trying to offload parts of the datapath, TCP, RDMA, memcached, ML
reduce operations? Are we trying to do line rate encryption,
compression, or a billion PCB lookups a second? Are we taking into
account continuing advancements in the CPU that have in the past made
offloads obsolete (for instance, AES instructions pretty much
obsoleted initial attempts to offload IPsec)? How simple is the
programming model, how debuggable is it, what's the TCO?

I do believe offload is part of the solution. And the good news is
that programmable devices facilitate that. IMO, our challenge is to
create a facility in the kernel to support offloads in a much better
way (I don't believe there's disagreement with these points).

Tom

>
> cheers,
> jamal
>
>>
>> > The risk if we allow this into the kernel is that a vendor might be
>> > tempted to point to P4-TC performance as a baseline to justify to
>> > customers that they need to buy specialized hardware to get
>> > performance, whereas if XDP was used maybe they don't need the
>> > performance and cost of hardware.
>>
>> I really don't buy this argument, it's FUD. Let's judge P4-TC on its merits, not prejudge it as a ploy to sell vendor hardware.
>>
>> > Note, this scenario already happened
>> > once before, when the DPDK joined LF they made bogus claims that they
>> > got a 100x performance over the kernel-- had they put at least the
>> > slightest effort into tuning the kernel that would have dropped the
>> > delta by an order of magnitude, and since then we've pretty much
>> > closed the gap (actually, this is precisely what motivated the
>> > creation of XDP so I guess that story had a happy ending!) . There are
>> > circumstances where hardware offload may be warranted, but it needs to
>> > be honestly justified by comparing it to an optimized software
>> > solution-- so in the case of P4, it should be compared to well written
>> > XDP code for instance, not P4-TC.
>>
>> I strongly disagree that "it needs to be honestly justified by comparing it to an optimized software solution."
>> Says who? This is no more factual than saying "C or golang need to be judged by comparing it to assembly language."
>> Today the gap between C and assembly is small, but way back in my career, C was way slower.
>> Over time optimizing compilers have closed the gap. Who's to say P4 technologies won't do the same?
>> P4-TC can be judged on its own merits for its utility and productivity. I can't stress enough that P4 is very productive when applied to certain problems.
>>
>> Note, P4-BMv2 has been used by thousands of developers, researchers and students and it is relatively slow. Yet that doesn't deter users.
>> There is a Google Summer of Code project to add PNA support, rather ambitious. However, P4-TC already partially supports PNA and the gap is closing.
>> I feel like P4-TC could replace the use of BMv2 in a lot of applications and if it were upstreamed, it'd eventually be available on all Linux machines. The ability to write custom externs
>> is very compelling. Eventual HW offload using the same code will be game-changing. BMv2 is a big C++ program and somewhat intimidating to dig into to make enhancements, especially at the architectural level.
>> There is no HW offload path, and it's not really fast, so it remains mainly a researchy-thing and will stay that way. P4-TC could span the needs from research to production in SW, and performant production with HW offload.
>> >
>> > Tom
>> >
>> > >
>> > >
>> > > Anjali
>> >