RE: [PATCH net-next v12 00/15] Introducing P4TC (series 1)

>From: Paolo Abeni <pabeni@xxxxxxxxxx>
>Sent: Thursday, February 29, 2024 9:14 AM
>To: Jamal Hadi Salim <jhs@xxxxxxxxxxxx>; netdev@xxxxxxxxxxxxxxx
>Cc: deb.chatterjee@xxxxxxxxx; anjali.singhai@xxxxxxxxx; namrata.limaye@xxxxxxxxx; tom@xxxxxxxxxx; mleitner@xxxxxxxxxx; Mahesh.Shirshyad@xxxxxxx; Vipin.Jain@xxxxxxx; tomasz.osinski@xxxxxxxxx; jiri@xxxxxxxxxxx; xiyou.wangcong@xxxxxxxxx; davem@xxxxxxxxxxxxx; edumazet@xxxxxxxxxx; kuba@xxxxxxxxxx; vladbu@xxxxxxxxxx; horms@xxxxxxxxxx; khalidm@xxxxxxxxxx; toke@xxxxxxxxxx; daniel@xxxxxxxxxxxxx; victor@xxxxxxxxxxxx; pctammela@xxxxxxxxxxxx; dan.daly@xxxxxxxxx; andy.fingerhut@xxxxxxxxx; Chris Sommers <chris.sommers@xxxxxxxxxxxx>; mattyk@xxxxxxxxxx; bpf@xxxxxxxxxxxxxxx
>Subject: Re: [PATCH net-next v12 00/15] Introducing P4TC (series 1)
>
>On Sun, 2024-02-25 at 11:54 -0500, Jamal Hadi Salim wrote:
>> This is the first patchset of two. In this patchset we are submitting 15 patches
>> which cover the minimal viable P4 PNA architecture.
>> 
>> __Description of these Patches__
>> 
>> Patch #1 adds infrastructure for per-netns P4 actions that can be created on
>> an as-needed basis to meet the P4 program's requirements. This patch makes a small
>> incision into act_api. Patches 2-4 are minimalist enablers for P4TC and have no
>> effect on the classical tc action (for example, patch #2 just increases the size
>> of the action names from 16->64B).
>> Patch 5 adds infrastructure support for preallocation of dynamic actions.
>> 
>> The core P4TC code implements several P4 objects.
>> 1) Patch #6 introduces P4 data types which are consumed by the rest of the code
>> 2) Patch #7 introduces the templating API. i.e. CRUD commands for templates
>> 3) Patch #8 introduces the concept of templating Pipelines, i.e. CRUD commands
>>    for P4 pipelines.
>> 4) Patch #9 introduces the action templates and associated CRUD commands.
>> 5) Patch #10 introduces the action runtime infrastructure.
>> 6) Patch #11 introduces the concept of P4 table templates and associated
>>    CRUD commands for tables.
>> 7) Patch #12 introduces runtime table entry infra and associated CU commands.
>> 8) Patch #13 introduces runtime table entry infra and associated RD commands.
>> 9) Patch #14 introduces interaction of eBPF to P4TC tables via kfunc.
>> 10) Patch #15 introduces the TC classifier P4 used at runtime.
>> 
>> Daniel, please look again at patch #15.
>> 
>> There are a few more patches (5) not in this patchset that deal with test
>> cases, etc.
>> 
>> What is P4?
>> -----------
>> 
>> The Programming Protocol-independent Packet Processors (P4) is an open source,
>> domain-specific programming language for specifying data plane behavior.
>> 
>> The current P4 landscape includes an extensive range of deployments, products,
>> projects and services, etc[9][12]. Two major NIC vendors, Intel[10] and AMD[11]
>> currently offer P4-native NICs. P4 is currently curated by the Linux
>> Foundation[9].
>> 
>> On why P4 - see small treatise here:[4].
>> 
>> What is P4TC?
>> -------------
>> 
>> P4TC is a net-namespace aware P4 implementation over TC; meaning, a P4 program
>> and its associated objects and state are attached to a kernel _netns_ structure.
>> IOW, if we had two programs across netns' or within a netns they have no
>> visibility to each other's objects (unlike for example TC actions whose kinds are
>> "global" in nature or eBPF maps vis-a-vis bpftool).
>> 
>> P4TC builds on top of many years of Linux TC experiences of a netlink control
>> path interface coupled with a software datapath with an equivalent offloadable
>> hardware datapath. In this patch series we are focussing only on the s/w
>> datapath. The s/w and h/w path equivalence that TC provides is relevant
>> for a primary use case of P4 where some (currently) large consumers of NICs
>> provide vendors their datapath specs in P4. In such a case one could generate
>> specified datapaths in s/w and test/validate the requirements before hardware
>> acquisition(example [12]).
>> 
>> Unlike other approaches such as TC Flower which require kernel and user space
>> changes when new datapath objects like packet headers are introduced P4TC, with
>> these patches, provides _kernel and user space code change independence_.
>> Meaning:
>> A P4 program describes headers, parsers, etc alongside the datapath processing;
>> the compiler uses the P4 program as input and generates several artifacts which
>> are then loaded into the kernel to manifest the intended datapath. In addition
>> to the generated datapath, control path constructs are generated. The process is
>> described further below in "P4TC Workflow".
>> 
>> There have been many discussions and meetings within the community since
>> about 2015 in regards to P4 over TC[2] and we are finally proving to the
>> naysayers that we do get stuff done!
>> 
>> A lot more of the P4TC motivation is captured at:
>> https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md
>> 
>> __P4TC Architecture__
>> 
>> The current architecture was described at netdevconf 0x17[14] and if you prefer
>> academic conference papers, a short paper is available here[15].
>> 
>> There are 4 parts:
>> 
>> 1) A Template CRUD provisioning API for manifesting a P4 program and its
>> associated objects in the kernel. The template provisioning API uses netlink.
>> See patch in part 2.
>> 
>> 2) A Runtime CRUD+ API which is used for controlling the different runtime
>> behavior of the P4 objects. The runtime API uses netlink. See notes further
>> down and the patch descriptions later.
>> 
>> 3) P4 objects and their control interfaces: tables, actions, externs, etc.
>> Any object that requires control plane interaction resides in the TC domain
>> and is subject to the CRUD runtime API.  The intended goal is to make use of the
>> tc semantics of skip_sw/hw to target P4 program objects either in s/w or h/w.
>> 
>> 4) S/W Datapath code hooks. The s/w datapath is eBPF based and is generated
>> by a compiler based on the P4 spec. When accessing any P4 object that requires
>> control plane interfaces, the eBPF code accesses the P4TC side from #3 above
>> using kfuncs.
>> 
>> The generated eBPF code is derived from [13] with enhancements and fixes to meet
>> our requirements.
>> 
>> __P4TC Workflow__
>> 
>> The Development and instantiation workflow for P4TC is as follows:
>> 
>>   A) A developer writes a P4 program, "myprog"
>> 
>>   B) Compiles it using the P4C compiler[8]. The compiler generates 3 outputs:
>> 
>>      a) A shell script which forms template definitions for the different P4
>>      objects "myprog" utilizes (tables, externs, actions etc). See #1 above.
>> 
>>      b) The parser and the rest of the datapath are generated as eBPF and need
>>      to be compiled into binaries. At the moment the parser and the main control
>>      block are generated as separate eBPF programs but this could change in
>>      the future (without affecting any kernel code). See #4 above.
>> 
>>      c) A json introspection file used for the control plane (by iproute2/tc).
>> 
>>   C) At this point the artifacts from #1,#4 could be handed to an operator
>>      (the operator could be the same person as the developer from #A, #B).
>> 
>>      i) For the eBPF part, either the operator is handed an ebpf binary or
>>      source which they compile at this point into a binary.
>>      The operator executes the shell script(s) to manifest the functional
>>      "myprog" into the kernel.
>> 
>>      ii) The operator instantiates "myprog" pipeline via the tc P4 filter
>>      to ingress/egress (depending on P4 arch) of one or more netdevs/ports
>>      (illustrated below as "block 22").
>> 
>>      Example instantiation where the parser is a separate action:
>>        "tc filter add block 22 ingress protocol all prio 10 p4 pname myprog \
>>         action bpf obj $PARSER.o section p4tc/parse \
>>         action bpf obj $PROGNAME.o section p4tc/main"
>> 
>> See individual patches in partc for more examples tc vs xdp etc. Also see
>> section on "challenges" (further below on this cover letter).
>> 
>> Once "myprog" P4 program is instantiated one can start performing operations
>> on table entries and/or actions at runtime as described below.
>> 
>> __P4TC Runtime Control Path__
>> 
>> The control interface builds on past tc experience and tries to get things
>> right from the beginning (example filtering is separated from depending
>> on existing object TLVs and made generic); also the code is written in
>> such a way it is mostly lockless.
>> 
>> The P4TC control interface, using netlink, provides what we call a CRUDPS
>> abstraction which stands for: Create, Read(get), Update, Delete, Subscribe,
>> Publish.  From a high level PoV the following describes a conformant high level
>> API (both on netlink data model and code level):
>> 
>>  Create(</path/to/object, DATA>+)
>>  Read(</path/to/object>, [optional filter])
>>  Update(</path/to/object, DATA>+)
>>  Delete(</path/to/object>, [optional filter])
>>  Subscribe(</path/to/object>, [optional filter])
>> 
>> Note, we _don't_ treat "dump" or "flush" as special. If "path/to/object" points
>> to a table then a "Delete" implies "flush" and a "Read" implies dump but if
>> it points to an entry (by specifying a key) then "Delete" implies deleting
>> an entry and "Read" implies reading that single entry. It should be noted that
>> both "Delete" and "Read" take an optional filter parameter. The filter can
>> define further refinements to what the control plane wants read or deleted.
>> "Subscribe" uses built in netlink event management. It, as well, takes a filter
>> which can further refine what events get generated to the control plane (taken
>> out of this patchset, to be re-added with consideration of [16]).
>> 
>> Let's show some runtime samples:
>> 
>> ..create an entry, if we match ip address 10.0.1.2 send packet out eno1
>>   tc p4ctrl create myprog/table/mytable \
>>    dstAddr 10.0.1.2/32 action send_to_port param port eno1
>> 
>> ..Batch create entries
>>   tc p4ctrl create myprog/table/mytable \
>>   entry dstAddr 10.1.1.2/32  action send_to_port param port eno1 \
>>   entry dstAddr 10.1.10.2/32  action send_to_port param port eno10 \
>>   entry dstAddr 10.0.2.2/32  action send_to_port param port eno2
>> 
>> ..Get an entry (note "read" is interchangeably used as "get" which is a common
>>      semantic in tc):
>>   tc p4ctrl read myprog/table/mytable \
>>    dstAddr 10.0.2.2/32
>> 
>> ..dump mytable
>>   tc p4ctrl read myprog/table/mytable
>> 
>> ..dump mytable for all entries whose key fits within 10.1.0.0/16
>>   tc p4ctrl read myprog/table/mytable \
>>   filter key/myprog/mytable/dstAddr = 10.1.0.0/16
>> 
>> ..dump all mytable entries which have an action send_to_port with param "eno1"
>>   tc p4ctrl get myprog/table/mytable \
>>   filter param/act/myprog/send_to_port/port = "eno1"
>> 
>> The filter expression is powerful, e.g. you could say:
>> 
>>   tc p4ctrl get myprog/table/mytable \
>>   filter param/act/myprog/send_to_port/port = "eno1" && \
>>          key/myprog/mytable/dstAddr = 10.1.0.0/16
>> 
>> It also works on built in metadata, example in the following case dumping
>> entries from mytable that have seen activity in the last 10 secs:
>>   tc p4ctrl get myprog/table/mytable \
>>   filter msecs_since < 10000
>> 
>> Delete follows the same syntax as get/read, so for the sake of brevity we won't
>> show more examples than how to flush mytable:
>> 
>>   tc p4ctrl delete myprog/table/mytable
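
To make the table-versus-entry semantics above concrete, here is a minimal Python sketch of a toy in-memory table. It is not the P4TC implementation: the `Table` class, its method names and the filter callable are hypothetical stand-ins for the netlink-backed objects. It only illustrates that Read and Delete become "dump" and "flush" when no key is given, and that an optional filter narrows the selection.

```python
from ipaddress import ip_network


class Table:
    """Toy model of a table keyed by IPv4 prefix (hypothetical, not P4TC code)."""

    def __init__(self):
        self.entries = {}  # prefix -> (action name, action params)

    def create(self, key, action, **params):
        self.entries[ip_network(key)] = (action, params)

    def _select(self, key=None, flt=None):
        # A key selects a single entry; no key selects the whole table,
        # optionally narrowed by a filter predicate.
        if key is not None:
            k = ip_network(key)
            return [k] if k in self.entries else []
        return [k for k, v in self.entries.items() if flt is None or flt(k, v)]

    def read(self, key=None, flt=None):
        return {str(k): self.entries[k] for k in self._select(key, flt)}

    def delete(self, key=None, flt=None):
        victims = self._select(key, flt)
        for k in victims:
            del self.entries[k]
        return len(victims)


t = Table()
t.create("10.1.1.2/32", "send_to_port", port="eno1")
t.create("10.1.10.2/32", "send_to_port", port="eno10")
t.create("10.0.2.2/32", "send_to_port", port="eno2")

# Filter: keep entries whose key fits within 10.1.0.0/16.
within = lambda k, v: k.subnet_of(ip_network("10.1.0.0/16"))

dump_all = t.read()                   # Read on the table: dump
dump_filtered = t.read(flt=within)    # dump with a filter
one = t.read(key="10.0.2.2/32")       # Read on an entry: that single entry
flushed = t.delete()                  # Delete on the table: flush
```

The same `_select` helper serves both `read` and `delete`, which mirrors the point that neither "dump" nor "flush" needs a dedicated command.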
>> 
>> Mystery question: How do we achieve iproute2-kernel independence and
>> how does "tc p4ctrl" as a cli know how to program the kernel given an
>> arbitrary command line as shown above? Answer(s): It queries the
>> compiler generated json file in "P4TC Workflow" #B.c above. The json file has
>> enough details to figure out that we have a program called "myprog" which has a
>> table "mytable" that has a key name "dstAddr" which happens to be type ipv4
>> address prefix. The json file also provides details to show that the table
>> "mytable" supports an action called "send_to_port" which accepts a parameter
>> "port" of type netdev (see the types patch for all supported P4 data types).
>> All P4 components have names, IDs, and types - so this makes it very easy to map
>> into netlink.
>> Once user space tc/p4ctrl validates the human command input, it creates
>> standard binary netlink structures (TLVs etc) which are sent to the kernel.
>> See the runtime table entry patch for more details.
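
As a rough illustration of the lookup described above, the following Python sketch resolves a human-readable command against an introspection document and produces the numeric IDs a netlink message would carry. The JSON layout, the type names ("ipv4-prefix", "dev") and the `resolve` helper are assumptions made for this example; the file actually generated by the compiler may be structured differently.

```python
import json
from ipaddress import ip_network

# Hypothetical introspection data; the real compiler output may differ.
INTROSPECTION = json.loads("""
{
  "pipeline": {"name": "myprog", "id": 1},
  "tables": [
    {"name": "mytable", "id": 1,
     "keys": [{"name": "dstAddr", "id": 1, "type": "ipv4-prefix"}],
     "actions": [
       {"name": "send_to_port", "id": 1,
        "params": [{"name": "port", "id": 1, "type": "dev"}]}
     ]}
  ]
}
""")


def find(items, name, what):
    """Return the named object or reject the command."""
    for item in items:
        if item["name"] == name:
            return item
    raise ValueError(f"unknown {what}: {name!r}")


def resolve(path, key_name, key_value, action_name, params):
    """Validate a command against the introspection data and map names to IDs."""
    pname, kind, tname = path.split("/")
    if pname != INTROSPECTION["pipeline"]["name"] or kind != "table":
        raise ValueError(f"unknown path: {path!r}")
    table = find(INTROSPECTION["tables"], tname, "table")
    key = find(table["keys"], key_name, "key")
    if key["type"] == "ipv4-prefix":
        ip_network(key_value)  # raises ValueError on a malformed prefix
    action = find(table["actions"], action_name, "action")
    param_ids = {find(action["params"], n, "param")["id"]: v
                 for n, v in params.items()}
    return {
        "pipeline_id": INTROSPECTION["pipeline"]["id"],
        "table_id": table["id"],
        "key_id": key["id"],
        "key": key_value,
        "action_id": action["id"],
        "params": param_ids,
    }


msg = resolve("myprog/table/mytable", "dstAddr", "10.0.1.2/32",
              "send_to_port", {"port": "eno1"})
```

Because every name is resolved from data rather than from compiled-in tables, adding a new table or key to the program changes only the JSON, not the tool.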
>> 
>> __P4TC Datapath__
>> 
>> The P4TC s/w datapath execution is generated as eBPF. Any objects that require
>> control interfacing reside in the "P4TC domain" and are controlled via netlink
>> as described above. Per packet execution and state and even objects that do not
>> require control interfacing (like the P4 parser) are generated as eBPF.
>> 
>> A packet arriving on s/w ingress of any of the ports on block 22 will first be
>> exercised via the (generated eBPF) parser component to extract the headers (the
>> IP destination address is labelled "dstAddr" above).
>> The datapath then proceeds to use "dstAddr", table ID and pipeline ID
>> as a key to do a lookup in myprog's "mytable" which returns the action params
>> which are then used to execute the action in the eBPF datapath (eventually
>> sending out packets to eno1).
>> On a table miss, mytable's default miss action (not described) is executed.
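
The lookup step above can be modelled in a few lines of Python. This is only a sketch under stated assumptions: the numeric IDs are made up, longest-prefix matching is assumed because the key is a prefix type, and the "drop" default miss action is a placeholder since the cover letter does not describe it. The real datapath is generated eBPF calling into P4TC via kfuncs, not Python.

```python
from ipaddress import ip_address, ip_network

PIPELINE_ID, TABLE_ID = 1, 1  # hypothetical IDs for "myprog" and "mytable"

# Entries are keyed by (pipeline ID, table ID, prefix) and hold action params.
entries = {
    (PIPELINE_ID, TABLE_ID, ip_network("10.0.1.2/32")):
        ("send_to_port", {"port": "eno1"}),
    (PIPELINE_ID, TABLE_ID, ip_network("10.1.0.0/16")):
        ("send_to_port", {"port": "eno10"}),
}

# Placeholder default miss action; the actual one is not described here.
DEFAULT_MISS = ("drop", {})


def lookup(pipeline_id, table_id, dst_addr):
    """Return the action for the longest matching prefix, else the miss action."""
    addr = ip_address(dst_addr)
    best = None
    for (pid, tid, prefix), action in entries.items():
        if (pid, tid) != (pipeline_id, table_id) or addr not in prefix:
            continue
        if best is None or prefix.prefixlen > best[0].prefixlen:
            best = (prefix, action)
    return best[1] if best else DEFAULT_MISS


hit = lookup(PIPELINE_ID, TABLE_ID, "10.0.1.2")
wider_hit = lookup(PIPELINE_ID, TABLE_ID, "10.1.5.5")
miss = lookup(PIPELINE_ID, TABLE_ID, "192.168.0.1")
```

Including the pipeline and table IDs in the key is what keeps entries of different programs and tables from colliding in a shared lookup structure.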
>> 
>> __Testing__
>> 
>> Speaking of testing - we have 200-300 tdc test cases (which will be in the
>> second patchset).
>> These tests are run on our CICD system on pull requests and after commits are
>> approved. The CICD does a lot of other tests (more since v2, thanks to Simon's
>> input) including:
>> checkpatch, sparse, smatch, coccinelle, 32 bit and 64 bit builds tested on both
>> X86, ARM 64 and emulated BE via qemu s390. We trigger performance testing in the
>> CICD to catch performance regressions (currently only on the control path, but
>> in the future for the datapath).
>> Syzkaller runs 24/7 on dedicated hardware, originally we focussed only on memory
>> sanitizer but recently added support for concurrency sanitizer.
>> Before main releases we ensure each patch will compile on its own to help in
>> git bisect and run the xmas tree tool. We eventually put the code via coverity.
>> 
>> In addition we are working on enabling a tool that will take a P4 program, run
>> it through the compiler, and generate permutations of traffic patterns via
>> symbolic execution that will test both positive and negative datapath code
>> paths. The test generator tool integration is still work in progress.
>> Also: We have other code that test parallelization etc which we are trying to
>> find a fit for in the kernel tree's testing infra.
>> 
>> 
>> __References__
>> 
>> [1] https://github.com/p4tc-dev/docs/blob/main/p4-conference-2023/2023P4WorkshopP4TC.pdf
>> [2] https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md#historical-perspective-for-p4tc
>> [3] https://2023p4workshop.sched.com/event/1KsAe/p4tc-linux-kernel-p4-implementation-approaches-and-evaluation
>> [4] https://github.com/p4tc-dev/docs/blob/main/why-p4tc.md#so-why-p4-and-how-does-p4-help-here
>> [5] https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@xxxxxxxxxxxx/T/#mf59be7abc5df3473cff3879c8cc3e2369c0640a6
>> [6] https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@xxxxxxxxxxxx/T/#m783cfd79e9d755cf0e7afc1a7d5404635a5b1919
>> [7] https://lore.kernel.org/netdev/20230517110232.29349-3-jhs@xxxxxxxxxxxx/T/#ma8c84df0f7043d17b98f3d67aab0f4904c600469
>> [8] https://github.com/p4lang/p4c/tree/main/backends/tc
>> [9] https://p4.org/
>> [10] https://www.intel.com/content/www/us/en/products/details/network-io/ipu/e2000-asic.html
>> [11] https://www.amd.com/en/accelerators/pensando
>> [12] https://github.com/sonic-net/DASH/tree/main
>> [13] https://github.com/p4lang/p4c/tree/main/backends/ebpf
>> [14] https://netdevconf.info/0x17/sessions/talk/integrating-ebpf-into-the-p4tc-datapath.html
>> [15] https://dl.acm.org/doi/10.1145/3630047.3630193
>> [16] https://lore.kernel.org/netdev/20231216123001.1293639-1-jiri@xxxxxxxxxxx/
>> [17.a] https://netdevconf.info/0x13/session.html?talk-tc-u-classifier
>> [17.b] man tc-u32
>> [18] man tc-pedit
>> [19] https://lore.kernel.org/netdev/20231219181623.3845083-6-victor@xxxxxxxxxxxx/T/#m86e71743d1d83b728bb29d5b877797cb4942e835
>> [20.a] https://netdevconf.info/0x16/sessions/talk/your-network-datapath-will-be-p4-scripted.html
>> [20.b] https://netdevconf.info/0x16/sessions/workshop/p4tc-workshop.html
>> 
>> --------
>> HISTORY
>> --------
>> 
>> Changes in Version 12
>> ----------------------
>> 
>> 0) Introduce back 15 patches (v11 had 5)
>> 
>> 1) From discussions with Daniel:
>>    i) Remove the XDP programs association altogether. No refcounting. Nothing.
>>    ii) Remove prog type tc - everything is now an ebpf tc action.
>> 
>> 2) s/PAD0/__pad0/g. Thanks to Marcelo.
>> 
>> 3) Add extack to specify how many entries (N of M) specified in a batch for
>>    any of requested Create/Update/Delete succeeded. Prior to this it would
>>    only tell us the batch failed to complete without giving us details of
>>    which of M failed. Added as a debug aid.
>> 
>> Changes in Version 11
>> ----------------------
>> 1) Split the series into two. Original patches 1-5 in this patchset. The rest
>>    will go out after this is merged.
>> 
>> 2) Change any references of IFNAMSIZ in the action code when referencing the
>>    action name size to ACTNAMSIZ. Thanks to Marcelo.
>> 
>> Changes in Version 10
>> ----------------------
>> 1) A couple of patches from the earlier version were clean enough to submit,
>>    so we did. This gave us room to split the two largest patches each into
>>    two. Even though the split is not git-bisectable and really some of it didn't
>>    make much sense (e.g. splitting create and update into one patch and delete and
>>    get into another) we made sure each of the split patches compiled
>>    independently. The idea is to reduce the number of lines of code to review
>>    and when we get sufficient reviews we will put the splits together again.
>>    See patches #12 and #13 as well as patches #7 and #8.
>> 
>> 2) Add more context in patch 0. Please READ!
>> 
>> 3) Added dump/delete filters back to the code - we had taken them out in the
>>    earlier patches to reduce the amount of code for review - but in retrospect
>>    we feel they are important enough to push earlier rather than later.
>> 
>> 
>> Changes In version 9
>> ---------------------
>> 
>> 1) Remove the largest patch (externs) to ease review.
>> 
>> 2) Break up action patches into two to ease review bringing down the patches
>>    that need more scrutiny to 8 (the first 7 are almost trivial).
>> 
>> 3) Fixup prefix naming convention to p4tc_xxx for uapi and p4a_xxx for actions
>>    to provide consistency(Jiri).
>> 
>> 4) Silence sparse warning "was not declared. Should it be static?" for kfuncs
>>    by making them static. TBH, not sure if this is the right solution
>>    but it makes sparse happy and hopefully someone will comment.
>> 
>> Changes In Version 8
>> ---------------------
>> 
>> 1) Fix all the patchwork warnings and improve our ci to catch them in the future
>> 
>> 2) Reduce the number of patches to basic max(15)  to ease review.
>> 
>> Changes In Version 7
>> -------------------------
>> 
>> 0) First time removing the RFC tag!
>> 
>> 1) Removed XDP cookie. It turns out, as was pointed out by Toke (thanks!), that
>> using bpf links was sufficient to protect us from someone replacing or deleting
>> an eBPF program after it has been bound to a netdev.
>> 
>> 2) Add some reviewed-bys from Vlad.
>> 
>> 3) Small bug fixes from v6 based on testing for ebpf.
>> 
>> 4) Added the counter extern as a sample extern. Illustrating this example because
>>    it is slightly complex since it is possible to invoke it directly from
>>    the P4TC domain (in case of direct counters) or from eBPF (indirect counters).
>>    It is not exactly the most efficient implementation (a reasonable counter impl
>>    should be per-cpu).
>> 
>> Changes In RFC Version 6
>> -------------------------
>> 
>> 1) Completed integration from scriptable view to eBPF. Completed integration
>>    of externs integration.
>> 
>> 2) Small bug fixes from v5 based on testing.
>> 
>> Changes In RFC Version 5
>> -------------------------
>> 
>> 1) More integration from scriptable view to eBPF. Small bug fixes from last
>>    integration.
>> 
>> 2) More streamlining support of externs via kfunc (create-on-miss, etc)
>> 
>> 3) eBPF linking for XDP.
>> 
>> There is more eBPF integration/streamlining coming (we are getting close to
>> conversion from scriptable domain).
>> 
>> Changes In RFC Version 4
>> -------------------------
>> 
>> 1) More integration from scriptable to eBPF. Small bug fixes.
>> 
>> 2) More streamlining support of externs via kfunc (one additional kfunc).
>> 
>> 3) Removed per-cpu scratchpad per Toke's suggestion and instead use XDP metadata.
>> 
>> There is more eBPF integration coming. One thing we looked at but is not in this
>> patchset but should be in the next is use of eBPF link in our loading (see
>> "challenge #1" further below).
>> 
>> Changes In RFC Version 3
>> -------------------------
>> 
>> These patches are still in a little bit of flux as we adjust to integrating
>> eBPF. So there are small constructs that are used in V1 and 2 but no longer
>> used in this version. We will make a V4 which will remove those.
>> The changes from V2 are as follows:
>> 
>> 1) Feedback we got in V2 is to try to stick to one of the two modes. In this version
>> we are taking one more step and going the path of mode2 vs v2 where we had 2 modes.
>> 
>> 2) The P4 Register extern is no longer standalone. Instead, as part of integrating
>> into eBPF we introduce another kfunc which encapsulates Register as part of the
>> extern interface.
>> 
>> 3) We have improved our CICD to include tools pointed to us by Simon. See
>>    "Testing" further below. Thanks to Simon for that and other issues he caught.
>>    Simon, we discussed on issue [7] but decided to keep that log since we think
>>    it is useful.
>> 
>> 4) A lot of small cleanups. Thanks Marcelo. There are two things we need to
>>    re-discuss though; see: [5], [6].
>> 
>> 5) We removed the need for a range of IDs for dynamic actions. Thanks Jakub.
>> 
>> 6) Clarify ambiguity caused by smatch in an if(A) else if(B) condition. We are
>>    guaranteed that either A or B must exist; however, lets make smatch happy.
>>    Thanks to Simon and Dan Carpenter.
>> 
>> Changes In RFC Version 2
>> -------------------------
>> 
>> Version 2 is the initial integration of the eBPF datapath.
>> We took into consideration suggestions provided to use eBPF and put effort into
>> analyzing eBPF as datapath which involved extensive testing.
>> We implemented 6 approaches with eBPF and ran performance analysis and presented
>> our results at the P4 2023 workshop in Santa Clara[see: 1, 3] on each of the 6
>> vs the scriptable P4TC and concluded that 2 of the approaches are sensible (4 if
>> you account for XDP or TC separately).
>> 
>> Conclusions from the exercise: We lose the simple operational model we had
>> prior to integrating eBPF. We do gain performance in most cases when the
>> datapath is less compute-bound.
>> For more discussion on our requirements vs journeying the eBPF path please
>> scroll down to "Restating Our Requirements" and "Challenges".
>> 
>> This patch set presented two modes.
>> mode1: the parser is entirely based on eBPF - whereas the rest of the
>> SW datapath stays as _scriptable_ as in Version 1.
>> mode2: All of the kernel s/w datapath (including parser) is in eBPF.
>> 
>> The key ingredient for eBPF, that we did not have access to in the past, is
>> kfunc (it made a big difference for us to reconsider eBPF).
>> 
>> In V2 the two modes are mutually exclusive (IOW, you get to choose one
>> or the other via Kconfig).
>
>I think/fear that this series has a "quorum" problem: different voices
>raised opposition, and nobody (?) outside the authors supported the
>code and the feature.
>
>Could the lack of H/W offload support in the current form be the
>root cause of such lack of support? Or are there interested parties that
>have been quiet so far?
>
>Thanks,
>
>Paolo
>

Hi Paolo, thanks. I am one of those "interested parties that have been quiet so far."

I wanted to voice my staunch support for accepting P4TC into the kernel. None of the present objections in the various threads reduce my enthusiasm. I find the following aspects most compelling:

- Performant, highly functional, pure-SW P4 dataplane

- Near-ubiquitous availability on all platforms, once it's upstreamed. Saves having to install a bunch of other p4 ecosystem tools, lowers the barrier to entry, and increases the likelihood an application can run on any platform.

- Larger dev community. Anything added to the Linux kernel benefits from a large, thriving community, vast and rigorous regression testing, long-term support, etc.

- Well-conceived CRUDX northbound API and clever use of existing well-understood netlink, easy to overlay other northbound APIs such as TDI (Table driven interface) used in IPDK; P4Runtime gRPC API; etc.

- Integration with popular and well-understood tc provides a good impedance match for users.

- Extensibility, ability to add externs, and interface to eBPF. The ability to add externs is especially compelling. It is not easy to do so in current backends such as bmv2, P4DPDK or p4-ebpf.

- Roadmap to hardware offload for even greater performance. Even _without_ offload, the above benefits justify it in my mind. There are many applications for a pure-SW P4 dataplane, both in userland like P4DPDK, and the proposed P4TC - running as part of the kernel is _exciting_. Vendors have already voiced their support for offload and this initial set of patches paves the way and lets the community benefit from it and start to make it better, now.

It is possible the detractors of P4TC are not active P4 users, so I hope to provide a bit of perspective. Besides the pioneering switch ASIC (Tofino) use-cases which provided the initial impetus, P4 is used extensively in at least two commercial IPUs/DPUs. In addition, there are multiple toolchains to run P4 code on FPGAs. The dream is to write P4 code which can be run in a scalable fashion on a range of targets. It shouldn’t be necessary to “prove” P4 is worthy; those who’ve already embraced it know this.

There are several use-cases for a SW implementation of a P4 dataplane, including behavioral modeling and production uses. P4 allows one to write core functionality which can run on multiple platforms: pure SW, FPGAs, offload NICs/DPUs/IPUs, switch ASICs.

Behavioral modeling of a pipeline using P4:

- The SONiC-DASH project (https://github.com/sonic-net/DASH) is a thriving, multi-vendor collaboration which specifies advanced, high-performance features to accelerate datacenter services. These overlay services are specified using a P4 program which allows all concerned to agree on the packet pipeline and even the control-plane APIs (using SAI, the Switch Abstraction Interface). The actual implementation on a vendor's offload device (DPU/IPU) may or may not use any of the reference P4 code, but that is not important. What is important is that we specify the dataplane in P4, and execute it on the bmv2 backend in a container. We run conformance and regression suites with standard test vectors, which can also be run against actual production implementations to verify compliance. The bmv2 backend has many limitations, including performance and the difficulty of extending its functionality. As a major contributor to this project, I am helping to explore alternatives.

- Large-scale cloud-service providers use P4 extensively as a dataplane (fabric switch) modeling language. One of the driving use-cases in the P4-API working group (I’m a co-chair) is to control SDN switches using P4-Runtime. The switches’ pipelines are modeled in P4 by some users, similar to the DASH use-case. Having a performant, pure-SW implementation is invaluable for modeling and simulation.

Running P4 code in pure SW for production use-cases (not just modeling):

There are many use-cases for running a custom dataplane written in P4. The productivity of P4 code cannot be overstated. With the right framework, P4 apps can be developed (and controlled/managed) in literally hours. It is much more productive than writing, say, C or eBPF. I can do all three, and P4 is way more productive for certain applications.

In conclusion, I hope we can upstream P4TC soon. Please move this forward with all due speed. Thanks!

Chris Sommers
Keysight Technologies



