Re: [xdp-hints] Re: [PATCH bpf-next v3 11/12] mlx5: Support RX XDP metadata

Dave Taht <dave.taht@xxxxxxxxx> · Fri, 9 Dec 2022 07:19:57 -0800

On Fri, Dec 9, 2022 at 5:29 AM Jesper Dangaard Brouer
<jbrouer@xxxxxxxxxx> wrote:
>
>
> On 09/12/2022 06.24, Saeed Mahameed wrote:
> > On 08 Dec 18:57, Stanislav Fomichev wrote:
> >> On Thu, Dec 8, 2022 at 4:54 PM Toke Høiland-Jørgensen
> >> <toke@xxxxxxxxxx> wrote:
> >>>
> >>> Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> writes:
> >>>
> >>> > On Thu, Dec 8, 2022 at 4:29 PM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
> >>> >>
> >>> >> Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> writes:
> >>> >>
> >>> >> > On Thu, Dec 8, 2022 at 4:02 PM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
> >>> >> >>
> >>> >> >> Stanislav Fomichev <sdf@xxxxxxxxxx> writes:
> >>> >> >>
> >>> >> >> > On Thu, Dec 8, 2022 at 2:59 PM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
> >>> >> >> >>
> >>> >> >> >> Stanislav Fomichev <sdf@xxxxxxxxxx> writes:
> >>> >> >> >>
> >>> >> >> >> > From: Toke Høiland-Jørgensen <toke@xxxxxxxxxx>
> >>> >> >> >> >
> >>> >> >> >> > Support RX hash and timestamp metadata kfuncs. We need to pass in the cqe
> >>> >> >> >> > pointer to the mlx5e_skb_from* functions so it can be retrieved from the
> >>> >> >> >> > XDP ctx to do this.
> >>> >> >> >>
> >>> >> >> >> So I finally managed to get enough ducks in a row to actually benchmark
> >>> >> >> >> this. With the caveat that I suddenly can't get the timestamp support to
> >>> >> >> >> work (it was working in an earlier version, but now
> >>> >> >> >> timestamp_supported() just returns false). I'm not sure if this is an
> >>> >> >> >> issue with the enablement patch, or if I just haven't gotten the
> >>> >> >> >> hardware configured properly. I'll investigate some more, but figured
> >>> >> >> >> I'd post these results now:
> >>> >> >> >>
> >>> >> >> >> Baseline XDP_DROP:         25,678,262 pps / 38.94 ns/pkt
> >>> >> >> >> XDP_DROP + read metadata:  23,924,109 pps / 41.80 ns/pkt
> >>> >> >> >> Overhead:                   1,754,153 pps /  2.86 ns/pkt
> >>> >> >> >>
> >>> >> >> >> As per the above, this is with calling three kfuncs/pkt
> >>> >> >> >> (metadata_supported(), rx_hash_supported() and rx_hash()). So that's
> >>> >> >> >> ~0.95 ns per function call, which is a bit less, but not far off from
> >>> >> >> >> the ~1.2 ns that I'm used to. The tests where I accidentally called the
> >>> >> >> >> default kfuncs cut off ~1.3 ns for one less kfunc call, so it's
> >>> >> >> >> definitely in that ballpark.
> >>> >> >> >>
> >>> >> >> >> I'm not doing anything with the data, just reading it into an on-stack
> >>> >> >> >> buffer, so this is the smallest possible delta from just getting the
> >>> >> >> >> data out of the driver. I did confirm that the call instructions are
> >>> >> >> >> still in the BPF program bytecode when it's dumped back out from the
> >>> >> >> >> kernel.
> >>> >> >> >>
> >>> >> >> >> -Toke
> >>> >> >> >>
> >>> >> >> >
> >>> >> >> > Oh, that's great, thanks for running the numbers! Will definitely
> >>> >> >> > reference them in v4!
> >>> >> >> > Presumably, we should be able to at least unroll most of the
> >>> >> >> > _supported callbacks if we want, they should be relatively easy; but
> >>> >> >> > the numbers look fine as is?
> >>> >> >>
> >>> >> >> Well, this is for one (and a half) piece of metadata. If we extrapolate
> >>> >> >> it adds up quickly. Say we add csum and vlan tags, say, and maybe
> >>> >> >> another callback to get the type of hash (l3/l4). Those would probably
> >>> >> >> be relevant for most packets in a fairly common setup. Extrapolating
> >>> >> >> from the ~1 ns/call figure, that's 8 ns/pkt, which is 20% of the
> >>> >> >> baseline of 39 ns.
> >>> >> >>
> >>> >> >> So in that sense I still think unrolling makes sense. At least for the
> >>> >> >> _supported() calls, as eating a whole function call just for that is
> >>> >> >> probably a bit much (which I think was also Jakub's point in a sibling
> >>> >> >> thread somewhere).
> >>> >> >
> >>> >> > imo the overhead is tiny enough that we can wait until
> >>> >> > generic 'kfunc inlining' infra is ready.
> >>> >> >
> >>> >> > We're planning to dual-compile some_kernel_file.c
> >>> >> > into native arch and into bpf arch.
> >>> >> > Then the verifier will automatically inline bpf asm
> >>> >> > of corresponding kfunc.
> >>> >>
> >>> >> Is that "planning" or "actively working on"? Just trying to get a sense
> >>> >> of the time frames here, as this sounds neat, but also something that
> >>> >> could potentially require quite a bit of fiddling with the build system
> >>> >> to get to work? :)
> >>> >
> >>> > "planning", but regardless how long it takes I'd rather not
> >>> > add any more tech debt in the form of manual bpf asm generation.
> >>> > We have too much of it already: gen_lookup, convert_ctx_access, etc.
> >>>
> >>> Right, I'm no fan of the manual ASM stuff either. However, if we're
> >>> stuck with the function call overhead for the foreseeable future, maybe
> >>> we should think about other ways of cutting down the number of function
> >>> calls needed?
> >>>
> >>> One thing I can think of is to get rid of the individual _supported()
> >>> kfuncs and instead have a single one that lets you query multiple
> >>> features at once, like:
> >>>
> >>> __u64 features_supported, features_wanted = XDP_META_RX_HASH |
> >>> XDP_META_TIMESTAMP;
> >>>
> >>> features_supported = bpf_xdp_metadata_query_features(ctx,
> >>> features_wanted);
> >>>
> >>> if (features_supported & XDP_META_RX_HASH)
> >>>   hash = bpf_xdp_metadata_rx_hash(ctx);
> >>>
> >>> ...etc
> >>
> >> I'm not too happy about having the bitmasks tbh :-(
> >> If we want to get rid of the cost of those _supported calls, maybe we
> >> can do some kind of libbpf-like probing? That would require loading a
> >> program + waiting for some packet though :-(
> >>
> >> Or maybe they can just be cached for now?
> >>
> >> if (unlikely(!got_first_packet)) {
> >>  have_hash = bpf_xdp_metadata_rx_hash_supported();
> >>  have_timestamp = bpf_xdp_metadata_rx_timestamp_supported();
> >>  got_first_packet = true;
> >> }
> >
> > hash/timestamp/csum is per packet .. vlan as well depending how you look at
> > it..
>
> True, we cannot cache this as it is *per packet* info.
>
> > Sorry I haven't been following the progress of xdp meta data, but why did
> > we drop the idea of btf and driver copying metadata in front of the xdp
> > frame ?
> >
>
> It took me some time to understand this new approach, and why it makes
> sense.  This is my understanding of the design direction change:
>
> This approach gives more control to the XDP BPF-prog to pick and choose
> which XDP hints are relevant for the specific use-case.  BPF-prog can
> also skip storing hints anywhere and just read+react on value (that e.g.
> comes from RX-desc).
>
> For the use-cases redirect, AF_XDP, chained BPF-progs, XDP-to-TC,
> SKB-creation, we *do* need to store hints somewhere, as RX-desc will be
> out-of-scope.  This patchset hand-waves and says BPF-prog can just
> manually store this in a prog custom layout in metadata area.  I'm not
> super happy with ignoring/hand-waving all these use-cases, but I
> hope/think we later can extend this with some more structure to support
> these use-cases better (with this patchset as a foundation).
>
> I actually like this kfunc design, because the BPF-prog's get an
> intuitive API, and on driver side we can hide the details of how to
> extract the HW hints.
>
>
> > hopefully future HW generations will do that for free ..
>
> True.  I think it is worth repeating, that the approach of storing HW
> hints in metadata area (in-front of packet data) was to allow future HW
> generations to write this.  Thus, eliminating the 6 ns (that I showed it
> cost), and then it would be up-to XDP BPF-prog to pick and choose which
> to read, like this patchset already offers.

As a hope for future generations of hw, being able to choose a cpu to interrupt
from a LPM table would be great. I keep hoping to find a card that can
do this already...

Also I would like to thank everyone working on this project so far for
what you've accomplished. We're now pushing 20Gbit (through a vlan even) through
libreqos.io for thousands of ISP subscribers using all this great stuff, on
16 cores at only 24% of cpu through CAKE and also successfully monitoring
TCP RTTs at this scale via ebpf pping.

( https://www.yahoo.com/now/libreqoe-releases-version-1-3-214700756.html )
"Our hat is off to the creators of CAKE and the new Linux XDP and eBPF
subsystems!"

In our case, timestamp, and *3* hashes, are needed for cake, and interrupting
the right cpu would be great...

>
> This patchset isn't incompatible with future HW generations doing this,
> as the kfunc would hide the details and point to this area instead of
> the RX-desc.  While we get the "store for free" from hardware, I do
> worry that reading this memory area (which will be part of the DMA area) is
> going to be slower than reading from RX-desc.
>
> > if btf is the problem then each vendor can provide a bpf func(s) that would
> > parse the metadata inside of the xdp/bpf prog domain to help programs
> > extract the vendor specific data..
> >
>
> In some sense, if unrolling becomes a thing, then this patchset is
> partly doing this.
>
> I did imagine that after/followup on XDP-hints with BTF patchset, we
> would allow drivers to load a BPF-prog that changed/selected which HW
> hints were relevant, to reduce those 6 ns overhead we introduced.
>
> --Jesper
>

-- 
This song goes out to all the folk that thought Stadia would work:
https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
Dave Täht CEO, TekLibre, LLC