On Fri, Dec 9, 2022 at 5:29 AM Jesper Dangaard Brouer
<jbrouer@xxxxxxxxxx> wrote:
>
>
> On 09/12/2022 06.24, Saeed Mahameed wrote:
> > On 08 Dec 18:57, Stanislav Fomichev wrote:
> >> On Thu, Dec 8, 2022 at 4:54 PM Toke Høiland-Jørgensen
> >> <toke@xxxxxxxxxx> wrote:
> >>>
> >>> Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> writes:
> >>>
> >>> > On Thu, Dec 8, 2022 at 4:29 PM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
> >>> >>
> >>> >> Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> writes:
> >>> >>
> >>> >> > On Thu, Dec 8, 2022 at 4:02 PM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
> >>> >> >>
> >>> >> >> Stanislav Fomichev <sdf@xxxxxxxxxx> writes:
> >>> >> >>
> >>> >> >> > On Thu, Dec 8, 2022 at 2:59 PM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
> >>> >> >> >>
> >>> >> >> >> Stanislav Fomichev <sdf@xxxxxxxxxx> writes:
> >>> >> >> >>
> >>> >> >> >> > From: Toke Høiland-Jørgensen <toke@xxxxxxxxxx>
> >>> >> >> >> >
> >>> >> >> >> > Support RX hash and timestamp metadata kfuncs. We need to pass in the cqe
> >>> >> >> >> > pointer to the mlx5e_skb_from* functions so it can be retrieved from the
> >>> >> >> >> > XDP ctx to do this.
> >>> >> >> >>
> >>> >> >> >> So I finally managed to get enough ducks in row to actually benchmark
> >>> >> >> >> this. With the caveat that I suddenly can't get the timestamp support to
> >>> >> >> >> work (it was working in an earlier version, but now
> >>> >> >> >> timestamp_supported() just returns false). I'm not sure if this is an
> >>> >> >> >> issue with the enablement patch, or if I just haven't gotten the
> >>> >> >> >> hardware configured properly.
> >>> >> >> >> I'll investigate some more, but figured
> >>> >> >> >> I'd post these results now:
> >>> >> >> >>
> >>> >> >> >> Baseline XDP_DROP:         25,678,262 pps / 38.94 ns/pkt
> >>> >> >> >> XDP_DROP + read metadata:  23,924,109 pps / 41.80 ns/pkt
> >>> >> >> >> Overhead:                   1,754,153 pps /  2.86 ns/pkt
> >>> >> >> >>
> >>> >> >> >> As per the above, this is with calling three kfuncs/pkt
> >>> >> >> >> (metadata_supported(), rx_hash_supported() and rx_hash()). So that's
> >>> >> >> >> ~0.95 ns per function call, which is a bit less, but not far off from
> >>> >> >> >> the ~1.2 ns that I'm used to. The tests where I accidentally called the
> >>> >> >> >> default kfuncs cut off ~1.3 ns for one less kfunc call, so it's
> >>> >> >> >> definitely in that ballpark.
> >>> >> >> >>
> >>> >> >> >> I'm not doing anything with the data, just reading it into an on-stack
> >>> >> >> >> buffer, so this is the smallest possible delta from just getting the
> >>> >> >> >> data out of the driver. I did confirm that the call instructions are
> >>> >> >> >> still in the BPF program bytecode when it's dumped back out from the
> >>> >> >> >> kernel.
> >>> >> >> >>
> >>> >> >> >> -Toke
> >>> >> >> >>
> >>> >> >> >
> >>> >> >> > Oh, that's great, thanks for running the numbers! Will definitely
> >>> >> >> > reference them in v4!
> >>> >> >> > Presumably, we should be able to at least unroll most of the
> >>> >> >> > _supported callbacks if we want, they should be relatively easy; but
> >>> >> >> > the numbers look fine as is?
> >>> >> >>
> >>> >> >> Well, this is for one (and a half) piece of metadata. If we extrapolate
> >>> >> >> it adds up quickly. Say we add csum and vlan tags, say, and maybe
> >>> >> >> another callback to get the type of hash (l3/l4). Those would probably
> >>> >> >> be relevant for most packets in a fairly common setup. Extrapolating
> >>> >> >> from the ~1 ns/call figure, that's 8 ns/pkt, which is 20% of the
> >>> >> >> baseline of 39 ns.
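
For anyone wanting to re-check the figures quoted above, here is a small
sketch in plain userspace C (not BPF, and not part of the patchset) that
turns the two measured packet rates into ns/pkt, the per-call cost, and
the extrapolated share of the baseline. The function names are made up
for illustration. The "8 calls at ~1 ns each" input is my reading of the
8 ns/pkt figure above, not something that was measured.

```c
#include <assert.h>

/* Per-packet cost in nanoseconds for a given rate in packets per second. */
double ns_per_pkt(double pps)
{
    assert(pps > 0.0);
    return 1e9 / pps;
}

/* Extra per-packet cost (ns) of the slower run relative to the baseline. */
double overhead_ns(double base_pps, double slow_pps)
{
    return ns_per_pkt(slow_pps) - ns_per_pkt(base_pps);
}

/* Average cost per kfunc call, given how many calls each packet makes. */
double ns_per_call(double base_pps, double slow_pps, int calls)
{
    assert(calls > 0);
    return overhead_ns(base_pps, slow_pps) / calls;
}

/*
 * Projected overhead as a fraction of the baseline per-packet cost.
 * Both 'calls' and 'call_ns' are assumptions supplied by the caller.
 */
double projected_fraction(double base_pps, int calls, double call_ns)
{
    return (calls * call_ns) / ns_per_pkt(base_pps);
}
```

With the rates from the table, ns_per_call(25678262.0, 23924109.0, 3)
comes out at roughly 0.95 ns, and projected_fraction(25678262.0, 8, 1.0)
at roughly 0.205, which matches the "20% of the baseline" estimate.
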
> >>> >> >>
> >>> >> >> So in that sense I still think unrolling makes sense. At least for the
> >>> >> >> _supported() calls, as eating a whole function call just for that is
> >>> >> >> probably a bit much (which I think was also Jakub's point in a sibling
> >>> >> >> thread somewhere).
> >>> >> >
> >>> >> > imo the overhead is tiny enough that we can wait until
> >>> >> > generic 'kfunc inlining' infra is ready.
> >>> >> >
> >>> >> > We're planning to dual-compile some_kernel_file.c
> >>> >> > into native arch and into bpf arch.
> >>> >> > Then the verifier will automatically inline bpf asm
> >>> >> > of corresponding kfunc.
> >>> >>
> >>> >> Is that "planning" or "actively working on"? Just trying to get a sense
> >>> >> of the time frames here, as this sounds neat, but also something that
> >>> >> could potentially require quite a bit of fiddling with the build system
> >>> >> to get to work? :)
> >>> >
> >>> > "planning", but regardless how long it takes I'd rather not
> >>> > add any more tech debt in the form of manual bpf asm generation.
> >>> > We have too much of it already: gen_lookup, convert_ctx_access, etc.
> >>>
> >>> Right, I'm no fan of the manual ASM stuff either. However, if we're
> >>> stuck with the function call overhead for the foreseeable future, maybe
> >>> we should think about other ways of cutting down the number of function
> >>> calls needed?
> >>>
> >>> One thing I can think of is to get rid of the individual _supported()
> >>> kfuncs and instead have a single one that lets you query multiple
> >>> features at once, like:
> >>>
> >>> __u64 features_supported, features_wanted = XDP_META_RX_HASH |
> >>>                                             XDP_META_TIMESTAMP;
> >>>
> >>> features_supported = bpf_xdp_metadata_query_features(ctx,
> >>>                                                      features_wanted);
> >>>
> >>> if (features_supported & XDP_META_RX_HASH)
> >>>         hash = bpf_xdp_metadata_rx_hash(ctx);
> >>>
> >>> ...etc
> >>
> >> I'm not too happy about having the bitmasks tbh :-(
> >> If we want to get rid of the cost of those _supported calls, maybe we
> >> can do some kind of libbpf-like probing? That would require loading a
> >> program + waiting for some packet though :-(
> >>
> >> Or maybe they can just be cached for now?
> >>
> >> if (unlikely(!got_first_packet)) {
> >>   have_hash = bpf_xdp_metadata_rx_hash_supported();
> >>   have_timestamp = bpf_xdp_metadata_rx_timestamp_supported();
> >>   got_first_packet = true;
> >> }
> >
> > hash/timestamp/csum is per packet .. vlan as well depending how you look at
> > it..
>
> True, we cannot cache this as it is *per packet* info.
>
> > Sorry I haven't been following the progress of xdp meta data, but why did
> > we drop the idea of btf and driver copying metadata in front of the xdp
> > frame ?
> >
>
> It took me some time to understand this new approach, and why it makes
> sense. This is my understanding of the design direction change:
>
> This approach gives more control to the XDP BPF-prog to pick and choose
> which XDP hints are relevant for the specific use-case. BPF-prog can
> also skip storing hints anywhere and just read+react on value (that e.g.
> comes from RX-desc).
>
> For the use-cases redirect, AF_XDP, chained BPF-progs, XDP-to-TC,
> SKB-creation, we *do* need to store hints somewhere, as RX-desc will be
> out-of-scope. I think this patchset hand-waves and says BPF-prog can just
> manually store this in a prog custom layout in metadata area.
> I'm not super happy with ignoring/hand-waving all these use-cases, but I
> hope/think we can later extend this with some more structure to support
> these use-cases better (with this patchset as a foundation).
>
> I actually like this kfunc design, because the BPF-progs get an
> intuitive API, and on the driver side we can hide the details of how to
> extract the HW hints.
>
>
> > hopefully future HW generations will do that for free ..
>
> True. I think it is worth repeating, that the approach of storing HW
> hints in metadata area (in-front of packet data) was to allow future HW
> generations to write this. Thus, eliminating the 6 ns (that I showed it
> cost), and then it would be up-to XDP BPF-prog to pick and choose which
> to read, like this patchset already offers.

As a hope for future generations of hw, being able to choose a cpu to
interrupt from an LPM table would be great. I keep hoping to find a
card that can do this already...

Also I would like to thank everyone working on this project so far for
what you've accomplished. We're now pushing 20Gbit (through a vlan
even) through libreqos.io for thousands of ISP subscribers using all
this great stuff, on 16 cores at only 24% of cpu through CAKE and also
successfully monitoring TCP RTTs at this scale via ebpf pping.

( https://www.yahoo.com/now/libreqoe-releases-version-1-3-214700756.html )

"Our hat is off to the creators of CAKE and the new Linux XDP and eBPF
subsystems!"

In our case, timestamp, and *3* hashes, are needed for cake, and
interrupting the right cpu would be great...

>
> This patchset isn't incompatible with future HW generations doing this,
> as the kfunc would hide the details and point to this area instead of
> the RX-desc. While we get the "store for free" from hardware, I do
> worry that reading this memory area (which will be part of the DMA area)
> is going to be slower than reading from RX-desc.
>
> > if btf is the problem then each vendor can provide a bpf func(s) that would
> > parse the metadata inside of the xdp/bpf prog domain to help programs
> > extract the vendor specific data..
> >
>
> In some sense, if unrolling becomes a thing, then this patchset is
> partly doing this.
>
> I did imagine that after/followup on XDP-hints with BTF patchset, we
> would allow drivers to load a BPF-prog that changed/selected which HW
> hints were relevant, to reduce the 6 ns overhead we introduced.
>
> --Jesper
>


-- 
This song goes out to all the folk that thought Stadia would work:
https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
Dave Täht CEO, TekLibre, LLC