Stanislav Fomichev <sdf@xxxxxxxxxx> writes:

> On Thu, Dec 8, 2022 at 4:54 PM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
>>
>> Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> writes:
>>
>> > On Thu, Dec 8, 2022 at 4:29 PM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
>> >>
>> >> Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> writes:
>> >>
>> >> > On Thu, Dec 8, 2022 at 4:02 PM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
>> >> >>
>> >> >> Stanislav Fomichev <sdf@xxxxxxxxxx> writes:
>> >> >>
>> >> >> > On Thu, Dec 8, 2022 at 2:59 PM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
>> >> >> >>
>> >> >> >> Stanislav Fomichev <sdf@xxxxxxxxxx> writes:
>> >> >> >>
>> >> >> >> > From: Toke Høiland-Jørgensen <toke@xxxxxxxxxx>
>> >> >> >> >
>> >> >> >> > Support RX hash and timestamp metadata kfuncs. We need to pass in the cqe
>> >> >> >> > pointer to the mlx5e_skb_from* functions so it can be retrieved from the
>> >> >> >> > XDP ctx to do this.
>> >> >> >>
>> >> >> >> So I finally managed to get enough ducks in row to actually benchmark
>> >> >> >> this. With the caveat that I suddenly can't get the timestamp support to
>> >> >> >> work (it was working in an earlier version, but now
>> >> >> >> timestamp_supported() just returns false). I'm not sure if this is an
>> >> >> >> issue with the enablement patch, or if I just haven't gotten the
>> >> >> >> hardware configured properly. I'll investigate some more, but figured
>> >> >> >> I'd post these results now:
>> >> >> >>
>> >> >> >> Baseline XDP_DROP:         25,678,262 pps / 38.94 ns/pkt
>> >> >> >> XDP_DROP + read metadata:  23,924,109 pps / 41.80 ns/pkt
>> >> >> >> Overhead:                   1,754,153 pps /  2.86 ns/pkt
>> >> >> >>
>> >> >> >> As per the above, this is with calling three kfuncs/pkt
>> >> >> >> (metadata_supported(), rx_hash_supported() and rx_hash()). So that's
>> >> >> >> ~0.95 ns per function call, which is a bit less, but not far off from
>> >> >> >> the ~1.2 ns that I'm used to.
>> >> >> >> The tests where I accidentally called the
>> >> >> >> default kfuncs cut off ~1.3 ns for one less kfunc call, so it's
>> >> >> >> definitely in that ballpark.
>> >> >> >>
>> >> >> >> I'm not doing anything with the data, just reading it into an on-stack
>> >> >> >> buffer, so this is the smallest possible delta from just getting the
>> >> >> >> data out of the driver. I did confirm that the call instructions are
>> >> >> >> still in the BPF program bytecode when it's dumped back out from the
>> >> >> >> kernel.
>> >> >> >>
>> >> >> >> -Toke
>> >> >> >>
>> >> >> >
>> >> >> > Oh, that's great, thanks for running the numbers! Will definitely
>> >> >> > reference them in v4!
>> >> >> > Presumably, we should be able to at least unroll most of the
>> >> >> > _supported callbacks if we want, they should be relatively easy; but
>> >> >> > the numbers look fine as is?
>> >> >>
>> >> >> Well, this is for one (and a half) piece of metadata. If we extrapolate
>> >> >> it adds up quickly. Say we add csum and vlan tags, say, and maybe
>> >> >> another callback to get the type of hash (l3/l4). Those would probably
>> >> >> be relevant for most packets in a fairly common setup. Extrapolating
>> >> >> from the ~1 ns/call figure, that's 8 ns/pkt, which is 20% of the
>> >> >> baseline of 39 ns.
>> >> >>
>> >> >> So in that sense I still think unrolling makes sense. At least for the
>> >> >> _supported() calls, as eating a whole function call just for that is
>> >> >> probably a bit much (which I think was also Jakub's point in a sibling
>> >> >> thread somewhere).
>> >> >
>> >> > imo the overhead is tiny enough that we can wait until
>> >> > generic 'kfunc inlining' infra is ready.
>> >> >
>> >> > We're planning to dual-compile some_kernel_file.c
>> >> > into native arch and into bpf arch.
>> >> > Then the verifier will automatically inline bpf asm
>> >> > of corresponding kfunc.
>> >>
>> >> Is that "planning" or "actively working on"?
>> >> Just trying to get a sense
>> >> of the time frames here, as this sounds neat, but also something that
>> >> could potentially require quite a bit of fiddling with the build system
>> >> to get to work? :)
>> >
>> > "planning", but regardless how long it takes I'd rather not
>> > add any more tech debt in the form of manual bpf asm generation.
>> > We have too much of it already: gen_lookup, convert_ctx_access, etc.
>>
>> Right, I'm no fan of the manual ASM stuff either. However, if we're
>> stuck with the function call overhead for the foreseeable future, maybe
>> we should think about other ways of cutting down the number of function
>> calls needed?
>>
>> One thing I can think of is to get rid of the individual _supported()
>> kfuncs and instead have a single one that lets you query multiple
>> features at once, like:
>>
>> __u64 features_supported, features_wanted = XDP_META_RX_HASH | XDP_META_TIMESTAMP;
>>
>> features_supported = bpf_xdp_metadata_query_features(ctx, features_wanted);
>>
>> if (features_supported & XDP_META_RX_HASH)
>>   hash = bpf_xdp_metadata_rx_hash(ctx);
>>
>> ...etc
>
> I'm not too happy about having the bitmasks tbh :-(
> If we want to get rid of the cost of those _supported calls, maybe we
> can do some kind of libbpf-like probing? That would require loading a
> program + waiting for some packet though :-(

If we expect the program to do out of band probing, we could just get
rid of the _supported() functions entirely?

I mean, to me, the whole point of having the separate _supported()
function for each item was to have a lower-overhead way of checking if
the metadata item was supported. But if the overhead is not actually
lower (because both incur a function call), why have them at all?
Then we could just change the implementation from this:

bool mlx5e_xdp_rx_hash_supported(const struct xdp_md *ctx)
{
	const struct mlx5_xdp_buff *_ctx = (void *)ctx;

	return _ctx->xdp.rxq->dev->features & NETIF_F_RXHASH;
}

u32 mlx5e_xdp_rx_hash(const struct xdp_md *ctx)
{
	const struct mlx5_xdp_buff *_ctx = (void *)ctx;

	return be32_to_cpu(_ctx->cqe->rss_hash_result);
}

to this:

u32 mlx5e_xdp_rx_hash(const struct xdp_md *ctx)
{
	const struct mlx5_xdp_buff *_ctx = (void *)ctx;

	if (!(_ctx->xdp.rxq->dev->features & NETIF_F_RXHASH))
		return 0;

	return be32_to_cpu(_ctx->cqe->rss_hash_result);
}

-Toke
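
For anyone who wants to poke at the two shapes outside the kernel, here is
a rough userspace mock of the split and merged variants above. It is only
a sketch: the mock_* structs, MOCK_F_RXHASH and all function names are
hypothetical stand-ins (not the mlx5 or netdev definitions), and the hash
is kept in host byte order so no be32_to_cpu() equivalent is needed.

```c
/* Userspace mock, NOT kernel code. Every type, constant and function
 * name below is a hypothetical stand-in for illustration only. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MOCK_F_RXHASH (1ULL << 0)	/* stand-in for NETIF_F_RXHASH */

struct mock_dev { uint64_t features; };
struct mock_cqe { uint32_t rss_hash_result; };	/* host order in this mock */
struct mock_xdp_buff {
	const struct mock_dev *dev;
	const struct mock_cqe *cqe;
};

/* Split variant, part 1: a dedicated "is it supported?" call. */
static bool mock_rx_hash_supported(const struct mock_xdp_buff *ctx)
{
	return (ctx->dev->features & MOCK_F_RXHASH) != 0;
}

/* Split variant, part 2: unconditional read of the hash. */
static uint32_t mock_rx_hash_split(const struct mock_xdp_buff *ctx)
{
	return ctx->cqe->rss_hash_result;
}

/* Merged variant: the feature check is folded into the getter, so the
 * caller makes one call per packet instead of two. */
static uint32_t mock_rx_hash_merged(const struct mock_xdp_buff *ctx)
{
	if (!(ctx->dev->features & MOCK_F_RXHASH))
		return 0;

	return ctx->cqe->rss_hash_result;
}

/* The merged getter should match "supported ? split : 0" for any ctx. */
static void mock_check_equivalent(const struct mock_xdp_buff *ctx)
{
	uint32_t expected = mock_rx_hash_supported(ctx) ?
			    mock_rx_hash_split(ctx) : 0;

	assert(mock_rx_hash_merged(ctx) == expected);
}
```

One thing the mock makes visible: with the merged shape, a return value of
0 can mean either "not supported" or "the hash really was 0", so a caller
that cares about the difference would need some other way to tell.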