On Thu, Dec 8, 2022 at 4:29 PM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
>
> Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> writes:
>
> > On Thu, Dec 8, 2022 at 4:02 PM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
> >>
> >> Stanislav Fomichev <sdf@xxxxxxxxxx> writes:
> >>
> >> > On Thu, Dec 8, 2022 at 2:59 PM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
> >> >>
> >> >> Stanislav Fomichev <sdf@xxxxxxxxxx> writes:
> >> >>
> >> >> > From: Toke Høiland-Jørgensen <toke@xxxxxxxxxx>
> >> >> >
> >> >> > Support RX hash and timestamp metadata kfuncs. We need to pass in the cqe
> >> >> > pointer to the mlx5e_skb_from* functions so it can be retrieved from the
> >> >> > XDP ctx to do this.
> >> >>
> >> >> So I finally managed to get enough ducks in row to actually benchmark
> >> >> this. With the caveat that I suddenly can't get the timestamp support to
> >> >> work (it was working in an earlier version, but now
> >> >> timestamp_supported() just returns false). I'm not sure if this is an
> >> >> issue with the enablement patch, or if I just haven't gotten the
> >> >> hardware configured properly. I'll investigate some more, but figured
> >> >> I'd post these results now:
> >> >>
> >> >> Baseline XDP_DROP:         25,678,262 pps / 38.94 ns/pkt
> >> >> XDP_DROP + read metadata:  23,924,109 pps / 41.80 ns/pkt
> >> >> Overhead:                   1,754,153 pps /  2.86 ns/pkt
> >> >>
> >> >> As per the above, this is with calling three kfuncs/pkt
> >> >> (metadata_supported(), rx_hash_supported() and rx_hash()). So that's
> >> >> ~0.95 ns per function call, which is a bit less, but not far off from
> >> >> the ~1.2 ns that I'm used to. The tests where I accidentally called the
> >> >> default kfuncs cut off ~1.3 ns for one less kfunc call, so it's
> >> >> definitely in that ballpark.
> >> >>
> >> >> I'm not doing anything with the data, just reading it into an on-stack
> >> >> buffer, so this is the smallest possible delta from just getting the
> >> >> data out of the driver. I did confirm that the call instructions are
> >> >> still in the BPF program bytecode when it's dumped back out from the
> >> >> kernel.
> >> >>
> >> >> -Toke
> >> >>
> >> >
> >> > Oh, that's great, thanks for running the numbers! Will definitely
> >> > reference them in v4!
> >> > Presumably, we should be able to at least unroll most of the
> >> > _supported callbacks if we want, they should be relatively easy; but
> >> > the numbers look fine as is?
> >>
> >> Well, this is for one (and a half) piece of metadata. If we extrapolate
> >> it adds up quickly. Say we add csum and vlan tags, say, and maybe
> >> another callback to get the type of hash (l3/l4). Those would probably
> >> be relevant for most packets in a fairly common setup. Extrapolating
> >> from the ~1 ns/call figure, that's 8 ns/pkt, which is 20% of the
> >> baseline of 39 ns.
> >>
> >> So in that sense I still think unrolling makes sense. At least for the
> >> _supported() calls, as eating a whole function call just for that is
> >> probably a bit much (which I think was also Jakub's point in a sibling
> >> thread somewhere).
> >
> > imo the overhead is tiny enough that we can wait until
> > generic 'kfunc inlining' infra is ready.
> >
> > We're planning to dual-compile some_kernel_file.c
> > into native arch and into bpf arch.
> > Then the verifier will automatically inline bpf asm
> > of corresponding kfunc.
>
> Is that "planning" or "actively working on"? Just trying to get a sense
> of the time frames here, as this sounds neat, but also something that
> could potentially require quite a bit of fiddling with the build system
> to get to work? :)

"planning", but regardless how long it takes I'd rather not add
any more tech debt in the form of manual bpf asm generation.
We have too much of it already: gen_lookup, convert_ctx_access, etc.
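
For reference, the arithmetic behind the benchmark figures quoted above can be reproduced with a short Python sketch. It converts the two measured packet rates to per-packet cost, derives the per-kfunc overhead, and repeats the extrapolation to 8 ns/pkt. The function and variable names are illustrative only. The count of eight calls is an assumption inferred from the "8 ns/pkt at ~1 ns/call" statement; the thread does not spell out the exact breakdown.

```python
# Sketch of the benchmark arithmetic quoted in the thread.
# All names here are illustrative; only the pps figures, the three-kfunc
# count and the ~1 ns/call estimate come from the quoted emails.

NS_PER_SEC = 1_000_000_000


def ns_per_packet(pps: int) -> float:
    """Convert a packets-per-second rate into nanoseconds per packet."""
    return NS_PER_SEC / pps


# Measured rates from the quoted results.
baseline_pps = 25_678_262   # plain XDP_DROP
metadata_pps = 23_924_109   # XDP_DROP + reading metadata via kfuncs
kfunc_calls = 3             # metadata_supported(), rx_hash_supported(), rx_hash()

baseline_ns = ns_per_packet(baseline_pps)
metadata_ns = ns_per_packet(metadata_pps)
overhead_ns = metadata_ns - baseline_ns
per_call_ns = overhead_ns / kfunc_calls

# Extrapolation: ASSUMED eight kfunc calls per packet at roughly 1 ns each,
# matching the "8 ns/pkt" estimate in the thread.
assumed_calls = 8
assumed_ns_per_call = 1.0
extrapolated_ns = assumed_calls * assumed_ns_per_call
share_of_baseline = extrapolated_ns / baseline_ns

print(f"baseline:      {baseline_ns:.2f} ns/pkt")
print(f"with metadata: {metadata_ns:.2f} ns/pkt")
print(f"overhead:      {overhead_ns:.2f} ns/pkt")
print(f"per kfunc:     {per_call_ns:.2f} ns/call")
print(f"extrapolated:  {extrapolated_ns:.0f} ns/pkt "
      f"({share_of_baseline:.0%} of baseline)")
```

Note that the per-packet overhead is computed as a difference of per-packet times, not as the reciprocal of the pps difference; the two are not interchangeable.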