Re: [xdp-hints] Re: [PATCH bpf-next v3 11/12] mlx5: Support RX XDP metadata

Toke Høiland-Jørgensen <toke@xxxxxxxxxx> · Fri, 09 Dec 2022 01:02:41 +0100

Stanislav Fomichev <sdf@xxxxxxxxxx> writes:

> On Thu, Dec 8, 2022 at 2:59 PM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
>>
>> Stanislav Fomichev <sdf@xxxxxxxxxx> writes:
>>
>> > From: Toke Høiland-Jørgensen <toke@xxxxxxxxxx>
>> >
>> > Support RX hash and timestamp metadata kfuncs. We need to pass in the cqe
>> > pointer to the mlx5e_skb_from* functions so it can be retrieved from the
>> > XDP ctx to do this.
>>
>> So I finally managed to get enough ducks in a row to actually benchmark
>> this. With the caveat that I suddenly can't get the timestamp support to
>> work (it was working in an earlier version, but now
>> timestamp_supported() just returns false). I'm not sure if this is an
>> issue with the enablement patch, or if I just haven't gotten the
>> hardware configured properly. I'll investigate some more, but figured
>> I'd post these results now:
>>
>> Baseline XDP_DROP:         25,678,262 pps / 38.94 ns/pkt
>> XDP_DROP + read metadata:  23,924,109 pps / 41.80 ns/pkt
>> Overhead:                   1,754,153 pps /  2.86 ns/pkt
>>
>> As per the above, this is with calling three kfuncs/pkt
>> (metadata_supported(), rx_hash_supported() and rx_hash()). So that's
>> ~0.95 ns per function call, which is a bit less, but not far off from
>> the ~1.2 ns that I'm used to. The tests where I accidentally called the
>> default kfuncs cut off ~1.3 ns for one less kfunc call, so it's
>> definitely in that ballpark.
>>
>> I'm not doing anything with the data, just reading it into an on-stack
>> buffer, so this is the smallest possible delta from just getting the
>> data out of the driver. I did confirm that the call instructions are
>> still in the BPF program bytecode when it's dumped back out from the
>> kernel.
>>
>> -Toke
>>
>
> Oh, that's great, thanks for running the numbers! Will definitely
> reference them in v4!
> Presumably, we should be able to at least unroll most of the
> _supported callbacks if we want, they should be relatively easy; but
> the numbers look fine as is?

Well, this is for one (and a half) piece of metadata. If we extrapolate,
it adds up quickly. Say we add csum and vlan tags, and maybe another
callback to get the type of hash (l3/l4). Those would probably be
relevant for most packets in a fairly common setup. Extrapolating from
the ~1 ns/call figure, that's 8 ns/pkt, which is 20% of the baseline of
39 ns.
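
For anyone who wants to redo the arithmetic, here is a rough sketch in
Python. The pps figures and the three-kfunc count are taken from the
benchmark quoted above. The eight-call count is an assumption: it is
simply what the 8 ns/pkt figure implies at ~1 ns/call, not a measured
or agreed-upon set of kfuncs.

```python
# Figures copied from the benchmark results quoted above.
BASELINE_PPS = 25_678_262   # XDP_DROP only
METADATA_PPS = 23_924_109   # XDP_DROP + read metadata
KFUNCS_PER_PKT = 3          # metadata_supported(), rx_hash_supported(), rx_hash()

# Assumption: number of kfunc calls per packet once csum, vlan and hash
# type are added; implied by 8 ns/pkt at ~1 ns/call, not measured.
ASSUMED_CALLS = 8
ASSUMED_NS_PER_CALL = 1.0


def ns_per_pkt(pps: int) -> float:
    """Convert a packets-per-second rate into nanoseconds per packet."""
    return 1e9 / pps


baseline_ns = ns_per_pkt(BASELINE_PPS)
overhead_ns = ns_per_pkt(METADATA_PPS) - baseline_ns
per_call_ns = overhead_ns / KFUNCS_PER_PKT

projected_ns = ASSUMED_CALLS * ASSUMED_NS_PER_CALL
projected_share = projected_ns / baseline_ns

print(f"baseline:  {baseline_ns:.2f} ns/pkt")
print(f"overhead:  {overhead_ns:.2f} ns/pkt ({per_call_ns:.2f} ns/call)")
print(f"projected: {projected_ns:.1f} ns/pkt ({projected_share:.0%} of baseline)")
```

Changing ASSUMED_CALLS shows how quickly the per-packet cost grows as
more metadata fields are read.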

So in that sense I still think unrolling makes sense. At least for the
_supported() calls, as eating a whole function call just for that is
probably a bit much (which I think was also Jakub's point in a sibling
thread somewhere).

-Toke