Re: [PATCH] bpf: call get_random_u32() for random integers

Toke Høiland-Jørgensen <toke@xxxxxxxxxx> · Tue, 06 Dec 2022 14:26:38 +0100

"Jason A. Donenfeld" <Jason@xxxxxxxxx> writes:

> Hi Toke,
>
> On Tue, Dec 6, 2022 at 1:50 PM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
>>
>> "Jason A. Donenfeld" <Jason@xxxxxxxxx> writes:
>>
>> > On Mon, Dec 05, 2022 at 11:21:51PM +0100, Daniel Borkmann wrote:
>> >> On 12/5/22 7:15 PM, Jason A. Donenfeld wrote:
>> >> > Since BPF's bpf_user_rnd_u32() was introduced, there have been three
>> >> > significant developments in the RNG: 1) get_random_u32() returns the
>> >> > same types of bytes as /dev/urandom, eliminating the distinction between
>> >> > "kernel random bytes" and "userspace random bytes", 2) get_random_u32()
>> >> > operates mostly locklessly over percpu state, 3) get_random_u32() has
>> >> > become quite fast.
>> >>
>> >> Wrt "quite fast", do you have a comparison between the two? Asking as its
>> >> often used in networking worst case on per packet basis (e.g. via XDP), would
>> >> be useful to state concrete numbers for the two on a given machine.
>> >
>> > Median of 25 cycles vs median of 38, on my Tiger Lake machine. So a
>> > little slower, but too small of a difference to matter.
>>
>> Assuming a 3Ghz CPU clock (so 3 cycles per nanosecond), that's an
>> additional overhead of ~4.3 ns. When processing 10 Gbps at line rate
>> with small packets, the per-packet processing budget is 67.2 ns, so
>> those extra 4.3 ns will eat up ~6.4% of the budget.
>>
>> So in other words, "too small a difference to matter" is definitely not
>> true in general. It really depends on the use case; if someone is using
>> this to, say, draw per-packet random numbers to compute a drop frequency
>> on ingress, that extra processing time will most likely result in a
>> quite measurable drop in performance.
>
> Huh, neat calculation, I'll keep that method in mind.

Yeah, I find that thinking in "time budget per packet" helps a lot! For
completeness, the 67.2 ns comes from 10 Gbps / 84 bytes (that's 64-byte
packets + 20 bytes of preamble and inter-packet gap), which is 14.8
Mpps, or 67.2 ns/pkt.

> Alright, sorry for the noise here. I'll check back in if I ever manage
> to eliminate that performance gap.

One approach that we use generally for XDP (and which may or may not
help for this case) is that we try to amortise as much fixed cost as we
can over a batch of packets. Because XDP processing always happens in
the NAPI receive poll loop, we have a natural batching mechanism, and we
know that for that whole loop we'll keep running on the same CPU.

So for instance, if there's a large fixed component of the overhead of
get_random_u32(), we could have bpf_user_rnd_u32() populate a larger
per-CPU buffer and then just emit u32 chunks of that as long as we're
still in the same NAPI loop as the first call. Or something to that
effect. Not sure if this makes sense for this use case, but figured I'd
throw the idea out there :)

-Toke