On 11/09/2024 20.53, Daniel Xu wrote:
On Wed, Sep 11, 2024 at 10:32:56AM GMT, Jesper Dangaard Brouer wrote:
On 11/09/2024 06.43, Daniel Xu wrote:
[cc Jesper]
On Tue, Sep 10, 2024, at 8:31 PM, Daniel Xu wrote:
On Tue, Sep 10, 2024 at 05:39:55PM GMT, Andrii Nakryiko wrote:
On Tue, Sep 10, 2024 at 4:44 PM Daniel Xu <dxu@xxxxxxxxx> wrote:
On Tue, Sep 10, 2024 at 03:21:04PM GMT, Andrii Nakryiko wrote:
On Tue, Sep 10, 2024 at 3:16 PM Daniel Xu <dxu@xxxxxxxxx> wrote:
[...cut...]
Can you give us a bit more details on what
you are trying to achieve?
BPF cpumap, under the hood, has one MPSC ring buffer (ptr_ring) for each
entry in the cpumap. When a prog redirects to an entry in the cpumap,
the machinery queues up the xdp frame onto the destination CPU ptr_ring.
This can occur on any cpu, thus multi-producer. On processing side,
there is only the kthread created by the cpumap entry and bound to the
specific cpu that is consuming entries. So single consumer.
An important detail: to get Multi-Producer (MP) to scale the CPUMAP does
bulk enqueue into the ptr_ring. It stores the xdp_frame's in a per-CPU
array and does the flush/enqueue as part of the xdp_do_flush(). Because
I was afraid of this adding latency, I choose to also flush every 8
frames (CPU_MAP_BULK_SIZE).
Looking at code I see this is also explained in a comment:
/* General idea: XDP packets getting XDP redirected to another CPU,
* will maximum be stored/queued for one driver ->poll() call. It is
* guaranteed that queueing the frame and the flush operation happen on
* same CPU. Thus, cpu_map_flush operation can deduct via this_cpu_ptr()
* which queue in bpf_cpu_map_entry contains packets.
*/
Goal is to track the latency overhead added from ptr_ring and the
kthread (versus softirq where is less overhead). Ideally we want p50,
p90, p95, p99 percentiles.
I'm very interesting in this use-case of understanding the latency of
CPUMAP.
I'm a fan of latency histograms that I turn into heatmaps in grafana.
To do this, we need to track every single entry enqueue time as well as
dequeue time - events that occur in the tail are quite important.
Since ptr_ring is also a ring buffer, I thought it would be easy,
reliable, and fast to just create a "shadow" ring buffer. Every time
producer enqueues entries, I'd enqueue the same number of current
timestamp onto shadow RB. Same thing on consumer side, except dequeue
and calculate timestamp delta.
This idea seems overkill and will likely produce unreliable results.
E.g. the overhead of this additional ring buffer will also affect the
measurements.
Yeah, good point.
I was originally planning on writing my own lockless ring buffer in pure
BPF (b/c spinlocks cannot be used w/ tracepoints yet) but was hoping I
could avoid that with this patch.
[...]
Alternatively, could add a u64 timestamp to xdp_frame, which makes all
this tracking inline (and thus more reliable). But I'm not sure how precious
the space in that struct is - I see some references online saying most drivers
save 128B headroom. I also see:
#define XDP_PACKET_HEADROOM 256
I like the inline idea. I would suggest to add u64 timestamp into
XDP-metadata area (ctx->data_meta code example[1]) , when XDP runs in
RX-NAPI. Then at the remote CPU you can run another CPUMAP-XDP program that
pickup this timestamp, and then calc a delta from "now" timestamp.
[1] https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L62-L77
Cool! This is a much better idea than mine :)
I'll give this a try.
Could probably amortize the timestamp read by setting it in
bq_flush_to_queue().
To amortize, consider that you might not need to timestamp EVERY packet to
get sufficient statistics on the latency.
Regarding bq_flush_to_queue() and the enqueue tracepoint:
trace_xdp_cpumap_enqueue(rcpu->map_id, processed, drops, to_cpu)
I have an idea for you, on how to measure the latency overhead from XDP
RX-processing to when enqueue "flush" happens. It is a little tricky to
explain, so I will outline the steps.
1. XDP bpf_prog store timestamp in per-CPU array,
unless timestamp is already set.
2. trace_xdp_cpumap_enqueue bpf_prog reads per-CPU timestamp
and calc latency diff, and clears timestamp.
This measures the latency overhead of bulk enqueue. (Notice: Only the
first XDP redirect frame after a bq_flush_to_queue() will set the
timestamp). This per-CPU store should work as this all runs under same
RX-NAPI "poll" execution.
Makes sense to me. This breaks down the latency even further. I'll keep
it in mind if we need further troubleshooting.
Yes, this breaks down the latency even further :-)
This latency overhead of bulk enqueue, will (unfortunately) also
count/measure the XDP_PASS packets that gets processed by the normal
netstack. So, watch out for this. e.g could have XDP actions (e.g
XDP_PASS) counters as part of step 1, and have statistic for cases where
XDP_PASS interfered.
Not sure I got this. If we only set the percpu timestamp for
XDP_REDIRECT frames, then I don't see how XDP_PASS interferes. Maybe I
misunderstand something.
Not quite. You only set timestamp for the first XDP_REDIRECT frames.
I'm simply saying, all the packets processed *after* this timestamp will
attribute to the time it takes until trace_xdp_cpumap_enqueue() runs.
That part should be obvious. Then, I'm saying remember that an XDP_PASS
packet takes more time than a XDP_REDIRECT packet. I hope this makes it
more clear. Point: This is a pitfall you need to be aware of when
looking at your metrics.
For the inline timestamping same pitfall actually applies. There you
timestamp the packets themselves. Because the CPUMAP enqueue happens as
the *LAST* thing in NAPI loop, during xdp_do_flush() call. This means
that interleaved normal netstack (XDP_PASS) packets will be processed
BEFORE this call to xdp_do_flush(). As noted earlier, to compensate for
this effect, code will enqueue-flush every 8 frames (CPU_MAP_BULK_SIZE).
I hope, I've not confused you more :-|
--Jesper