On 11/09/2024 06.43, Daniel Xu wrote:
[cc Jesper]
On Tue, Sep 10, 2024, at 8:31 PM, Daniel Xu wrote:
On Tue, Sep 10, 2024 at 05:39:55PM GMT, Andrii Nakryiko wrote:
On Tue, Sep 10, 2024 at 4:44 PM Daniel Xu <dxu@xxxxxxxxx> wrote:
On Tue, Sep 10, 2024 at 03:21:04PM GMT, Andrii Nakryiko wrote:
On Tue, Sep 10, 2024 at 3:16 PM Daniel Xu <dxu@xxxxxxxxx> wrote:
[...cut...]
Can you give us a bit more details on what
you are trying to achieve?
BPF cpumap, under the hood, has one MPSC ring buffer (ptr_ring) for each
entry in the cpumap. When a prog redirects to an entry in the cpumap,
the machinery queues up the xdp frame onto the destination CPU ptr_ring.
This can occur on any cpu, thus multi-producer. On processing side,
there is only the kthread created by the cpumap entry and bound to the
specific cpu that is consuming entries. So single consumer.
An important detail: to get Multi-Producer (MP) to scale, the CPUMAP does
bulk enqueue into the ptr_ring. It stores the xdp_frames in a per-CPU
array and does the flush/enqueue as part of xdp_do_flush(). Because
I was afraid of this adding latency, I chose to also flush every 8
frames (CPU_MAP_BULK_SIZE).
Looking at code I see this is also explained in a comment:
/* General idea: XDP packets getting XDP redirected to another CPU,
* will maximum be stored/queued for one driver ->poll() call. It is
* guaranteed that queueing the frame and the flush operation happen on
* same CPU. Thus, cpu_map_flush operation can deduct via this_cpu_ptr()
* which queue in bpf_cpu_map_entry contains packets.
*/
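To make the bulk-enqueue behaviour concrete, here is a small userspace model of it. This is only a sketch, not the kernel code: ring_model, bulk_queue_model, RING_SIZE and the flushes/drops counters are names I made up for illustration, and I am assuming the staging array is flushed when a frame arrives while it already holds BULK_SIZE entries, plus once more at the end of the poll.

```c
#include <stddef.h>

#define BULK_SIZE 8    /* same value as CPU_MAP_BULK_SIZE mentioned above */
#define RING_SIZE 64   /* arbitrary size chosen for this model */

/* Stand-in for the destination CPU's ptr_ring (illustrative only). */
struct ring_model {
    void *slot[RING_SIZE];
    size_t head;   /* total entries produced */
    size_t tail;   /* total entries consumed */
};

/* Stand-in for the per-CPU staging array in front of one ring. */
struct bulk_queue_model {
    void *q[BULK_SIZE];
    unsigned int count;     /* frames currently staged */
    unsigned int flushes;   /* number of flushes that moved frames */
    unsigned int drops;     /* frames lost because the ring was full */
    struct ring_model *dst;
};

static size_t ring_len(const struct ring_model *r)
{
    return r->head - r->tail;
}

/* Move all staged frames into the ring in one batch. */
static void bq_flush(struct bulk_queue_model *bq)
{
    if (bq->count == 0)
        return;

    for (unsigned int i = 0; i < bq->count; i++) {
        if (ring_len(bq->dst) == RING_SIZE) {
            bq->drops++;
            continue;
        }
        bq->dst->slot[bq->dst->head % RING_SIZE] = bq->q[i];
        bq->dst->head++;
    }
    bq->count = 0;
    bq->flushes++;
}

/* Stage one frame; flush first if the staging array is already full. */
static void bq_enqueue(struct bulk_queue_model *bq, void *frame)
{
    if (bq->count == BULK_SIZE)
        bq_flush(bq);
    bq->q[bq->count++] = frame;
}
```

In this model, a frame can sit in the staging array until up to 8 later frames have been staged or the final flush runs, which is the extra latency the bulking trades for MP scalability.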
Goal is to track the latency overhead added from ptr_ring and the
kthread (versus softirq, where there is less overhead). Ideally we want p50,
p90, p95, p99 percentiles.
I'm very interested in this use case of understanding the latency of
CPUMAP.
I'm a fan of latency histograms that I turn into heatmaps in grafana.
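For reference, the kind of histogram I have in mind can be sketched as a log2-bucketed array, from which approximate percentiles fall out. This is a userspace sketch with made-up names (lat_hist, hist_add, hist_percentile_bound), not code from any existing tool. It assumes latencies are in nanoseconds and that knowing the power-of-two bucket is precise enough.

```c
#include <stdint.h>

#define HIST_SLOTS 64

struct lat_hist {
    uint64_t slot[HIST_SLOTS];  /* slot[i] counts samples in [2^i, 2^(i+1)) */
    uint64_t total;
};

/* floor(log2(v)); values 0 and 1 both land in slot 0. */
static unsigned int log2_slot(uint64_t v)
{
    unsigned int s = 0;

    while (v >>= 1)
        s++;
    return s;
}

static void hist_add(struct lat_hist *h, uint64_t latency_ns)
{
    h->slot[log2_slot(latency_ns)]++;
    h->total++;
}

/*
 * Return the exclusive upper bound (in ns) of the bucket that contains
 * the pct-th percentile, or 0 if the histogram is empty.
 */
static uint64_t hist_percentile_bound(const struct lat_hist *h, unsigned int pct)
{
    uint64_t need = (h->total * pct + 99) / 100;  /* ceil(total * pct / 100) */
    uint64_t seen = 0;

    for (unsigned int i = 0; i < HIST_SLOTS; i++) {
        seen += h->slot[i];
        if (seen > 0 && seen >= need)
            return i == HIST_SLOTS - 1 ? UINT64_MAX : (1ULL << (i + 1));
    }
    return 0;
}
```

The per-slot counters are what you would export for a heatmap. The percentile helper only tells you which bucket p50/p90/p95/p99 falls into, so the resolution is a factor of two.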
To do this, we need to track every single entry enqueue time as well as
dequeue time - events that occur in the tail are quite important.
Since ptr_ring is also a ring buffer, I thought it would be easy,
reliable, and fast to just create a "shadow" ring buffer. Every time
producer enqueues entries, I'd enqueue the same number of current
timestamps onto the shadow RB. Same thing on the consumer side, except
dequeue and calculate the timestamp delta.
This idea seems overkill and will likely produce unreliable results.
E.g. the overhead of this additional ring buffer will also affect the
measurements.
I was originally planning on writing my own lockless ring buffer in pure
BPF (b/c spinlocks cannot be used w/ tracepoints yet) but was hoping I
could avoid that with this patch.
[...]
Alternatively, could add a u64 timestamp to xdp_frame, which makes all
this tracking inline (and thus more reliable). But I'm not sure how precious
the space in that struct is - I see some references online saying most drivers
save 128B headroom. I also see:
#define XDP_PACKET_HEADROOM 256
I like the inline idea. I would suggest adding a u64 timestamp to the
XDP-metadata area (ctx->data_meta, code example [1]) when XDP runs in
RX-NAPI. Then on the remote CPU you can run another CPUMAP-XDP program
that picks up this timestamp and calculates a delta from the "now" timestamp.
[1]
https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L62-L77
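The pointer handling behind the metadata idea can be modelled in plain userspace C. This is only a sketch of the logic, not an XDP program: pkt_model, meta_store_ts and meta_latency are hypothetical names. In a real program the metadata space would be reserved with bpf_xdp_adjust_meta() and the timestamps taken with bpf_ktime_get_ns(), and the verifier would require equivalent bounds checks.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-in for the packet pointers an XDP program sees. */
struct pkt_model {
    uint8_t *hard_start;  /* start of the headroom */
    uint8_t *data_meta;   /* start of metadata; equals data when empty */
    uint8_t *data;        /* start of the packet payload */
};

/*
 * RX side: grow the metadata area by 8 bytes (towards the headroom) and
 * store the timestamp there. Returns 0 on success, -1 if the headroom
 * is too small.
 */
static int meta_store_ts(struct pkt_model *p, uint64_t now_ns)
{
    if ((size_t)(p->data_meta - p->hard_start) < sizeof(now_ns))
        return -1;

    p->data_meta -= sizeof(now_ns);
    memcpy(p->data_meta, &now_ns, sizeof(now_ns));
    return 0;
}

/*
 * Remote CPU side: read the timestamp back and compute the delta to
 * "now". Returns 0 on success, -1 if no timestamp-sized metadata exists.
 */
static int meta_latency(const struct pkt_model *p, uint64_t now_ns,
                        uint64_t *delta_ns)
{
    uint64_t then_ns;

    if ((size_t)(p->data - p->data_meta) < sizeof(then_ns))
        return -1;

    memcpy(&then_ns, p->data_meta, sizeof(then_ns));
    *delta_ns = now_ns - then_ns;
    return 0;
}
```

Since the timestamp travels with the frame, no side structure has to be kept in sync with the ptr_ring, which is why I consider this more reliable than a shadow ring buffer.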
Could probably amortize the timestamp read by setting it in
bq_flush_to_queue().
To amortize, consider that you might not need to timestamp EVERY packet
to get sufficient statistics on the latency.
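One cheap way to do that sampling is a per-CPU packet counter with a power-of-two mask. A sketch, where SAMPLE_EVERY and should_timestamp are made-up names and the rate of 64 is an arbitrary assumption, not a recommendation:

```c
#include <stdint.h>

#define SAMPLE_EVERY 64  /* hypothetical sampling rate; must be a power of two */

/*
 * Returns 1 for the first call and then for every SAMPLE_EVERY-th call
 * on the given counter, 0 otherwise. The counter would be per-CPU so
 * that no atomics are needed.
 */
static int should_timestamp(uint64_t *pkt_counter)
{
    return ((*pkt_counter)++ & (SAMPLE_EVERY - 1)) == 0;
}
```

Keep in mind that sampling 1-in-N thins out the tail of the distribution too, so rare latency spikes need a longer measurement run to show up.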
Regarding bq_flush_to_queue() and the enqueue tracepoint:
trace_xdp_cpumap_enqueue(rcpu->map_id, processed, drops, to_cpu)
I have an idea for you, on how to measure the latency overhead from XDP
RX-processing to when enqueue "flush" happens. It is a little tricky to
explain, so I will outline the steps.
1. The XDP bpf_prog stores a timestamp in a per-CPU array,
unless a timestamp is already set.
2. The trace_xdp_cpumap_enqueue bpf_prog reads the per-CPU timestamp,
calculates the latency diff, and clears the timestamp.
This measures the latency overhead of bulk enqueue. (Notice: only the
first XDP redirect frame after a bq_flush_to_queue() will set the
timestamp.) This per-CPU store should work, as this all runs under the
same RX-NAPI "poll" execution.
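The two steps can be modelled in userspace C like this. It is a sketch of the bookkeeping only: start_ts, on_xdp_rx and on_cpumap_enqueue are hypothetical names, the plain array indexed by CPU stands in for a BPF per-CPU array map, and I am assuming a timestamp value of 0 can serve as "not set".

```c
#include <stdint.h>

#define NR_CPUS_MODEL 4  /* arbitrary CPU count for this model */

/* One slot per CPU; 0 means "no timestamp set". */
static uint64_t start_ts[NR_CPUS_MODEL];

/* Step 1: called from the XDP program on the RX CPU. */
static void on_xdp_rx(unsigned int cpu, uint64_t now_ns)
{
    if (start_ts[cpu] == 0)
        start_ts[cpu] = now_ns;  /* only the first frame since last flush */
}

/*
 * Step 2: called from the program attached to the enqueue tracepoint,
 * on the same CPU. Returns 1 and fills *delta_ns if a timestamp was
 * pending, 0 otherwise. The slot is cleared either way.
 */
static int on_cpumap_enqueue(unsigned int cpu, uint64_t now_ns,
                             uint64_t *delta_ns)
{
    if (start_ts[cpu] == 0)
        return 0;

    *delta_ns = now_ns - start_ts[cpu];
    start_ts[cpu] = 0;
    return 1;
}
```

The delta therefore covers the oldest frame waiting in the bulk queue, i.e. the worst case for that flush.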
This latency overhead of bulk enqueue will (unfortunately) also
count/measure the XDP_PASS packets that get processed by the normal
netstack. So, watch out for this. E.g. you could have XDP action (e.g.
XDP_PASS) counters as part of step 1, and keep a statistic for cases
where XDP_PASS interfered.
--Jesper