On Wed, Sep 11, 2024 at 10:32:56AM GMT, Jesper Dangaard Brouer wrote:
>
>
> On 11/09/2024 06.43, Daniel Xu wrote:
> > [cc Jesper]
> >
> > On Tue, Sep 10, 2024, at 8:31 PM, Daniel Xu wrote:
> > > On Tue, Sep 10, 2024 at 05:39:55PM GMT, Andrii Nakryiko wrote:
> > > > On Tue, Sep 10, 2024 at 4:44 PM Daniel Xu <dxu@xxxxxxxxx> wrote:
> > > > >
> > > > > On Tue, Sep 10, 2024 at 03:21:04PM GMT, Andrii Nakryiko wrote:
> > > > > > On Tue, Sep 10, 2024 at 3:16 PM Daniel Xu <dxu@xxxxxxxxx> wrote:
> > > > > > >
>
> [...cut...]
>
> > > > Can you give us a bit more details on what you are trying to
> > > > achieve?
> > >
> > > BPF cpumap, under the hood, has one MPSC ring buffer (ptr_ring) for
> > > each entry in the cpumap. When a prog redirects to an entry in the
> > > cpumap, the machinery queues up the xdp frame onto the destination
> > > CPU's ptr_ring. This can occur on any CPU, thus multi-producer. On
> > > the processing side, there is only the kthread created by the
> > > cpumap entry and bound to the specific CPU that is consuming
> > > entries. So single consumer.
> > >
>
> An important detail: to get multi-producer (MP) to scale, the CPUMAP
> does bulk enqueue into the ptr_ring. It stores the xdp_frames in a
> per-CPU array and does the flush/enqueue as part of xdp_do_flush().
> Because I was afraid of this adding latency, I chose to also flush
> every 8 frames (CPU_MAP_BULK_SIZE).
>
> Looking at the code, I see this is also explained in a comment:
>
> /* General idea: XDP packets getting XDP redirected to another CPU,
>  * will maximum be stored/queued for one driver ->poll() call. It is
>  * guaranteed that queueing the frame and the flush operation happen on
>  * same CPU. Thus, cpu_map_flush operation can deduct via this_cpu_ptr()
>  * which queue in bpf_cpu_map_entry contains packets.
>  */
>
> > > The goal is to track the latency overhead added by the ptr_ring
> > > and the kthread (versus softirq, where there is less overhead).
> > > Ideally we want p50, p90, p95, and p99 percentiles.
> > >
>
> I'm very interested in this use case of understanding the latency of
> CPUMAP. I'm a fan of latency histograms that I turn into heatmaps in
> grafana.
>
> > > To do this, we need to track every single entry's enqueue time as
> > > well as its dequeue time - events that occur in the tail are quite
> > > important.
> > >
> > > Since ptr_ring is also a ring buffer, I thought it would be easy,
> > > reliable, and fast to just create a "shadow" ring buffer. Every
> > > time the producer enqueues entries, I'd enqueue the same number of
> > > current timestamps onto the shadow RB. Same thing on the consumer
> > > side, except dequeue and calculate the timestamp delta.
> > >
>
> This idea seems overkill and will likely produce unreliable results.
> E.g. the overhead of this additional ring buffer will also affect the
> measurements.

Yeah, good point.

> > > I was originally planning on writing my own lockless ring buffer
> > > in pure BPF (b/c spinlocks cannot be used w/ tracepoints yet) but
> > > was hoping I could avoid that with this patch.
> >
> > [...]
> >
> > Alternatively, could add a u64 timestamp to xdp_frame, which makes
> > all this tracking inline (and thus more reliable). But I'm not sure
> > how precious the space in that struct is - I see some references
> > online saying most drivers save 128B headroom. I also see:
> >
> > #define XDP_PACKET_HEADROOM 256
>
> I like the inline idea. I would suggest adding a u64 timestamp into
> the XDP-metadata area (ctx->data_meta, code example [1]) when XDP runs
> in RX-NAPI. Then at the remote CPU you can run another CPUMAP-XDP
> program that picks up this timestamp and calculates a delta from a
> "now" timestamp.
>
> [1] https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L62-L77

Cool! This is a much better idea than mine :) I'll give this a try.
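Roughly what I have in mind, as a completely untested sketch (the map
definition, section names, and the hardcoded target CPU below are mine,
just for illustration):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct meta {
	__u64 rx_tstamp;
};

struct {
	__uint(type, BPF_MAP_TYPE_CPUMAP);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(struct bpf_cpumap_val));
	__uint(max_entries, 64);
} cpu_map SEC(".maps"); /* entries (qsize + prog fd) filled by userspace */

/* Runs in RX-NAPI on the receiving CPU: stamp the frame, then redirect. */
SEC("xdp")
int xdp_rx_stamp(struct xdp_md *ctx)
{
	struct meta *meta;
	void *data;

	/* Grow the metadata area in front of the packet. */
	if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
		return XDP_PASS;

	data = (void *)(long)ctx->data;
	meta = (void *)(long)ctx->data_meta;
	if ((void *)(meta + 1) > data)	/* bounds check for the verifier */
		return XDP_PASS;

	meta->rx_tstamp = bpf_ktime_get_ns();
	return bpf_redirect_map(&cpu_map, 0, 0); /* CPU 0 just for illustration */
}

/* Attached to the cpumap entry: runs in the kthread on the remote CPU. */
SEC("xdp/cpumap")
int xdp_cpumap_delta(struct xdp_md *ctx)
{
	void *data = (void *)(long)ctx->data;
	struct meta *meta = (void *)(long)ctx->data_meta;
	__u64 delta;

	if ((void *)(meta + 1) > data)	/* no metadata -> nothing to measure */
		return XDP_PASS;

	delta = bpf_ktime_get_ns() - meta->rx_tstamp;
	bpf_printk("cpumap latency: %llu ns", delta);
	/* Real version would bucket delta into a percpu histogram map and
	 * compute p50/p90/p95/p99 in userspace. */
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";

I still need to check that metadata set in the RX prog actually
survives into the cpumap prog on our drivers, but if it does, this is
exactly the inline tracking I wanted.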
> > Could probably amortize the timestamp read by setting it in
> > bq_flush_to_queue().
>
> To amortize, consider that you might not need to timestamp EVERY
> packet to get sufficient statistics on the latency.
>
> Regarding bq_flush_to_queue() and the enqueue tracepoint:
>   trace_xdp_cpumap_enqueue(rcpu->map_id, processed, drops, to_cpu)
>
> I have an idea for you on how to measure the latency overhead from
> XDP RX-processing to when the enqueue "flush" happens. It is a little
> tricky to explain, so I will outline the steps.
>
> 1. The XDP bpf_prog stores a timestamp in a per-CPU array,
>    unless the timestamp is already set.
>
> 2. The trace_xdp_cpumap_enqueue bpf_prog reads the per-CPU timestamp,
>    calculates the latency diff, and clears the timestamp.
>
> This measures the latency overhead of bulk enqueue. (Notice: only the
> first XDP-redirected frame after a bq_flush_to_queue() will set the
> timestamp.) This per-CPU store should work, as this all runs under the
> same RX-NAPI "poll" execution.

Makes sense to me. This breaks down the latency even further. I'll keep
it in mind if we need further troubleshooting (rough sketch of how I
read it below my signature).

> This latency overhead of bulk enqueue will (unfortunately) also
> count/measure the XDP_PASS packets that get processed by the normal
> netstack. So, watch out for this. E.g. you could have XDP action
> counters (e.g. XDP_PASS) as part of step 1, and keep statistics for
> cases where XDP_PASS interfered.

Not sure I got this. If we only set the percpu timestamp for
XDP_REDIRECT frames, then I don't see how XDP_PASS interferes. Maybe I
misunderstand something.

Thanks,
Daniel
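P.S. Here is roughly how I read your two-step idea, again completely
untested (map and program names are mine):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* One timestamp slot per CPU; armed by the XDP prog, disarmed by the
 * enqueue tracepoint. Works because both run on the same CPU within
 * one NAPI poll. */
struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} first_redirect_ts SEC(".maps");

/* Step 1: call from the RX XDP prog right before returning
 * XDP_REDIRECT, so XDP_PASS frames never arm the timestamp. */
static __always_inline void arm_timestamp(void)
{
	__u32 key = 0;
	__u64 *ts = bpf_map_lookup_elem(&first_redirect_ts, &key);

	if (ts && !*ts)	/* only the first frame after a flush */
		*ts = bpf_ktime_get_ns();
}

/* Step 2: fires when bq_flush_to_queue() hits the tracepoint. */
SEC("tp_btf/xdp_cpumap_enqueue")
int BPF_PROG(measure_bulk_enqueue, int map_id, unsigned int processed,
	     unsigned int drops, int to_cpu)
{
	__u32 key = 0;
	__u64 *ts = bpf_map_lookup_elem(&first_redirect_ts, &key);

	if (ts && *ts) {
		bpf_printk("bulk enqueue latency: %llu ns",
			   bpf_ktime_get_ns() - *ts);
		*ts = 0;	/* re-arm for the next flush */
	}
	return 0;
}

char _license[] SEC("license") = "GPL";

Arming only on XDP_REDIRECT is also why I'm confused about the XDP_PASS
interference you mention above.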