On Thu, Sep 12, 2024, at 2:40 AM, Jesper Dangaard Brouer wrote:
> On 11/09/2024 20.53, Daniel Xu wrote:
>> On Wed, Sep 11, 2024 at 10:32:56AM GMT, Jesper Dangaard Brouer wrote:
>>>
>>>
>>> On 11/09/2024 06.43, Daniel Xu wrote:
>>>> [cc Jesper]
>>>>
>>>> On Tue, Sep 10, 2024, at 8:31 PM, Daniel Xu wrote:
>>>>> On Tue, Sep 10, 2024 at 05:39:55PM GMT, Andrii Nakryiko wrote:
>>>>>> On Tue, Sep 10, 2024 at 4:44 PM Daniel Xu <dxu@xxxxxxxxx> wrote:
>>>>>>>
>>>>>>> On Tue, Sep 10, 2024 at 03:21:04PM GMT, Andrii Nakryiko wrote:
>>>>>>>> On Tue, Sep 10, 2024 at 3:16 PM Daniel Xu <dxu@xxxxxxxxx> wrote:
>>>>>>>>>
>>> [...cut...]
>>>
>>>>>> Can you give us a bit more details on what
>>>>>> you are trying to achieve?
>>>>>
>>>>> BPF cpumap, under the hood, has one MPSC ring buffer (ptr_ring) for each
>>>>> entry in the cpumap. When a prog redirects to an entry in the cpumap,
>>>>> the machinery queues up the xdp frame onto the destination CPU's ptr_ring.
>>>>> This can occur on any cpu, thus multi-producer. On the processing side,
>>>>> there is only the kthread created by the cpumap entry and bound to the
>>>>> specific cpu that is consuming entries. So single consumer.
>>>>>
>>>
>>> An important detail: to get Multi-Producer (MP) to scale, the CPUMAP does
>>> bulk enqueue into the ptr_ring. It stores the xdp_frames in a per-CPU
>>> array and does the flush/enqueue as part of xdp_do_flush(). Because
>>> I was afraid of this adding latency, I chose to also flush every 8
>>> frames (CPU_MAP_BULK_SIZE).
>>>
>>> Looking at the code I see this is also explained in a comment:
>>>
>>> /* General idea: XDP packets getting XDP redirected to another CPU,
>>>  * will maximum be stored/queued for one driver ->poll() call. It is
>>>  * guaranteed that queueing the frame and the flush operation happen on
>>>  * same CPU. Thus, cpu_map_flush operation can deduct via this_cpu_ptr()
>>>  * which queue in bpf_cpu_map_entry contains packets.
>>>  */
>>>
>>>
>>>>> The goal is to track the latency overhead added by the ptr_ring and the
>>>>> kthread (versus softirq, where there is less overhead). Ideally we want
>>>>> p50, p90, p95, p99 percentiles.
>>>>>
>>>
>>> I'm very interested in this use-case of understanding the latency of
>>> CPUMAP.
>>> I'm a fan of latency histograms that I turn into heatmaps in grafana.
>>>
>>>>> To do this, we need to track every single entry enqueue time as well as
>>>>> dequeue time - events that occur in the tail are quite important.
>>>>>
>>>>> Since ptr_ring is also a ring buffer, I thought it would be easy,
>>>>> reliable, and fast to just create a "shadow" ring buffer. Every time
>>>>> the producer enqueues entries, I'd enqueue the same number of current
>>>>> timestamps onto the shadow RB. Same thing on the consumer side, except
>>>>> dequeue and calculate the timestamp delta.
>>>>>
>>>
>>> This idea seems overkill and will likely produce unreliable results.
>>> E.g. the overhead of this additional ring buffer will also affect the
>>> measurements.
>>
>> Yeah, good point.
>>
>>>
>>>>> I was originally planning on writing my own lockless ring buffer in pure
>>>>> BPF (b/c spinlocks cannot be used w/ tracepoints yet) but was hoping I
>>>>> could avoid that with this patch.
>>>>
>>>> [...]
>>>>
>>>> Alternatively, I could add a u64 timestamp to xdp_frame, which makes all
>>>> this tracking inline (and thus more reliable). But I'm not sure how
>>>> precious the space in that struct is - I see some references online
>>>> saying most drivers reserve 128B of headroom. I also see:
>>>>
>>>> #define XDP_PACKET_HEADROOM 256
>>>>
>>>
>>> I like the inline idea. I would suggest adding a u64 timestamp into the
>>> XDP-metadata area (ctx->data_meta, code example [1]) when XDP runs in
>>> RX-NAPI. Then on the remote CPU you can run another CPUMAP-XDP program
>>> that picks up this timestamp and calcs a delta against a "now" timestamp.
>>>
>>>
>>> [1] https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L62-L77
>>
>> Cool! This is a much better idea than mine :)
>>
>> I'll give this a try.
>>
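Sketching it out while it's fresh in my mind, I think the two programs
would look roughly like this. Completely untested, and the map sizing,
the hardcoded destination CPU, and the bpf_printk() (which would become
a percpu histogram map) are all just placeholders:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_CPUMAP);
        __uint(max_entries, 64);
        __type(key, __u32);
        __type(value, struct bpf_cpumap_val);
} cpu_map SEC(".maps");

/* Timestamp carried in front of the packet, in the XDP metadata area */
struct meta {
        __u64 rx_tstamp;
};

/* Runs in RX-NAPI on the original CPU: stamp the frame, then redirect */
SEC("xdp")
int xdp_rx_stamp(struct xdp_md *ctx)
{
        struct meta *meta;
        void *data;

        /* Grow the metadata area by 8 bytes; if the driver lacks the
         * headroom (or metadata support), just skip timestamping.
         */
        if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
                return XDP_PASS;

        data = (void *)(long)ctx->data;
        meta = (void *)(long)ctx->data_meta;
        if ((void *)(meta + 1) > data)  /* bounds check for the verifier */
                return XDP_PASS;

        meta->rx_tstamp = bpf_ktime_get_ns();

        /* Destination CPU 2 hardcoded only for the example */
        return bpf_redirect_map(&cpu_map, 2, 0);
}

/* Attached to the cpumap entry (BPF_XDP_CPUMAP); runs in the kthread
 * on the remote CPU for each dequeued frame.
 */
SEC("xdp/cpumap")
int xdp_cpumap_delta(struct xdp_md *ctx)
{
        void *data = (void *)(long)ctx->data;
        struct meta *meta = (void *)(long)ctx->data_meta;
        __u64 delta;

        if ((void *)(meta + 1) > data)
                return XDP_PASS;

        delta = bpf_ktime_get_ns() - meta->rx_tstamp;
        bpf_printk("cpumap latency: %llu ns", delta); /* placeholder */

        return XDP_PASS;
}

char _license[] SEC("license") = "GPL";

IIUC the metadata travels along inside the xdp_frame, so the cpumap
program on the remote CPU should still see it.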
>>>
>>>
>>>> Could probably amortize the timestamp read by setting it in
>>>> bq_flush_to_queue().
>>>
>>> To amortize, consider that you might not need to timestamp EVERY packet
>>> to get sufficient statistics on the latency.
>>>
>>> Regarding bq_flush_to_queue() and the enqueue tracepoint:
>>>   trace_xdp_cpumap_enqueue(rcpu->map_id, processed, drops, to_cpu)
>>>
>>> I have an idea for you on how to measure the latency overhead from XDP
>>> RX-processing to when the enqueue "flush" happens. It is a little tricky
>>> to explain, so I will outline the steps.
>>>
>>> 1. The XDP bpf_prog stores a timestamp in a per-CPU array,
>>>    unless the timestamp is already set.
>>>
>>> 2. The trace_xdp_cpumap_enqueue bpf_prog reads the per-CPU timestamp,
>>>    calcs the latency diff, and clears the timestamp.
>>>
>>> This measures the latency overhead of the bulk enqueue. (Notice: only
>>> the first XDP-redirected frame after a bq_flush_to_queue() will set the
>>> timestamp.) The per-CPU store should work, as this all runs under the
>>> same RX-NAPI "poll" execution.
>>
>> Makes sense to me. This breaks down the latency even further. I'll keep
>> it in mind if we need further troubleshooting.
>>
>
> Yes, this breaks down the latency even further :-)
>
>>> This latency overhead of the bulk enqueue will (unfortunately) also
>>> count/measure the XDP_PASS packets that get processed by the normal
>>> netstack. So, watch out for this. E.g. you could have XDP action
>>> counters (e.g. for XDP_PASS) as part of step 1, and keep statistics
>>> for cases where XDP_PASS interfered.
>>
>> Not sure I got this. If we only set the percpu timestamp for
>> XDP_REDIRECT frames, then I don't see how XDP_PASS interferes. Maybe I
>> misunderstand something.
>>
>
> Not quite. You only set the timestamp for the first XDP_REDIRECT frame.
> I'm simply saying that all the packets processed *after* this timestamp
> will contribute to the time it takes until trace_xdp_cpumap_enqueue()
> runs. That part should be obvious. Then, I'm saying remember that an
> XDP_PASS packet takes more time than an XDP_REDIRECT packet. I hope this
> makes it more clear. Point: this is a pitfall you need to be aware of
> when looking at your metrics.
>
> For the inline timestamping the same pitfall actually applies. There you
> timestamp the packets themselves. The CPUMAP enqueue happens as the
> *LAST* thing in the NAPI loop, during the xdp_do_flush() call. This
> means that interleaved normal netstack (XDP_PASS) packets will be
> processed BEFORE this call to xdp_do_flush(). As noted earlier, to
> compensate for this effect, the code will enqueue-flush every 8 frames
> (CPU_MAP_BULK_SIZE).

Ah, that makes sense. Thanks for explaining.

>
> I hope I've not confused you more :-|
>

No, that was helpful! I'm also appending an untested sketch of the
percpu-array idea from your steps 1 and 2 below the sig, mostly as a
note to self.

Thanks,
Daniel
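---
Rough sketch of steps 1 and 2, same caveats as the earlier snippet
(untested; the bpf_printk() stands in for a real histogram):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_CPUMAP);
        __uint(max_entries, 64);
        __type(key, __u32);
        __type(value, struct bpf_cpumap_val);
} cpu_map SEC(".maps");

/* Scratch slot shared by the RX prog and the tracepoint prog. Safe
 * because both run on the same CPU within the same NAPI poll.
 */
struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, __u64);
} first_enq_ts SEC(".maps");

SEC("xdp")
int xdp_rx(struct xdp_md *ctx)
{
        __u32 key = 0;
        __u64 *ts = bpf_map_lookup_elem(&first_enq_ts, &key);

        /* Step 1: only the first redirected frame since the last
         * flush sets the timestamp.
         */
        if (ts && !*ts)
                *ts = bpf_ktime_get_ns();

        return bpf_redirect_map(&cpu_map, 2, 0);
}

SEC("tracepoint/xdp/xdp_cpumap_enqueue")
int tp_cpumap_enqueue(void *ctx)
{
        __u32 key = 0;
        __u64 *ts = bpf_map_lookup_elem(&first_enq_ts, &key);
        __u64 delta;

        /* Step 2: diff against the first-enqueue timestamp, then
         * clear it for the next batch.
         */
        if (!ts || !*ts)
                return 0;

        delta = bpf_ktime_get_ns() - *ts;
        *ts = 0;
        bpf_printk("bulk enqueue latency: %llu ns", delta); /* placeholder */

        return 0;
}

char _license[] SEC("license") = "GPL";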