On Wed, Sep 11, 2024 at 10:32:56AM GMT, Jesper Dangaard Brouer wrote:
>
>
> On 11/09/2024 06.43, Daniel Xu wrote:
> > [cc Jesper]
> >
> > On Tue, Sep 10, 2024, at 8:31 PM, Daniel Xu wrote:
> > > On Tue, Sep 10, 2024 at 05:39:55PM GMT, Andrii Nakryiko wrote:
> > > > On Tue, Sep 10, 2024 at 4:44 PM Daniel Xu <dxu@xxxxxxxxx> wrote:
> > > > >
> > > > > On Tue, Sep 10, 2024 at 03:21:04PM GMT, Andrii Nakryiko wrote:
> > > > > > On Tue, Sep 10, 2024 at 3:16 PM Daniel Xu <dxu@xxxxxxxxx> wrote:
> > > > > > >
>
> [...cut...]
>
> > > > Can you give us a bit more details on what you are trying to
> > > > achieve?
> > >
> > > BPF cpumap, under the hood, has one MPSC ring buffer (ptr_ring) for
> > > each entry in the cpumap. When a prog redirects to an entry in the
> > > cpumap, the machinery queues up the xdp frame onto the destination
> > > CPU's ptr_ring. This can occur on any CPU, thus multi-producer. On
> > > the processing side, there is only the kthread created by the
> > > cpumap entry and bound to the specific CPU that is consuming
> > > entries. So single consumer.
> > >
>
> An important detail: to get multi-producer (MP) to scale, the CPUMAP
> does bulk enqueue into the ptr_ring. It stores the xdp_frames in a
> per-CPU array and does the flush/enqueue as part of xdp_do_flush().
> Because I was afraid of this adding latency, I chose to also flush
> every 8 frames (CPU_MAP_BULK_SIZE).
>
> Looking at the code, I see this is also explained in a comment:
>
> /* General idea: XDP packets getting XDP redirected to another CPU,
>  * will maximum be stored/queued for one driver ->poll() call. It is
>  * guaranteed that queueing the frame and the flush operation happen on
>  * same CPU. Thus, cpu_map_flush operation can deduct via this_cpu_ptr()
>  * which queue in bpf_cpu_map_entry contains packets.
>  */
>
> > > The goal is to track the latency overhead added by the ptr_ring
> > > and the kthread (versus softirq, where there is less overhead).
> > > Ideally we want p50, p90, p95, and p99 percentiles.
> > >
>
> I'm very interested in this use case of understanding the latency of
> CPUMAP. I'm a fan of latency histograms that I turn into heatmaps in
> grafana.
>
> > > To do this, we need to track every single entry's enqueue time as
> > > well as its dequeue time - events that occur in the tail are quite
> > > important.
> > >
> > > Since ptr_ring is also a ring buffer, I thought it would be easy,
> > > reliable, and fast to just create a "shadow" ring buffer. Every
> > > time the producer enqueues entries, I'd enqueue the same number of
> > > current timestamps onto the shadow RB. Same thing on the consumer
> > > side, except dequeue and calculate the timestamp delta.
> > >
>
> This idea seems overkill and will likely produce unreliable results.
> E.g. the overhead of this additional ring buffer will also affect the
> measurements.

Yeah, good point.

> > > I was originally planning on writing my own lockless ring buffer
> > > in pure BPF (b/c spinlocks cannot be used w/ tracepoints yet) but
> > > was hoping I could avoid that with this patch.
> >
> > [...]
> >
> > Alternatively, could add a u64 timestamp to xdp_frame, which makes
> > all this tracking inline (and thus more reliable). But I'm not sure
> > how precious the space in that struct is - I see some references
> > online saying most drivers save 128B headroom. I also see:
> >
> > #define XDP_PACKET_HEADROOM 256
>
> I like the inline idea. I would suggest adding a u64 timestamp into
> the XDP-metadata area (ctx->data_meta, code example [1]) when XDP runs
> in RX-NAPI. Then at the remote CPU you can run another CPUMAP-XDP
> program that picks up this timestamp and calculates a delta from a
> "now" timestamp.
>
> [1] https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L62-L77

Cool! This is a much better idea than mine :) I'll give this a try.
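Roughly what I have in mind, as a completely untested sketch (the map
definition, section names, and the hardcoded target CPU below are mine,
just for illustration):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct meta {
	__u64 rx_tstamp;
};

struct {
	__uint(type, BPF_MAP_TYPE_CPUMAP);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(struct bpf_cpumap_val));
	__uint(max_entries, 64);
} cpu_map SEC(".maps"); /* entries (qsize + prog fd) filled by userspace */

/* Runs in RX-NAPI on the receiving CPU: stamp the frame, then redirect. */
SEC("xdp")
int xdp_rx_stamp(struct xdp_md *ctx)
{
	struct meta *meta;
	void *data;

	/* Grow the metadata area in front of the packet. */
	if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
		return XDP_PASS;

	data = (void *)(long)ctx->data;
	meta = (void *)(long)ctx->data_meta;
	if ((void *)(meta + 1) > data)	/* bounds check for the verifier */
		return XDP_PASS;

	meta->rx_tstamp = bpf_ktime_get_ns();
	return bpf_redirect_map(&cpu_map, 0, 0); /* CPU 0 just for illustration */
}

/* Attached to the cpumap entry: runs in the kthread on the remote CPU. */
SEC("xdp/cpumap")
int xdp_cpumap_delta(struct xdp_md *ctx)
{
	void *data = (void *)(long)ctx->data;
	struct meta *meta = (void *)(long)ctx->data_meta;
	__u64 delta;

	if ((void *)(meta + 1) > data)	/* no metadata -> nothing to measure */
		return XDP_PASS;

	delta = bpf_ktime_get_ns() - meta->rx_tstamp;
	bpf_printk("cpumap latency: %llu ns", delta);
	/* Real version would bucket delta into a percpu histogram map and
	 * compute p50/p90/p95/p99 in userspace. */
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";

I still need to check that metadata set in the RX prog actually
survives into the cpumap prog on our drivers, but if it does, this is
exactly the inline tracking I wanted.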
> > Could probably amortize the timestamp read by setting it in
> > bq_flush_to_queue().
>
> To amortize, consider that you might not need to timestamp EVERY
> packet to get sufficient statistics on the latency.
>
> Regarding bq_flush_to_queue() and the enqueue tracepoint:
>   trace_xdp_cpumap_enqueue(rcpu->map_id, processed, drops, to_cpu)
>
> I have an idea for you on how to measure the latency overhead from
> XDP RX-processing to when the enqueue "flush" happens. It is a little
> tricky to explain, so I will outline the steps.
>
> 1. The XDP bpf_prog stores a timestamp in a per-CPU array,
>    unless the timestamp is already set.
>
> 2. The trace_xdp_cpumap_enqueue bpf_prog reads the per-CPU timestamp,
>    calculates the latency diff, and clears the timestamp.
>
> This measures the latency overhead of bulk enqueue. (Notice: only the
> first XDP-redirected frame after a bq_flush_to_queue() will set the
> timestamp.) This per-CPU store should work, as this all runs under the
> same RX-NAPI "poll" execution.

Makes sense to me. This breaks down the latency even further. I'll keep
it in mind if we need further troubleshooting (rough sketch of how I
read it below my signature).

> This latency overhead of bulk enqueue will (unfortunately) also
> count/measure the XDP_PASS packets that get processed by the normal
> netstack. So, watch out for this. E.g. you could have XDP action
> counters (e.g. XDP_PASS) as part of step 1, and keep statistics for
> cases where XDP_PASS interfered.

Not sure I got this. If we only set the percpu timestamp for
XDP_REDIRECT frames, then I don't see how XDP_PASS interferes. Maybe I
misunderstand something.

Thanks,
Daniel
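P.S. Here is roughly how I read your two-step idea, again completely
untested (map and program names are mine):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* One timestamp slot per CPU; armed by the XDP prog, disarmed by the
 * enqueue tracepoint. Works because both run on the same CPU within
 * one NAPI poll. */
struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} first_redirect_ts SEC(".maps");

/* Step 1: call from the RX XDP prog right before returning
 * XDP_REDIRECT, so XDP_PASS frames never arm the timestamp. */
static __always_inline void arm_timestamp(void)
{
	__u32 key = 0;
	__u64 *ts = bpf_map_lookup_elem(&first_redirect_ts, &key);

	if (ts && !*ts)	/* only the first frame after a flush */
		*ts = bpf_ktime_get_ns();
}

/* Step 2: fires when bq_flush_to_queue() hits the tracepoint. */
SEC("tp_btf/xdp_cpumap_enqueue")
int BPF_PROG(measure_bulk_enqueue, int map_id, unsigned int processed,
	     unsigned int drops, int to_cpu)
{
	__u32 key = 0;
	__u64 *ts = bpf_map_lookup_elem(&first_redirect_ts, &key);

	if (ts && *ts) {
		bpf_printk("bulk enqueue latency: %llu ns",
			   bpf_ktime_get_ns() - *ts);
		*ts = 0;	/* re-arm for the next flush */
	}
	return 0;
}

char _license[] SEC("license") = "GPL";

Arming only on XDP_REDIRECT is also why I'm confused about the XDP_PASS
interference you mention above.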