[cc Jesper] On Tue, Sep 10, 2024, at 8:31 PM, Daniel Xu wrote: > On Tue, Sep 10, 2024 at 05:39:55PM GMT, Andrii Nakryiko wrote: >> On Tue, Sep 10, 2024 at 4:44 PM Daniel Xu <dxu@xxxxxxxxx> wrote: >> > >> > On Tue, Sep 10, 2024 at 03:21:04PM GMT, Andrii Nakryiko wrote: >> > > On Tue, Sep 10, 2024 at 3:16 PM Daniel Xu <dxu@xxxxxxxxx> wrote: >> > > > >> > > > >> > > > >> > > > On Tue, Sep 10, 2024, at 2:07 PM, Daniel Xu wrote: >> > > > > On Tue, Sep 10, 2024 at 01:41:41PM GMT, Andrii Nakryiko wrote: >> > > > >> On Tue, Sep 10, 2024 at 11:36 AM Alexei Starovoitov >> > > > [...] >> > > > > >> > > > >> >> > > > >> Also, Daniel, can you please make sure that dynptr we return for each >> > > > >> sample is read-only? We shouldn't let consumer BPF program ability to >> > > > >> corrupt ringbuf record headers (accidentally or otherwise). >> > > > > >> > > > > Sure. >> > > > >> > > > So the sample is not read-only. But I think prog is prevented from messing >> > > > with header regardless. >> > > > >> > > > __bpf_user_ringbuf_peek() returns sample past the header: >> > > > >> > > > *sample = (void *)((uintptr_t)rb->data + >> > > > (uintptr_t)((cons_pos + BPF_RINGBUF_HDR_SZ) & rb->mask)); >> > > > >> > > > dynptr is initialized with the above ptr: >> > > > >> > > > bpf_dynptr_init(&dynptr, sample, BPF_DYNPTR_TYPE_LOCAL, 0, size); >> > > > >> > > > So I don't think there's a way for the prog to access the header thru the dynptr. >> > > > >> > > >> > > By "header" I mean 8 bytes that precede each submitted ringbuf record. >> > > That header is part of ringbuf data area. Given user space can set >> > > consumer_pos to arbitrary value, kernel can return arbitrary part of >> > > ringbuf data area, including that 8 byte header. If that data is >> > > writable, it's easy to screw up that header and crash another BPF >> > > program that reserves/submits a new record. User space can only read >> > > data area for BPF ringbuf, and so we rely heavily on a tight control >> > > of who can write what into those 8 bytes. >> > >> > Ah, ok. I think I understand. >> > >> > Given this and your other comments about rb->busy, what about enforcing >> > bpf_user_ringbuf_drain() NAND mmap? I think the use cases here are >> > different enough where this makes sense. >> >> You mean disabling user-space mmap()? TBH, I'd like to understand the >> use case first before we make such decisions. Maybe what you need is >> not really a BPF ringbuf? Can you give us a bit more details on what >> you are trying to achieve? > > BPF cpumap, under the hood, has one MPSC ring buffer (ptr_ring) for each > entry in the cpumap. When a prog redirects to an entry in the cpumap, > the machinery queues up the xdp frame onto the destination CPU ptr_ring. > This can occur on any cpu, thus multi-producer. On processing side, > there is only the kthread created by the cpumap entry and bound to the > specific cpu that is consuming entries. So single consumer. > > Goal is to track the latency overhead added from ptr_ring and the > kthread (versus softirq where is less overhead). Ideally we want p50, > p90, p95, p99 percentiles. > > To do this, we need to track every single entry enqueue time as well as > dequeue time - events that occur in the tail are quite important. > > Since ptr_ring is also a ring buffer, I thought it would be easy, > reliable, and fast to just create a "shadow" ring buffer. Every time > producer enqueues entries, I'd enqueue the same number of current > timestamp onto shadow RB. Same thing on consumer side, except dequeue > and calculate timestamp delta. > > I was originally planning on writing my own lockless ring buffer in pure > BPF (b/c spinlocks cannot be used w/ tracepoints yet) but was hoping I > could avoid that with this patch. [...] Alternatively, could add a u64 timestamp to xdp_frame, which makes all this tracking inline (and thus more reliable). But I'm not sure how precious the space in that struct is - I see some references online saying most drivers save 128B headroom. I also see: #define XDP_PACKET_HEADROOM 256 Could probably amortize the timestamp read by setting it in bq_flush_to_queue().