Re: [PATCH] bpf: smp_wmb before bpf_ringbuf really commit

On Fri, Nov 1, 2024 at 1:30 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
>
> On Fri, Nov 01, 2024 at 12:12:58PM -0700, Andrii Nakryiko wrote:
> > On Thu, Oct 31, 2024 at 9:19 AM zhongjinji <zhongjinji@xxxxxxxxx> wrote:
> > >
> > > >It seems you didn't use the ringbuf related APIs (e.g.,
> > > >ring_buffer__consume()) in libbpf to access the ring buffer. In my
> > > >understanding, the "xchg(&hdr->len, new_len)" in bpf_ringbuf_commit()
> > > >works as the barrier to ensure the order of committed data and hdr->len,
> > > >and accordingly the "smp_load_acquire(len_ptr)" in
> > > >ringbuf_process_ring() in libbpf works as the paired barrier to ensure
> > > >the order of hdr->len and the committed data. So I think the extra
> > > >smp_wmb() in the kernel is not necessary here. Instead, you should fix
> > > >your code in the userspace.
> > > >>
> > > >> Signed-off-by: zhongjinji <zhongjinji@xxxxxxxxx>
> > > >> ---
> > > >>  kernel/bpf/ringbuf.c | 4 ++++
> > > >>  1 file changed, 4 insertions(+)
> > > >>
> > > >> diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
> > > >> index e1cfe890e0be..a66059e2b0d6 100644
> > > >> --- a/kernel/bpf/ringbuf.c
> > > >> +++ b/kernel/bpf/ringbuf.c
> > > >> @@ -508,6 +508,10 @@ static void bpf_ringbuf_commit(void *sample, u64 flags, bool discard)
> > > >>      rec_pos = (void *)hdr - (void *)rb->data;
> > > >>      cons_pos = smp_load_acquire(&rb->consumer_pos) & rb->mask;
> > > >>
> > > >> +    /* Make sure the modification of data is visible on other CPU's
> > > >> +     * before consume the event
> > > >> +     */
> > > >> +    smp_wmb();
>
> This won't order the prior smp_load_acquire(), but you knew that already.
>
> > > >>      if (flags & BPF_RB_FORCE_WAKEUP)
> > > >>              irq_work_queue(&rb->work);
> > > >>      else if (cons_pos == rec_pos && !(flags & BPF_RB_NO_WAKEUP))
> > >
> > > Well, I missed the x86 implementation of __smp_store_release; the
> > > extra barrier is not necessary there because __smp_store_release
> > > already includes a memory barrier on x86,
> >
> > yep, we do rely on xchg() and smp_load_acquire() pairing. I'm not
> > familiar with arm64 specifics, I didn't know that this assumption is
> > x86-64-specific. Cc'ed Paul who I believe suggested xchg() as a more
> > performant way of achieving this. That happened years ago so all of
> > us, I'm sure, lost a lot of context by now, but still, let's see.
>
> The xchg() is supposed to be fully ordered on all architectures,
> including arm64.  And smp_load_acquire() provides acquire ordering,
> again on all architectures.  So if you have something like this

I think this is what we need and should be sufficient. Thank you,
Paul, for confirming! Seems like we don't really need to add anything
extra here.

> with x, y, and z all initially zero:
>
>         CPU 0                           CPU 1
>
>         WRITE_ONCE(x, 1);               r3 = smp_load_acquire(&z);
>         r1 = READ_ONCE(y);              r4 = READ_ONCE(x);
>         r2 = xchg(&z, 1);               WRITE_ONCE(y, 1);
>
> Then if r3==1, it is guaranteed that r1==0 and r4==1.
>
> Or are you looking for some other ordering guarantee, and if so, what
> is it?
>
>                                                         Thanx, Paul
>
> > > but it can't guarantee the order of the committed data on arm64,
> > > because __smp_store_release does not include a full memory barrier
> > > and just ensures the variable itself is visible. The arm64
> > > implementation is as follows, in common/arch/arm64/include/asm/barrier.h:
> > >
> > > #define __smp_load_acquire(p)                                           \
> > > ({                                                                      \
> > >         union { __unqual_scalar_typeof(*p) __val; char __c[1]; } __u;   \
> > >         typeof(p) __p = (p);                                            \
> > >         compiletime_assert_atomic_type(*p);                             \
> > >         kasan_check_read(__p, sizeof(*p));                              \
> > >         switch (sizeof(*p)) {                                           \
> > >         case 1:                                                         \
> > >                 asm volatile ("ldarb %w0, %1"                           \
> > >                         : "=r" (*(__u8 *)__u.__c)                       \
> > >                         : "Q" (*__p) : "memory");                       \
> > >                 break;                                                  \
> > >         case 2:                                                         \
> > >                 asm volatile ("ldarh %w0, %1"                           \
> > >                         : "=r" (*(__u16 *)__u.__c)                      \
> > >                         : "Q" (*__p) : "memory");                       \
> > >                 break;                                                  \
> > >         case 4:                                                         \
> > >                 asm volatile ("ldar %w0, %1"                            \
> > >                         : "=r" (*(__u32 *)__u.__c)                      \
> > >                         : "Q" (*__p) : "memory");                       \
> > >                 break;                                                  \
> > >         case 8:                                                         \
> > >                 asm volatile ("ldar %0, %1"                             \
> > >                         : "=r" (*(__u64 *)__u.__c)                      \
> > >                         : "Q" (*__p) : "memory");                       \
> > >                 break;                                                  \
> > >         }                                                               \
> > >         (typeof(*p))__u.__val;                                          \
> > > })
> > >
