On Wed, Oct 28, 2020 at 12:03 PM Kevin Sheldrake
<Kevin.Sheldrake@xxxxxxxxxxxxx> wrote:
>
> > -----Original Message-----
> > From: Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx>
> > Sent: 27 October 2020 03:44
> >
> > On Mon, Oct 26, 2020 at 6:07 PM KP Singh <kpsingh@xxxxxxxxxxxx> wrote:
> > >
> > > On Mon, Oct 26, 2020 at 11:01 PM Andrii Nakryiko
> > > <andrii.nakryiko@xxxxxxxxx> wrote:
> > > >
> > > > On Mon, Oct 26, 2020 at 10:10 AM Kevin Sheldrake
> > > > <Kevin.Sheldrake@xxxxxxxxxxxxx> wrote:
> > > > >
> > > > > Hello Andrii and list
> > > > >
> > > > > I've now had a chance to properly investigate the perf ring buffer
> > > > > corruption bug. Essentially (as suspected), the size parameter that is
> > > > > specified as a __u64 in bpf_helper_defs.h is truncated into a __u16 inside
> > > > > the struct perf_event_header size parameter
> > > > > ($KERNELSRC/include/uapi/linux/perf_event.h and
> > > > > /usr/include/linux/perf_event.h).
> > >
> > > <SNIP>
> > >
> > > The size argument is of the type ARG_CONST_SIZE_OR_ZERO:
> > >
> > > static const struct bpf_func_proto bpf_perf_event_output_proto = {
> > >         .func           = bpf_perf_event_output,
> > >         .gpl_only       = true,
> > >         .ret_type       = RET_INTEGER,
> > >         .arg1_type      = ARG_PTR_TO_CTX,
> > >         .arg2_type      = ARG_CONST_MAP_PTR,
> > >         .arg3_type      = ARG_ANYTHING,
> > >         .arg4_type      = ARG_PTR_TO_MEM,
> > >         .arg5_type      = ARG_CONST_SIZE_OR_ZERO,
> > > };
> > >
> > > and we do similar checks in the verifier with BPF_MAX_VAR_SIZ:
> > >
> > > if (reg->umax_value >= BPF_MAX_VAR_SIZ) {
> > >         verbose(env, "R%d unbounded memory access, use 'var &= const' or 'if (var < const)'\n",
> > >                 regno);
> > >         return -EACCES;
> > > }
> >
> > You are right, of course, my bad. Verifier might not know the exact value, but
> > it enforces the upper bound. So in this case we can additionally enforce an
> > extra upper bound for bpf_perf_event_output().
> > Though, given this will require a kernel upgrade, I'd just stick to BPF ringbuf
> > at that point ;)
> >
> > > it's just that bpf_perf_event_output expects the size to be even
> > > smaller than 32 bits (i.e. 16 bits).
> > >
> > > > BPF verifier can't do much about that. It seems like the proper
> > > > solution is to do the check in the bpf_perf_event_output() BPF helper
> > > > itself. Returning -E2BIG is an appropriate behavior here, rather
> > > > than
> > >
> > > This could be a solution (and maybe better than the verifier check).
> > >
> > > But I do think perf needs to have the check instead of
> > > bpf_perf_event_output:
> >
> > Yeah, of course, the perf subsystem itself shouldn't allow data corruption. My
> > point was to do it as a runtime check, rather than enforce it at verification
> > time. But both would work fine.
>
> Appreciate this is a fix in the verifier and not the perf subsystem, but would something like this be acceptable?
>
> /usr/src/linux$ diff include/linux/bpf_verifier.h.orig include/linux/bpf_verifier.h
> 19a20,24
> > /* Maximum variable size permitted for size param to bpf_perf_event_output().
> > * This ensures the samples sent into the perf ring buffer do not overflow the
> > * size parameter in the perf event header.
> > */
> > #define BPF_MAX_PERF_SAMP_SIZ ((1 << (sizeof(((struct perf_event_header *)0)->size) * 8)) - 24)
> /usr/src/linux$ diff kernel/bpf/verifier.c.orig kernel/bpf/verifier.c
> 4599c4599
> < struct bpf_reg_state *regs;
> ---
> > struct bpf_reg_state *regs, *reg;
> 4653a4654,4662

You didn't use unified diff format, so it's hard to tell where exactly you
made this change. But please post a proper patch and let's review it
properly.
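For reference, pulled together as plain C the proposed check amounts to
roughly the sketch below. It is only a sketch, not a reviewed patch: the
wrapper name check_perf_event_output_size() and its call site in the
helper-argument checking path are placeholders, it assumes struct
perf_event_header is visible in kernel/bpf/verifier.c (an extra
#include <linux/perf_event.h>), and only the bound itself -- 64KB minus the
24 bytes of perf-side overhead measured above -- is taken from the quoted
hunks.

/* Sketch only: intended to live in kernel/bpf/verifier.c and be called
 * from the helper-argument checking path once func_id is known.
 */
#define BPF_MAX_PERF_SAMP_SIZ \
        ((1 << (sizeof(((struct perf_event_header *)0)->size) * 8)) - 24)

static int check_perf_event_output_size(struct bpf_verifier_env *env, int func_id)
{
        struct bpf_reg_state *regs = cur_regs(env);
        struct bpf_reg_state *reg = &regs[BPF_REG_5];   /* R5 holds the size argument */

        if (func_id != BPF_FUNC_perf_event_output)
                return 0;

        /* umax_value is the verifier's proven upper bound for the size register;
         * reject anything that could overflow perf_event_header.size once the
         * perf-side header/padding is added.
         */
        if (reg->umax_value >= BPF_MAX_PERF_SAMP_SIZ) {
                verbose(env, "bpf_perf_event_output() size must be less than %d\n",
                        (int)BPF_MAX_PERF_SAMP_SIZ);
                return -E2BIG;
        }

        return 0;
}

Whether this belongs in the verifier or as a runtime check inside
bpf_perf_event_output() returning -E2BIG is exactly the trade-off discussed
above; the bound would be the same either way.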
> > /* special check for bpf_perf_event_output() size */
> > regs = cur_regs(env);
> > reg = &regs[BPF_REG_5];
> > if (func_id == BPF_FUNC_perf_event_output && reg->umax_value >= BPF_MAX_PERF_SAMP_SIZ) {
> > verbose(env, "bpf_perf_output_event()#%d size parameter must be less than %ld\n",
> > BPF_FUNC_perf_event_output, BPF_MAX_PERF_SAMP_SIZ);
> > return -E2BIG;
> > }
> >
> 4686,4687d4694
> <
> < regs = cur_regs(env);
>
> I couldn't find the details on the size of the header/padding for the sample. The struct perf_event_header is 16 bytes, whereas the struct perf_raw_record mentioned below is 40 bytes, but the actual value determined by experimentation is 24.
>
> If acceptable, and given that it protects the perf ring buffer from corruption, could it be a candidate for back-porting?
>
> Thanks
>
> Kev
>
> >
> > > The size in the perf always seems to be u32 except the
> > > perf_event_header (I assume this is to save some space on the ring
> > > buffer)
> >
> > Probably saving space, yeah. Though it's wasting 3 lower bits because all sizes
> > seem to be multiple of 8 always. So the real limit could be 8 * 64K =
> > 512KB, easily. But it's a bit late now.
> > >
> > > struct perf_raw_frag {
> > >         union {
> > >                 struct perf_raw_frag    *next;
> > >                 unsigned long           pad;
> > >         };
> > >         perf_copy_f                     copy;
> > >         void                            *data;
> > >         u32                             size;
> > > } __packed;
> > >
> > > struct perf_raw_record {
> > >         struct perf_raw_frag frag;
> > >         u32 size;
> > > };
> > >
> > > Maybe we can just add the check to perf_event_output instead?
> > >
> > > - KP
> > > >
> > > > corrupting data. __u64 in the helper definition is fine as well, because
> > > > that's the size of BPF registers that are used to pass arguments to
> > > > helpers.
> > > > >
> > > > > Thanks
> > > > >
> > > > > Kevin Sheldrake
> > > > >
> > > > > PS I will get around to the clang/LLVM jump offset warning soon, I
> > > > > promise.
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: bpf-owner@xxxxxxxxxxxxxxx <bpf-owner@xxxxxxxxxxxxxxx> On
> > > > > > Behalf Of Kevin Sheldrake
> > > > > > Sent: 24 July 2020 10:40
> > > > > > To: Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx>
> > > > > > Cc: bpf@xxxxxxxxxxxxxxx
> > > > > > Subject: RE: [EXTERNAL] Re: Maximum size of record over perf ring buffer?
> > > > > >
> > > > > > Hello Andrii
> > > > > >
> > > > > > Thank you for taking a look at this. While the size is reported
> > > > > > correctly to the consumer (bar padding, etc), the actual offsets
> > > > > > between adjacent pointers appear to either have been cast to a
> > > > > > u16 or otherwise masked with 0xFFFF, causing what I believe to
> > > > > > be overlapping samples and the opportunity for sample corruption in
> > > > > > the overlapped regions.
> > > > > >
> > > > > > Thanks again
> > > > > >
> > > > > > Kev
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx>
> > > > > > Sent: 23 July 2020 20:05
> > > > > > To: Kevin Sheldrake <Kevin.Sheldrake@xxxxxxxxxxxxx>
> > > > > > Cc: bpf@xxxxxxxxxxxxxxx
> > > > > > Subject: Re: [EXTERNAL] Re: Maximum size of record over perf ring buffer?
> > > > > >
> > > > > > On Mon, Jul 20, 2020 at 4:39 AM Kevin Sheldrake
> > > > > > <Kevin.Sheldrake@xxxxxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > Hello
> > > > > > >
> > > > > > > Thank you for your response; I hope you don't mind me top-posting.
> > > > > > > I've put together a POC that demonstrates my results. Edit the size
> > > > > > > of the data char array in event_defs.h to change the behaviour.
> > > > > > >
> > > > > > > https://github.com/microsoft/OMS-Auditd-Plugin/tree/MSTIC-Research/ebpf_perf_output_poc
> > > > > >
> > > > > > I haven't run your program, but I can certainly reproduce this
> > > > > > using bench_perfbuf in selftests. It does seem like something is
> > > > > > silently corrupted, because the size reported by perf is correct
> > > > > > (plus/minus a few bytes, probably rounding up to 8 bytes), but the
> > > > > > contents are not correct. I have no idea why that's happening;
> > > > > > maybe someone more familiar with the perf subsystem can take a
> > > > > > look.
> > > > > > >
> > > > > > > Unfortunately, our project aims to run on older kernels than 5.8 so the
> > > > > > > bpf ring buffer won't work for us.
> > > > > > >
> > > > > > > Thanks again
> > > > > > >
> > > > > > > Kevin Sheldrake
> > > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: bpf-owner@xxxxxxxxxxxxxxx <bpf-owner@xxxxxxxxxxxxxxx> On
> > > > > > > Behalf Of Andrii Nakryiko
> > > > > > > Sent: 20 July 2020 05:35
> > > > > > > To: Kevin Sheldrake <Kevin.Sheldrake@xxxxxxxxxxxxx>
> > > > > > > Cc: bpf@xxxxxxxxxxxxxxx
> > > > > > > Subject: [EXTERNAL] Re: Maximum size of record over perf ring buffer?
> > > > > > >
> > > > > > > On Fri, Jul 17, 2020 at 7:24 AM Kevin Sheldrake
> > > > > > > <Kevin.Sheldrake@xxxxxxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > Hello
> > > > > > > >
> > > > > > > > I'm building a tool using EBPF/libbpf/C and I've run into an issue that I'd
> > > > > > > > like to ask about. I haven't managed to find documentation for
> > > > > > > > the maximum size of a record that can be sent over the perf ring
> > > > > > > > buffer, but experimentation (on kernel 5.3 (x64) with latest
> > > > > > > > libbpf from github) suggests it is just short of 64KB. Please
> > > > > > > > could someone confirm if that's the case or not? My experiments
> > > > > > > > suggest that sending a record that is greater than 64KB results
> > > > > > > > in the size reported in the callback being correct but the
> > > > > > > > records overlapping, causing corruption if they are not serviced
> > > > > > > > as quickly as they arrive. Setting the record to exactly 64KB results in
> > > > > > > > no records being received at all.
> > > > > > > >
> > > > > > > > For reference, I'm using perf_buffer__new() and perf_buffer__poll() on
> > > > > > > > the userland side; and bpf_perf_event_output(ctx, &event_map,
> > > > > > > > BPF_F_CURRENT_CPU, event, sizeof(event_s)) on the EBPF side.
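For concreteness, that setup corresponds roughly to the sketch below. The
names event_map and event_s mirror the ones mentioned above; everything else
(the scratch array map used because the event is far too large for the
512-byte BPF stack, the attach point, and the userspace loader) is an
illustrative assumption, and perf_buffer__new() is shown with the current
libbpf signature (older libbpf passed the callbacks via struct
perf_buffer_opts).

/* BPF side (sketch, built with clang -target bpf): stage a large event in a
 * scratch map, then push it to the perf buffer with bpf_perf_event_output().
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct event_s {
        __u32 pid;
        char data[60000];       /* close to the ~64KB limit discussed in this thread */
};

struct {
        __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
        __uint(key_size, sizeof(__u32));
        __uint(value_size, sizeof(__u32));
} event_map SEC(".maps");

struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, struct event_s);
} scratch SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_execve")
int send_event(void *ctx)
{
        __u32 zero = 0;
        struct event_s *event = bpf_map_lookup_elem(&scratch, &zero);

        if (!event)
                return 0;
        event->pid = bpf_get_current_pid_tgid() >> 32;
        bpf_perf_event_output(ctx, &event_map, BPF_F_CURRENT_CPU,
                              event, sizeof(*event));
        return 0;
}

char LICENSE[] SEC("license") = "GPL";

And the corresponding userspace consumer (separate file, linked against
libbpf):

#include <bpf/libbpf.h>

static void handle_event(void *ctx, int cpu, void *data, __u32 size)
{
        /* 'size' is derived from perf_event_header.size, i.e. a __u16,
         * which is exactly where samples bigger than 64KB get truncated. */
}

int consume(int event_map_fd)
{
        struct perf_buffer *pb;

        pb = perf_buffer__new(event_map_fd, 64 /* pages per CPU */,
                              handle_event, NULL /* lost_cb */, NULL, NULL);
        if (!pb)
                return -1;
        while (perf_buffer__poll(pb, 100 /* ms */) >= 0)
                ;
        perf_buffer__free(pb);
        return 0;
}

A single scratch map shared by all CPUs keeps the sketch short; a real tool
would shard it per CPU or otherwise deal with concurrent writers.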
> > > > > > > >
> > > > > > > > Additionally, is there a better architecture for sending large volumes of
> > > > > > > > data (>64KB) back from the EBPF program to userland, such as a
> > > > > > > > different ring buffer, a map, some kind of shared mmaped
> > > > > > > > segment, etc, other than simply fragmenting the data? Please
> > > > > > > > excuse my naivety as I'm relatively new to the world of EBPF.
> > > > > > >
> > > > > > > I'm not aware of any such limitations for the perf ring buffer and
> > > > > > > I haven't had a chance to validate this. It would be great if you can
> > > > > > > provide a small repro so that someone can take a deeper look; it does
> > > > > > > sound like a bug, if you really get clobbered data. It might be
> > > > > > > actually how you set up perfbuf; AFAIK, it has a mode where it
> > > > > > > will overwrite the data if it's not consumed quickly enough, but you
> > > > > > > need to consciously enable that mode.
> > > > > > >
> > > > > > > But apart from that, shameless plug here, you can try the new BPF ring
> > > > > > > buffer ([0]), available in 5.8+ kernels. It will allow you to
> > > > > > > avoid the extra copy of data you get with bpf_perf_event_output(),
> > > > > > > if you use BPF ringbuf's
> > > > > > > bpf_ringbuf_reserve() + bpf_ringbuf_submit() API. It also has a
> > > > > > > bpf_ringbuf_output() API, which is logically equivalent to
> > > > > > > bpf_perf_event_output(). And it has a very high limit on sample
> > > > > > > size, up to 512MB per sample.
> > > > > > >
> > > > > > > Keep in mind, BPF ringbuf is an MPSC design and if you use just one BPF
> > > > > > > ringbuf across all CPUs, you might run into some contention
> > > > > > > across multiple CPUs. It is acceptable in a lot of applications I
> > > > > > > was targeting, but if you have a high frequency of events (keep
> > > > > > > in mind, throughput doesn't matter, only contention on sample
> > > > > > > reservation matters), you might want to use an array of BPF
> > > > > > > ringbufs to scale throughput. You can do 1 ringbuf per each CPU
> > > > > > > for ultimate performance at the expense of memory usage (that's
> > > > > > > the perf ring buffer setup), but BPF ringbuf is flexible enough to
> > > > > > > allow any topology that makes sense for your use case, from 1 shared
> > > > > > > ringbuf across all CPUs, to anything in between.
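To make the reserve/submit pattern concrete, here is a minimal sketch; the
map name, size, event layout and attach point are illustrative, not taken
from any code discussed in this thread.

/* BPF side: reserve space directly in the ring buffer, fill it in place,
 * then submit -- no intermediate copy as with bpf_perf_event_output(). */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct event_s {
        __u32 pid;
        char comm[16];
};

struct {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 16 * 1024 * 1024);  /* power of 2, multiple of page size */
} events SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_execve")
int send_event(void *ctx)
{
        struct event_s *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);

        if (!e)
                return 0;       /* reservation failed, e.g. buffer is full */
        e->pid = bpf_get_current_pid_tgid() >> 32;
        bpf_get_current_comm(e->comm, sizeof(e->comm));
        bpf_ringbuf_submit(e, 0);
        return 0;
}

char LICENSE[] SEC("license") = "GPL";

And the matching userspace consumer, using libbpf's ring_buffer API:

#include <bpf/libbpf.h>

static int handle_event(void *ctx, void *data, size_t size)
{
        /* process one sample of 'size' bytes */
        return 0;
}

int consume(int ringbuf_map_fd)
{
        struct ring_buffer *rb = ring_buffer__new(ringbuf_map_fd, handle_event, NULL, NULL);

        if (!rb)
                return -1;
        while (ring_buffer__poll(rb, 100 /* ms */) >= 0)
                ;
        ring_buffer__free(rb);
        return 0;
}

In a real tool the map fd would come from the loaded object (e.g.
bpf_map__fd()), and, as noted above, a single ringbuf like this can be shared
by all CPUs or sharded into several to reduce reservation contention.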