> -----Original Message-----
> From: Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx>
> Sent: 27 October 2020 03:44
>
> On Mon, Oct 26, 2020 at 6:07 PM KP Singh <kpsingh@xxxxxxxxxxxx> wrote:
> >
> > On Mon, Oct 26, 2020 at 11:01 PM Andrii Nakryiko
> > <andrii.nakryiko@xxxxxxxxx> wrote:
> > >
> > > On Mon, Oct 26, 2020 at 10:10 AM Kevin Sheldrake
> > > <Kevin.Sheldrake@xxxxxxxxxxxxx> wrote:
> > > >
> > > > Hello Andrii and list
> > > >
> > > > I've now had the chance to properly investigate the perf ring buffer
> > > > corruption bug. Essentially (as suspected), the size parameter, which
> > > > is specified as a __u64 in bpf_helper_defs.h, is truncated into the
> > > > __u16 size member of struct perf_event_header
> > > > ($KERNELSRC/include/uapi/linux/perf_event.h and
> > > > /usr/include/linux/perf_event.h).
>
> <SNIP>
>
> > The size argument is of the type ARG_CONST_SIZE_OR_ZERO:
> >
> > static const struct bpf_func_proto bpf_perf_event_output_proto = {
> >         .func      = bpf_perf_event_output,
> >         .gpl_only  = true,
> >         .ret_type  = RET_INTEGER,
> >         .arg1_type = ARG_PTR_TO_CTX,
> >         .arg2_type = ARG_CONST_MAP_PTR,
> >         .arg3_type = ARG_ANYTHING,
> >         .arg4_type = ARG_PTR_TO_MEM,
> >         .arg5_type = ARG_CONST_SIZE_OR_ZERO,
> > };
> >
> > and we do similar checks in the verifier with BPF_MAX_VAR_SIZ:
> >
> > if (reg->umax_value >= BPF_MAX_VAR_SIZ) {
> >         verbose(env, "R%d unbounded memory access, use 'var &= const' or 'if (var < const)'\n",
> >                 regno);
> >         return -EACCES;
> > }
>
> You are right, of course, my bad. The verifier might not know the exact
> value, but it enforces the upper bound. So in this case we can additionally
> enforce an extra upper bound for bpf_perf_event_output().
> Though, given this will require a kernel upgrade, I'd just stick to BPF
> ringbuf at that point ;)
>
> > it's just that bpf_perf_event_output expects the size to be even
> > smaller than 32 bits (i.e. 16 bits).
>
> > > BPF verifier can't do much about that.
> > > It seems like the proper solution is to do the check in the
> > > bpf_perf_event_output() BPF helper itself. Returning -E2BIG is an
> > > appropriate behavior here, rather than
> >
> > This could be a solution (and maybe better than the verifier check).
> >
> > But I do think perf needs to have the check instead of
> > bpf_perf_event_output:
>
> Yeah, of course, the perf subsystem itself shouldn't allow data corruption.
> My point was to do it as a runtime check, rather than enforce it at
> verification time. But both would work fine.

I appreciate this is a fix in the verifier and not the perf subsystem, but
would something like this be acceptable?

/usr/src/linux$ diff include/linux/bpf_verifier.h.orig include/linux/bpf_verifier.h
19a20,24
> /* Maximum variable size permitted for size param to bpf_perf_event_output().
>  * This ensures the samples sent into the perf ring buffer do not overflow the
>  * size parameter in the perf event header.
>  */
> #define BPF_MAX_PERF_SAMP_SIZ ((1 << (sizeof(((struct perf_event_header *)0)->size) * 8)) - 24)

/usr/src/linux$ diff kernel/bpf/verifier.c.orig kernel/bpf/verifier.c
4599c4599
< 	struct bpf_reg_state *regs;
---
> 	struct bpf_reg_state *regs, *reg;
4653a4654,4662
> 	/* special check for bpf_perf_event_output() size */
> 	regs = cur_regs(env);
> 	reg = &regs[BPF_REG_5];
> 	if (func_id == BPF_FUNC_perf_event_output && reg->umax_value >= BPF_MAX_PERF_SAMP_SIZ) {
> 		verbose(env, "bpf_perf_event_output()#%d size parameter must be less than %ld\n",
> 			BPF_FUNC_perf_event_output, BPF_MAX_PERF_SAMP_SIZ);
> 		return -E2BIG;
> 	}
>
4686,4687d4694
<
< 	regs = cur_regs(env);

I couldn't find the details on the size of the header/padding for the sample.
The struct perf_event_header is 16 bytes, whereas the struct perf_raw_record
mentioned below is 40 bytes, but the actual value determined by
experimentation is 24.

If acceptable, and given that it protects the perf ring buffer from
corruption, could it be a candidate for back-porting?
Thanks

Kev

> > The size in perf always seems to be u32, except in the
> > perf_event_header (I assume this is to save some space on the ring
> > buffer)
>
> Probably saving space, yeah. Though it's wasting the 3 lower bits, because
> all sizes seem to be a multiple of 8 always. So the real limit could easily
> have been 8 * 64K = 512KB. But it's a bit late now.
>
> > struct perf_raw_frag {
> >         union {
> >                 struct perf_raw_frag    *next;
> >                 unsigned long           pad;
> >         };
> >         perf_copy_f                     copy;
> >         void                            *data;
> >         u32                             size;
> > } __packed;
> >
> > struct perf_raw_record {
> >         struct perf_raw_frag            frag;
> >         u32                             size;
> > };
> >
> > Maybe we can just add the check to perf_event_output instead?
> >
> > - KP
>
> > > corrupting data. __u64 in the helper definition is fine as well,
> > > because that's the size of the BPF registers that are used to pass
> > > arguments to helpers.
> > >
> > >
> > > > Thanks
> > > >
> > > > Kevin Sheldrake
> > > >
> > > > PS I will get around to the clang/LLVM jump offset warning soon, I
> > > > promise.
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: bpf-owner@xxxxxxxxxxxxxxx <bpf-owner@xxxxxxxxxxxxxxx> On
> > > > > Behalf Of Kevin Sheldrake
> > > > > Sent: 24 July 2020 10:40
> > > > > To: Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx>
> > > > > Cc: bpf@xxxxxxxxxxxxxxx
> > > > > Subject: RE: [EXTERNAL] Re: Maximum size of record over perf ring buffer?
> > > > >
> > > > > Hello Andrii
> > > > >
> > > > > Thank you for taking a look at this. While the size is reported
> > > > > correctly to the consumer (bar padding, etc.), the actual offsets
> > > > > between adjacent pointers appear to have either been cast to a
> > > > > u16 or otherwise masked with 0xFFFF, causing what I believe to
> > > > > be overlapping samples and the opportunity for sample corruption
> > > > > in the overlapped regions.
> > > > > Thanks again
> > > > >
> > > > > Kev
> > > > >
> > > > >
> > > > > -----Original Message-----
> > > > > From: Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx>
> > > > > Sent: 23 July 2020 20:05
> > > > > To: Kevin Sheldrake <Kevin.Sheldrake@xxxxxxxxxxxxx>
> > > > > Cc: bpf@xxxxxxxxxxxxxxx
> > > > > Subject: Re: [EXTERNAL] Re: Maximum size of record over perf ring buffer?
> > > > >
> > > > > On Mon, Jul 20, 2020 at 4:39 AM Kevin Sheldrake
> > > > > <Kevin.Sheldrake@xxxxxxxxxxxxx> wrote:
> > > > > >
> > > > > > Hello
> > > > > >
> > > > > > Thank you for your response; I hope you don't mind me
> > > > > > top-posting. I've put together a POC that demonstrates my
> > > > > > results. Edit the size of the data char array in event_defs.h
> > > > > > to change the behaviour.
> > > > > >
> > > > > > https://github.com/microsoft/OMS-Auditd-Plugin/tree/MSTIC-Research/ebpf_perf_output_poc
> > > > >
> > > > > I haven't run your program, but I can certainly reproduce this
> > > > > using bench_perfbuf in selftests. It does seem like something is
> > > > > silently corrupted, because the size reported by perf is correct
> > > > > (plus/minus a few bytes, probably rounding up to 8 bytes), but the
> > > > > contents are not correct. I have no idea why that's happening;
> > > > > maybe someone more familiar with the perf subsystem can take a
> > > > > look.
> > > > > >
> > > > > > Unfortunately, our project aims to run on kernels older than
> > > > > > 5.8, so the BPF ring buffer won't work for us.
> > > > > >
> > > > > > Thanks again
> > > > > >
> > > > > > Kevin Sheldrake
> > > > > >
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: bpf-owner@xxxxxxxxxxxxxxx <bpf-owner@xxxxxxxxxxxxxxx> On
> > > > > > Behalf Of Andrii Nakryiko
> > > > > > Sent: 20 July 2020 05:35
> > > > > > To: Kevin Sheldrake <Kevin.Sheldrake@xxxxxxxxxxxxx>
> > > > > > Cc: bpf@xxxxxxxxxxxxxxx
> > > > > > Subject: [EXTERNAL] Re: Maximum size of record over perf ring buffer?
> > > > > >
> > > > > > On Fri, Jul 17, 2020 at 7:24 AM Kevin Sheldrake
> > > > > > <Kevin.Sheldrake@xxxxxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > Hello
> > > > > > >
> > > > > > > I'm building a tool using eBPF/libbpf/C and I've run into an
> > > > > > > issue that I'd like to ask about. I haven't managed to find
> > > > > > > documentation for the maximum size of a record that can be
> > > > > > > sent over the perf ring buffer, but experimentation (on kernel
> > > > > > > 5.3 (x64) with the latest libbpf from github) suggests it is
> > > > > > > just short of 64KB. Could someone please confirm whether
> > > > > > > that's the case? My experiments suggest that sending a record
> > > > > > > greater than 64KB results in the size reported in the callback
> > > > > > > being correct but the records overlapping, causing corruption
> > > > > > > if they are not serviced as quickly as they arrive. Setting
> > > > > > > the record to exactly 64KB results in no records being
> > > > > > > received at all.
> > > > > > >
> > > > > > > For reference, I'm using perf_buffer__new() and
> > > > > > > perf_buffer__poll() on the userland side, and
> > > > > > > bpf_perf_event_output(ctx, &event_map, BPF_F_CURRENT_CPU,
> > > > > > > event, sizeof(event_s)) on the eBPF side.
> > > > > > >
> > > > > > > Additionally, is there a better architecture for sending
> > > > > > > large volumes of data (>64KB) back from the eBPF program to
> > > > > > > userland, such as a different ring buffer, a map, some kind
> > > > > > > of shared mmapped segment, etc., other than simply
> > > > > > > fragmenting the data? Please excuse my naivety as I'm
> > > > > > > relatively new to the world of eBPF.
> > > > > > >
> > > > > >
> > > > > > I'm not aware of any such limitations for the perf ring buffer,
> > > > > > and I haven't had a chance to validate this. It would be great
> > > > > > if you could provide a small repro so that someone can take a
> > > > > > deeper look; it does sound like a bug if you really get
> > > > > > clobbered data. It might actually be how you set up the perfbuf:
> > > > > > AFAIK, it has a mode in which it will overwrite data if it's not
> > > > > > consumed quickly enough, but you need to consciously enable that
> > > > > > mode.
> > > > > >
> > > > > > But apart from that, shameless plug here: you can try the new
> > > > > > BPF ring buffer ([0]), available in 5.8+ kernels. It will allow
> > > > > > you to avoid the extra copy of data you get with
> > > > > > bpf_perf_event_output(), if you use BPF ringbuf's
> > > > > > bpf_ringbuf_reserve() + bpf_ringbuf_commit() API. It also has a
> > > > > > bpf_ringbuf_output() API, which is logically equivalent to
> > > > > > bpf_perf_event_output(). And it has a very high limit on sample
> > > > > > size, up to 512MB per sample.
> > > > > >
> > > > > > Keep in mind, BPF ringbuf is an MPSC design, and if you use just
> > > > > > one BPF ringbuf across all CPUs, you might run into some
> > > > > > contention across multiple CPUs. It is acceptable in a lot of
> > > > > > the applications I was targeting, but if you have a high
> > > > > > frequency of events (keep in mind, throughput doesn't matter,
> > > > > > only contention on sample reservation matters), you might want
> > > > > > to use an array of BPF ringbufs to scale throughput.
> > > > > > You can do one ringbuf per CPU for ultimate performance at the
> > > > > > expense of memory usage (that's the perf ring buffer setup), but
> > > > > > BPF ringbuf is flexible enough to allow any topology that makes
> > > > > > sense for your use case, from one ringbuf shared across all
> > > > > > CPUs to anything in between.