On Thu, Aug 27, 2020 at 3:02 AM Borna Cafuk <borna.cafuk@xxxxxxxxxx> wrote: > > On Wed, Aug 26, 2020 at 8:45 PM Yonghong Song <yhs@xxxxxx> wrote: > > > > On 8/26/20 7:11 AM, Borna Cafuk wrote: > > > Hi everyone, > > > > > > When examining BPF programs that use perf buffers, I have noticed that > > > the `data_sz` parameter in the sample callback has a different value than > > > the size passed to `bpf_perf_event_output`. > > > > > > Raw samples are padded in `perf_prepare_sample` so their size is a multiple of > > > `sizeof(u64)` (see [0]). The sample includes the size header, a `u32`. > > > The size stored in `raw->size` is then size of the sample, minus the 4 bytes for > > > the size header. This size, however, includes both the data and the padding. > > > > > > What I find confusing is that this size including the padding is eventually > > > passed as `data_sz` to the callback in the userspace, instead of > > > the size of the data that was passed as an argument to `bpf_perf_event_output`. > > > > > > Is this intended behaviour? > > > > From the kernel source code, yes, this is expected behavior. What you > > described below matches what the kernel did. So raw->size = 68 is expected. > > I understand why this size that is stored in `raw->size` is needed. > What I don't see is how the value is of any use in the callback. > > > > > > > > > I have a use-case for getting only the size of the data in the > > > userspace, could this be done? > > > > In this case, since we know the kernel writes one record at a time, > > you check the size, it is 68 more than 62, you just read 62 bytes > > as your real data, ignore the rest as the padding. Does this work? > > The `data_sz` parameter seems a little pointless, then. What is its purpose > in the callback if it doesn't accurately represent the size of the data? > > > > > bcc callback passed the the buffer with raw->size to application. > > But applications are expected to know what the record layout is... > > I'm afraid that wouldn't work for the use-case, our application should be able > to read the raw data without having to know the record layout. It has to be > generic, we handle interpreting the records elsewhere and at another time. I agree it's confusing and less useful (not exactly useless, though), but seems like PERF_SAMPLE_RAW doesn't store neither original size nor (equivalently) the added padding size, so libbpf has nothing to go off of, unfortunately. I can only offer two suggestions: 1. Make sure samples you are emitting are 8 byte aligned, in which case data_sz will always match actual data size. If you need different sizes for different structs, you'd have to artificially add extra bytes, though. 2. Consider using BPF ringbuf, it's data_sz is exact and doesn't include padding (you need 5.8+ kernel). > > > > > > > > > To demonstrate, I have prepared a minimal example by modifying > > > BCC's filelife example. It uses a kprobe on vfs_unlink to print some sizes > > > every time a file is unlinked. The sizes are: > > > * the `sizeof(struct event)` measured in the userspace program, > > > * the `sizeof(struct event)` measured in the BPF program, and > > > * the `data_sz` parameter. > > > > > > The first two are 62, as expected, but `data_sz` is 68. > > > The 62 bytes of the struct and the 4 bytes of the sample header make 66 bytes. > > > This is rounded up to the first multiple of 8, which is 72. > > > The 4 bytes for the size header are then subtracted, > > > and 68 is written as the data size. > > > > > > Any input is much appreciated, > > > > > > Best regards, > > > Borna Cafuk > > > > > > > > > [0] https://github.com/torvalds/linux/blob/6a9dc5fd6170d0a41c8a14eb19e63d94bea5705a/kernel/events/core.c#L7035 > > > > > > > > > example.h [...]