Re: perf event and data_sz

Yonghong Song <yhs@xxxxxx> · Thu, 27 Aug 2020 09:04:48 -0700

On 8/27/20 3:01 AM, Borna Cafuk wrote:
On Wed, Aug 26, 2020 at 8:45 PM Yonghong Song <yhs@xxxxxx> wrote:

On 8/26/20 7:11 AM, Borna Cafuk wrote:
Hi everyone,

When examining BPF programs that use perf buffers, I have noticed that
the `data_sz` parameter in the sample callback has a different value than
the size passed to `bpf_perf_event_output`.

Raw samples are padded in `perf_prepare_sample` so their size is a multiple of
`sizeof(u64)` (see [0]). The sample includes the size header, a `u32`.
The size stored in `raw->size` is then size of the sample, minus the 4 bytes for
the size header. This size, however, includes both the data and the padding.

What I find confusing is that this size including the padding is eventually
passed as `data_sz` to the callback in the userspace, instead of
the size of the data that was passed as an argument to `bpf_perf_event_output`.

Is this intended behaviour?

  From the kernel source code, yes, this is expected behavior. What you
described below matches what the kernel did. So raw->size = 68 is expected.

I understand why this size that is stored in `raw->size` is needed.
What I don't see is how the value is of any use in the callback.

I have a use-case for getting only the size of the data in the
userspace, could this be done?

In this case, since we know the kernel writes one record at a time,
you check the size, it is 68 more than 62, you just read 62 bytes
as your real data, ignore the rest as the padding. Does this work?

The `data_sz` parameter seems a little pointless, then. What is its purpose
in the callback if it doesn't accurately represent the size of the data?

I agree since this is not exactly what user expected, user space will
need to calculate the "expected" data_sz and based on that to get to
what kind of record it is. Otherwise, the user needs to put some
info in the record itself to differentiate what it is and your 
intermediate layer just pass all data from the kernel to the application.

bcc callback passed the the buffer with raw->size to application.
But applications are expected to know what the record layout is...

I'm afraid that wouldn't work for the use-case, our application should be able
to read the raw data without having to know the record layout. It has to be
generic, we handle interpreting the records elsewhere and at another time.

I am added Peter Zijlstra in the cc list and hope he may have some
insight on how to tackle this.

To demonstrate, I have prepared a minimal example by modifying
BCC's filelife example. It uses a kprobe on vfs_unlink to print some sizes
every time a file is unlinked. The sizes are:
   * the `sizeof(struct event)` measured in the userspace program,
   * the `sizeof(struct event)` measured in the BPF program, and
   * the `data_sz` parameter.

The first two are 62, as expected, but `data_sz` is 68.
The 62 bytes of the struct and the 4 bytes of the sample header make 66 bytes.
This is rounded up to the first multiple of 8, which is 72.
The 4 bytes for the size header are then subtracted,
and 68 is written as the data size.

Any input is much appreciated,

Best regards,
Borna Cafuk

[0] https://github.com/torvalds/linux/blob/6a9dc5fd6170d0a41c8a14eb19e63d94bea5705a/kernel/events/core.c#L7035

example.h
--------------------------------
#ifndef __EXAMPLE_H
#define __EXAMPLE_H

struct __attribute__((__packed__)) event {
      __u16 size;
      char filler[60];
};

#endif /* __EXAMPLE_H */

example.bpf.c
--------------------------------
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include "example.h"

struct {
      __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
      __uint(key_size, sizeof(u32));
      __uint(value_size, sizeof(u32));
} events SEC(".maps");

SEC("kprobe/vfs_unlink")
int BPF_KPROBE(kprobe__vfs_unlink, struct inode *dir, struct dentry *dentry)
{
      struct event event = {};
      event.size = sizeof(struct event);

      bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
                            &event, sizeof(struct event));
      return 0;
}

char LICENSE[] SEC("license") = "GPL";

example.c
--------------------------------
#include <stdio.h>
#include <bpf/libbpf.h>
#include <sys/resource.h>
#include "example.h"
#include "example.skel.h"

#define PERF_BUFFER_PAGES    16
#define PERF_POLL_TIMEOUT_MS    100

void handle_event(void *ctx, int cpu, void *data, __u32 data_sz)
{
      const struct event *e = data;

      printf("Userspace: %u | BPF: %zu | data_sz: %u \n",
             e->size, sizeof(struct event), data_sz);
}

void handle_lost_events(void *ctx, int cpu, __u64 lost_cnt)
{
      fprintf(stderr, "lost %llu events on CPU #%d\n", lost_cnt, cpu);
}

int main(int argc, char **argv)
{
      struct perf_buffer_opts pb_opts;
      struct perf_buffer *pb = NULL;
      struct example_bpf *obj;
      int err;

      struct rlimit rlim_new = {
          .rlim_cur    = RLIM_INFINITY,
          .rlim_max    = RLIM_INFINITY,
      };
      err = setrlimit(RLIMIT_MEMLOCK, &rlim_new);
      if (err) {
          fprintf(stderr, "failed to increase rlimit: %d\n", err);
          return 1;
      }

      obj = example_bpf__open();
      if (!obj) {
          fprintf(stderr, "failed to open and/or load BPF object\n");
          return 1;
      }

      err = example_bpf__load(obj);
      if (err) {
          fprintf(stderr, "failed to load BPF object: %d\n", err);
          goto cleanup;
      }

      err = example_bpf__attach(obj);
      if (err) {
          fprintf(stderr, "failed to attach BPF programs\n");
          goto cleanup;
      }

      pb_opts.sample_cb = handle_event;
      pb_opts.lost_cb = handle_lost_events;
      pb = perf_buffer__new(bpf_map__fd(obj->maps.events), PERF_BUFFER_PAGES,
                            &pb_opts);
      err = libbpf_get_error(pb);
      if (err) {
          pb = NULL;
          fprintf(stderr, "failed to open perf buffer: %d\n", err);
          goto cleanup;
      }

      while ((err = perf_buffer__poll(pb, PERF_POLL_TIMEOUT_MS)) >= 0)
          ;
      fprintf(stderr, "error polling perf buffer: %d\n", err);

      cleanup:
      perf_buffer__free(pb);
      example_bpf__destroy(obj);

      return err != 0;
}