Re: [PATCH bpf-next] bpf: Add bpf_read_raw_record() helper

Namhyung Kim <namhyung@xxxxxxxxxx> · Thu, 25 Aug 2022 15:13:28 -0700

Hello,

On Thu, Aug 25, 2022 at 2:34 PM Andrii Nakryiko
<andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Tue, Aug 23, 2022 at 2:04 PM Namhyung Kim <namhyung@xxxxxxxxxx> wrote:
> >
> > The helper is for BPF programs attached to perf_event in order to read
> > event-specific raw data.  I followed the convention of the
> > bpf_read_branch_records() helper so that it can tell the size of
> > record using BPF_F_GET_RAW_RECORD flag.
> >
> > The use case is to filter perf event samples based on the HW provided
> > data which have more detailed information about the sample.
> >
> > Note that it only reads the first fragment of the raw record.  But it
> > seems mostly ok since all the existing PMU raw data have only single
> > fragment and the multi-fragment records are only for BPF output attached
> > to sockets.  So unless it's used with such an extreme case, it'd work
> > for most of tracing use cases.
> >
> > Signed-off-by: Namhyung Kim <namhyung@xxxxxxxxxx>
> > ---
> > I don't know how to test this.  As the raw data is available on some
> > hardware PMU only (e.g. AMD IBS).  I tried a tracepoint event but it was
> > rejected by the verifier.  Actually it needs a bpf_perf_event_data
> > context so that's not an option IIUC.
> >
> >  include/uapi/linux/bpf.h | 23 ++++++++++++++++++++++
> >  kernel/trace/bpf_trace.c | 41 ++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 64 insertions(+)
> >
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 934a2a8beb87..af7f70564819 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -5355,6 +5355,23 @@ union bpf_attr {
> >   *     Return
> >   *             Current *ktime*.
> >   *
> > + * long bpf_read_raw_record(struct bpf_perf_event_data *ctx, void *buf, u32 size, u64 flags)
> > + *     Description
> > + *             For an eBPF program attached to a perf event, retrieve the
> > + *             raw record associated to *ctx* and store it in the buffer
> > + *             pointed by *buf* up to size *size* bytes.
> > + *     Return
> > + *             On success, number of bytes written to *buf*. On error, a
> > + *             negative value.
> > + *
> > + *             The *flags* can be set to **BPF_F_GET_RAW_RECORD_SIZE** to
> > + *             instead return the number of bytes required to store the raw
> > + *             record. If this flag is set, *buf* may be NULL.
>
> It looks pretty ugly from a usability standpoint to have one helper
> doing completely different things and returning two different values
> based on BPF_F_GET_RAW_RECORD_SIZE.

Agreed.

>
> I'm not sure what's best, but I have two alternative proposals:
>
> 1. Add two helpers: one to get perf record information (and size will
> be one of them). Something like bpf_perf_record_query(ctx, flags)
> where you pass perf ctx and what kind of information you want to read
> (through flags), and u64 return result returns that (see
> bpf_ringbuf_query() for such approach). And then have separate helper
> to read data.

I like this as I want to have more info for the perf event sample like
instruction address or sample type.  I know some of the info is
available through the context but I think this is a better approach.

>
> 2. Keep one helper, but specify that it always returns record size,
> even if user specified smaller size to read. And then allow passing
> buf==NULL && size==0. So passing NULL, 0 -- you get record size.
> Passing non-NULL buf -- you read data.
>
>
> And also, "read_raw_record" is way too generic. We have
> bpf_perf_prog_read_value(), let's use "bpf_perf_read_raw_record()" as
> a name. We should have called bpf_read_branch_records() as
> bpf_perf_read_branch_records(), probably, as well. But it's too late.

Yeah, what about this?

 * bpf_perf_event_query(ctx, flag)
 * bpf_perf_event_get(ctx, flag, buf, size)

Maybe we can use the same flag for both.  Like BPF_PERF_RAW_RECORD
can return the size (or -1 if not) on _query() and read the data on _get().
Or we can have a BPF_PERF_RAW_RECORD_SIZE only for _query().
It seems we don't need _get() for things like BPF_PERF_SAMPLE_IP.
What do you think?

Thanks,
Namhyung