Re: [PATCH bpf-next] bpf: Add bpf_read_raw_record() helper

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Thu, 25 Aug 2022 16:09:41 -0700

On Thu, Aug 25, 2022 at 3:13 PM Namhyung Kim <namhyung@xxxxxxxxxx> wrote:
>
> Hello,
>
> On Thu, Aug 25, 2022 at 2:34 PM Andrii Nakryiko
> <andrii.nakryiko@xxxxxxxxx> wrote:
> >
> > On Tue, Aug 23, 2022 at 2:04 PM Namhyung Kim <namhyung@xxxxxxxxxx> wrote:
> > >
> > > The helper is for BPF programs attached to perf_event in order to read
> > > event-specific raw data.  I followed the convention of the
> > > bpf_read_branch_records() helper so that it can tell the size of
> > > record using BPF_F_GET_RAW_RECORD flag.
> > >
> > > The use case is to filter perf event samples based on the HW provided
> > > data which have more detailed information about the sample.
> > >
> > > Note that it only reads the first fragment of the raw record.  But it
> > > seems mostly ok since all the existing PMU raw data have only single
> > > fragment and the multi-fragment records are only for BPF output attached
> > > to sockets.  So unless it's used with such an extreme case, it'd work
> > > for most of tracing use cases.
> > >
> > > Signed-off-by: Namhyung Kim <namhyung@xxxxxxxxxx>
> > > ---
> > > I don't know how to test this.  As the raw data is available on some
> > > hardware PMU only (e.g. AMD IBS).  I tried a tracepoint event but it was
> > > rejected by the verifier.  Actually it needs a bpf_perf_event_data
> > > context so that's not an option IIUC.
> > >
> > >  include/uapi/linux/bpf.h | 23 ++++++++++++++++++++++
> > >  kernel/trace/bpf_trace.c | 41 ++++++++++++++++++++++++++++++++++++++++
> > >  2 files changed, 64 insertions(+)
> > >
> > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > index 934a2a8beb87..af7f70564819 100644
> > > --- a/include/uapi/linux/bpf.h
> > > +++ b/include/uapi/linux/bpf.h
> > > @@ -5355,6 +5355,23 @@ union bpf_attr {
> > >   *     Return
> > >   *             Current *ktime*.
> > >   *
> > > + * long bpf_read_raw_record(struct bpf_perf_event_data *ctx, void *buf, u32 size, u64 flags)
> > > + *     Description
> > > + *             For an eBPF program attached to a perf event, retrieve the
> > > + *             raw record associated to *ctx* and store it in the buffer
> > > + *             pointed by *buf* up to size *size* bytes.
> > > + *     Return
> > > + *             On success, number of bytes written to *buf*. On error, a
> > > + *             negative value.
> > > + *
> > > + *             The *flags* can be set to **BPF_F_GET_RAW_RECORD_SIZE** to
> > > + *             instead return the number of bytes required to store the raw
> > > + *             record. If this flag is set, *buf* may be NULL.
> >
> > It looks pretty ugly from a usability standpoint to have one helper
> > doing completely different things and returning two different values
> > based on BPF_F_GET_RAW_RECORD_SIZE.
>
> Agreed.
>
> >
> > I'm not sure what's best, but I have two alternative proposals:
> >
> > 1. Add two helpers: one to get perf record information (and size will
> > be one of them). Something like bpf_perf_record_query(ctx, flags)
> > where you pass perf ctx and what kind of information you want to read
> > (through flags), and u64 return result returns that (see
> > bpf_ringbuf_query() for such approach). And then have separate helper
> > to read data.
>
> I like this as I want to have more info for the perf event sample like
> instruction address or sample type.  I know some of the info is
> available through the context but I think this is a better approach.
>
> >
> > 2. Keep one helper, but specify that it always returns record size,
> > even if user specified smaller size to read. And then allow passing
> > buf==NULL && size==0. So passing NULL, 0 -- you get record size.
> > Passing non-NULL buf -- you read data.
> >
> >
> > And also, "read_raw_record" is way too generic. We have
> > bpf_perf_prog_read_value(), let's use "bpf_perf_read_raw_record()" as
> > a name. We should have called bpf_read_branch_records() as
> > bpf_perf_read_branch_records(), probably, as well. But it's too late.
>
> Yeah, what about this?
>
>  * bpf_perf_event_query(ctx, flag)
>  * bpf_perf_event_get(ctx, flag, buf, size)
>
> Maybe we can use the same flag for both.  Like BPF_PERF_RAW_RECORD
> can return the size (or -1 if not) on _query() and read the data on _get().
> Or we can have a BPF_PERF_RAW_RECORD_SIZE only for _query().
> It seems we don't need _get() for things like BPF_PERF_SAMPLE_IP.
> What do you think?
>

probably separate flags makes more sense, because I can see how we can
allow querying multiple up-to-u64-sized things (even sample_ip, for
example), while reserve bpf_perf_event_get() to variable-size or big
sized values.

Not super keen on "get", "read" or "load" is the verb we typically use
for such operations, but we already have bpf_perf_event_read(). And
bpf_perf_event_load() doesn't seem right either. Naming is hard. Maybe
"fetch"? Or just "get". Don't know, but maybe someone has good ideas.

> Thanks,
> Namhyung