Re: [PATCH bpf-next] bpf: Add bpf_read_raw_record() helper

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Thu, 25 Aug 2022 14:33:56 -0700

On Tue, Aug 23, 2022 at 2:04 PM Namhyung Kim <namhyung@xxxxxxxxxx> wrote:
>
> The helper is for BPF programs attached to perf_event in order to read
> event-specific raw data.  I followed the convention of the
> bpf_read_branch_records() helper so that it can tell the size of
> record using BPF_F_GET_RAW_RECORD flag.
>
> The use case is to filter perf event samples based on the HW provided
> data which have more detailed information about the sample.
>
> Note that it only reads the first fragment of the raw record.  But it
> seems mostly ok since all the existing PMU raw data have only single
> fragment and the multi-fragment records are only for BPF output attached
> to sockets.  So unless it's used with such an extreme case, it'd work
> for most of tracing use cases.
>
> Signed-off-by: Namhyung Kim <namhyung@xxxxxxxxxx>
> ---
> I don't know how to test this.  As the raw data is available on some
> hardware PMU only (e.g. AMD IBS).  I tried a tracepoint event but it was
> rejected by the verifier.  Actually it needs a bpf_perf_event_data
> context so that's not an option IIUC.
>
>  include/uapi/linux/bpf.h | 23 ++++++++++++++++++++++
>  kernel/trace/bpf_trace.c | 41 ++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 64 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 934a2a8beb87..af7f70564819 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -5355,6 +5355,23 @@ union bpf_attr {
>   *     Return
>   *             Current *ktime*.
>   *
> + * long bpf_read_raw_record(struct bpf_perf_event_data *ctx, void *buf, u32 size, u64 flags)
> + *     Description
> + *             For an eBPF program attached to a perf event, retrieve the
> + *             raw record associated to *ctx* and store it in the buffer
> + *             pointed by *buf* up to size *size* bytes.
> + *     Return
> + *             On success, number of bytes written to *buf*. On error, a
> + *             negative value.
> + *
> + *             The *flags* can be set to **BPF_F_GET_RAW_RECORD_SIZE** to
> + *             instead return the number of bytes required to store the raw
> + *             record. If this flag is set, *buf* may be NULL.

It looks pretty ugly from a usability standpoint to have one helper
doing completely different things and returning two different values
based on BPF_F_GET_RAW_RECORD_SIZE.

I'm not sure what's best, but I have two alternative proposals:

1. Add two helpers: one to get perf record information (and size will
be one of them). Something like bpf_perf_record_query(ctx, flags)
where you pass perf ctx and what kind of information you want to read
(through flags), and u64 return result returns that (see
bpf_ringbuf_query() for such approach). And then have separate helper
to read data.

2. Keep one helper, but specify that it always returns record size,
even if user specified smaller size to read. And then allow passing
buf==NULL && size==0. So passing NULL, 0 -- you get record size.
Passing non-NULL buf -- you read data.

And also, "read_raw_record" is way too generic. We have
bpf_perf_prog_read_value(), let's use "bpf_perf_read_raw_record()" as
a name. We should have called bpf_read_branch_records() as
bpf_perf_read_branch_records(), probably, as well. But it's too late.

> + *
> + *             **-EINVAL** if arguments invalid or **size** not a multiple
> + *             of **sizeof**\ (u64\ ).
> + *
> + *             **-ENOENT** if the event does not have raw records.
>   */
>  #define __BPF_FUNC_MAPPER(FN)          \
>         FN(unspec),                     \
> @@ -5566,6 +5583,7 @@ union bpf_attr {
>         FN(tcp_raw_check_syncookie_ipv4),       \
>         FN(tcp_raw_check_syncookie_ipv6),       \
>         FN(ktime_get_tai_ns),           \
> +       FN(read_raw_record),            \
>         /* */
>

[...]