> On Aug 25, 2022, at 4:03 PM, Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote: > > On Thu, Aug 25, 2022 at 3:08 PM Song Liu <songliubraving@xxxxxx> wrote: >> >> >> >>> On Aug 25, 2022, at 2:33 PM, Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote: >>> >>> On Tue, Aug 23, 2022 at 2:04 PM Namhyung Kim <namhyung@xxxxxxxxxx> wrote: >>>> >>>> The helper is for BPF programs attached to perf_event in order to read >>>> event-specific raw data. I followed the convention of the >>>> bpf_read_branch_records() helper so that it can tell the size of >>>> record using BPF_F_GET_RAW_RECORD flag. >>>> >>>> The use case is to filter perf event samples based on the HW provided >>>> data which have more detailed information about the sample. >>>> >>>> Note that it only reads the first fragment of the raw record. But it >>>> seems mostly ok since all the existing PMU raw data have only single >>>> fragment and the multi-fragment records are only for BPF output attached >>>> to sockets. So unless it's used with such an extreme case, it'd work >>>> for most of tracing use cases. >>>> >>>> Signed-off-by: Namhyung Kim <namhyung@xxxxxxxxxx> >>>> --- >>>> I don't know how to test this. As the raw data is available on some >>>> hardware PMU only (e.g. AMD IBS). I tried a tracepoint event but it was >>>> rejected by the verifier. Actually it needs a bpf_perf_event_data >>>> context so that's not an option IIUC. >>>> >>>> include/uapi/linux/bpf.h | 23 ++++++++++++++++++++++ >>>> kernel/trace/bpf_trace.c | 41 ++++++++++++++++++++++++++++++++++++++++ >>>> 2 files changed, 64 insertions(+) >>>> >>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h >>>> index 934a2a8beb87..af7f70564819 100644 >>>> --- a/include/uapi/linux/bpf.h >>>> +++ b/include/uapi/linux/bpf.h >>>> @@ -5355,6 +5355,23 @@ union bpf_attr { >>>> * Return >>>> * Current *ktime*. >>>> * >>>> + * long bpf_read_raw_record(struct bpf_perf_event_data *ctx, void *buf, u32 size, u64 flags) >>>> + * Description >>>> + * For an eBPF program attached to a perf event, retrieve the >>>> + * raw record associated to *ctx* and store it in the buffer >>>> + * pointed by *buf* up to size *size* bytes. >>>> + * Return >>>> + * On success, number of bytes written to *buf*. On error, a >>>> + * negative value. >>>> + * >>>> + * The *flags* can be set to **BPF_F_GET_RAW_RECORD_SIZE** to >>>> + * instead return the number of bytes required to store the raw >>>> + * record. If this flag is set, *buf* may be NULL. >>> >>> It looks pretty ugly from a usability standpoint to have one helper >>> doing completely different things and returning two different values >>> based on BPF_F_GET_RAW_RECORD_SIZE. >> >> Yeah, I had the same thought when I first looked at it. But that's the >> exact syntax with bpf_read_branch_records(). Well, we still have time >> to fix the new helper.. >> >>> >>> I'm not sure what's best, but I have two alternative proposals: >>> >>> 1. Add two helpers: one to get perf record information (and size will >>> be one of them). Something like bpf_perf_record_query(ctx, flags) >>> where you pass perf ctx and what kind of information you want to read >>> (through flags), and u64 return result returns that (see >>> bpf_ringbuf_query() for such approach). And then have separate helper >>> to read data. >>> >>> 2. Keep one helper, but specify that it always returns record size, >>> even if user specified smaller size to read. And then allow passing >>> buf==NULL && size==0. So passing NULL, 0 -- you get record size. >>> Passing non-NULL buf -- you read data. >> >> AFAICT, this is also confusing. >> > > this is analogous to snprintf() behavior, so not that new and > surprising when you think about it. But if query + read makes more > sense, then it's fine by me Given the name discussion (the other email), I now like one API better. Actually, since we are on this, can we make it more generic, and handle all possible PERF_SAMPLE_* (in enum perf_event_sample_format)? Something like: long bpf_perf_event_read_sample(void *ctx, void *buf, u64 size, u64 flags); WDYT Namhyung? Another idea is to add another parameter, so that we can pick which PERF_SAMPLE_* to output via bpf_perf_event_read_sample(). I think this will cover all cases with sample perf_event. Thoughts? Thanks, Song > >> Maybe we should use two kfuncs for this? >> >> Thanks, >> Song >> >>> >>> >>> And also, "read_raw_record" is way too generic. We have >>> bpf_perf_prog_read_value(), let's use "bpf_perf_read_raw_record()" as >>> a name. We should have called bpf_read_branch_records() as >>> bpf_perf_read_branch_records(), probably, as well. But it's too late. >>> >>>> + * >>>> + * **-EINVAL** if arguments invalid or **size** not a multiple >>>> + * of **sizeof**\ (u64\ ). >>>> + * >>>> + * **-ENOENT** if the event does not have raw records. >>>> */ >>>> #define __BPF_FUNC_MAPPER(FN) \ >>>> FN(unspec), \ >>>> @@ -5566,6 +5583,7 @@ union bpf_attr { >>>> FN(tcp_raw_check_syncookie_ipv4), \ >>>> FN(tcp_raw_check_syncookie_ipv6), \ >>>> FN(ktime_get_tai_ns), \ >>>> + FN(read_raw_record), \ >>>> /* */ >>>> >>> >>> [...] >>