Hi Peter, On Mon, Jan 09, 2023 at 01:14:31PM +0100, Peter Zijlstra wrote: > On Thu, Dec 29, 2022 at 12:41:00PM -0800, Namhyung Kim wrote: > > So I like the general idea; I just think it's turned into a bit of a > mess. That is code is already overly branchy which is known to hurt > performance, we should really try and not make it worse than absolutely > needed. Agreed. > > > kernel/events/core.c | 86 ++++++++++++++++++++++++++++++++------------ > > 1 file changed, 63 insertions(+), 23 deletions(-) > > > > diff --git a/kernel/events/core.c b/kernel/events/core.c > > index eacc3702654d..70bff8a04583 100644 > > --- a/kernel/events/core.c > > +++ b/kernel/events/core.c > > @@ -7582,14 +7582,21 @@ void perf_prepare_sample(struct perf_event_header *header, > > filtered_sample_type = sample_type & ~data->sample_flags; > > __perf_event_header__init_id(header, data, event, filtered_sample_type); > > > > - if (sample_type & (PERF_SAMPLE_IP | PERF_SAMPLE_CODE_PAGE_SIZE)) > > - data->ip = perf_instruction_pointer(regs); > > + if (sample_type & (PERF_SAMPLE_IP | PERF_SAMPLE_CODE_PAGE_SIZE)) { > > + /* attr.sample_type may not have PERF_SAMPLE_IP */ > > Right, but that shouldn't matter, IIRC its OK to have more bits set in > data->sample_flags than we have set in attr.sample_type. It just means > we have data available for sample types we're (possibly) not using. > > That is, I think you can simply write this like: > > > + if (!(data->sample_flags & PERF_SAMPLE_IP)) { > > + data->ip = perf_instruction_pointer(regs); > > + data->sample_flags |= PERF_SAMPLE_IP; > > + } > > + } > > if (filtered_sample_type & (PERF_SAMPLE_IP | PERF_SAMPLE_CODE_PAGE_SIZE)) { > data->ip = perf_instruction_pointer(regs); > data->sample_flags |= PERF_SAMPLE_IP); > } > > ... > > if (filtered_sample_type & PERF_SAMPLE_CODE_PAGE_SIZE) { > data->code_page_size = perf_get_page_size(data->ip); > data->sample_flags |= PERF_SAMPLE_CODE_PAGE_SIZE; > } > > Then after a single perf_prepare_sample() run we have: > > pre | post > ---------------------------------------- > 0 | 0 > IP | IP > CODE_PAGE_SIZE | IP|CODE_PAGE_SIZE > IP|CODE_PAGE_SIZE | IP|CODE_PAGE_SIZE > > So while data->sample_flags will have an extra bit set in the 3rd case, > that will not affect perf_sample_outout() which only looks at data->type > (== attr.sample_type). > > And since data->sample_flags will have both bits set, a second run will > filter out both and avoid the extra work (except doing that will mess up > the branch predictors). Yeah, it'd be better to check filtered_sample_type in the first place. Btw, I was thinking about a hypothetical scenario that IP set by a PMU driver not from the regs. In this case, having CODE_PAGE_SIZE will overwrite the IP. I don't think we need to worry about that for now since PMU drivers updates the regs (using set_linear_ip). But it seems like a possible scenario for something like PEBS or IBS. > > > > if (sample_type & PERF_SAMPLE_CALLCHAIN) { > > int size = 1; > > > > - if (filtered_sample_type & PERF_SAMPLE_CALLCHAIN) > > + if (filtered_sample_type & PERF_SAMPLE_CALLCHAIN) { > > data->callchain = perf_callchain(event, regs); > > + data->sample_flags |= PERF_SAMPLE_CALLCHAIN; > > + } > > > > size += data->callchain->nr; > > > > This, why can't this be: > > if (filtered_sample_type & PERF_SAMPLE_CALLCHAIN) { > data->callchain = perf_callchain(event, regs); > data->sample_flags |= PERF_SAMPLE_CALLCHAIN; > > header->size += (1 + data->callchain->nr) * sizeof(u64); > } > > I suppose this is because perf_event_header lives on the stack of the > overflow handler and all that isn't available / relevant for the BPF > thing. Right, it needs to calculate the data size for each sample data. > > And we can't pull that out into anther function without adding yet > another branch fest. > > However; inspired by your next patch; we can do something like so: > > if (filtered_sample_type & PERF_SAMPLE_CALLCHAIN) { > data->callchain = perf_callchain(event, regs); > data->sample_flags |= PERF_SAMPLE_CALLCHAIN; > > data->size += (1 + data->callchain->nr) * sizeof(u64); > } This is fine as long as all other places (like in PMU drivers) set the callchain update the sample data size accordingly. If not, we can get the callchain but the data size will be wrong. > > And then have __perf_event_output() (or something thereabout) do: > > perf_prepare_sample(data, event, regs); > perf_prepare_header(&header, data, event); > err = output_begin(&handle, data, event, header.size); > if (err) > goto exit; > perf_output_sample(&handle, &header, data, event); > perf_output_end(&handle); > > With perf_prepare_header() being something like: > > header->type = PERF_RECORD_SAMPLE; > header->size = sizeof(*header) + event->header_size + data->size; > header->misc = perf_misc_flags(regs); > ... > > Hmm ? > > (same for all the other sites) Looks good. But I'm confused by the tip-bot2 messages saying it's merged. Do you want me to work on it as a follow up? Thanks, Namhyung