On 8/31/21 2:24 PM, Song Liu wrote:
On Aug 31, 2021, at 9:41 AM, Song Liu <songliubraving@xxxxxx> wrote:
On Aug 31, 2021, at 8:32 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
On Mon, Aug 30, 2021 at 02:41:05PM -0700, Song Liu wrote:
@@ -564,6 +565,18 @@ static void notrace inc_misses_counter(struct bpf_prog *prog)
u64 notrace __bpf_prog_enter(struct bpf_prog *prog)
__acquires(RCU)
{
preempt_disable_notrace();
+#ifdef CONFIG_PERF_EVENTS
+ /* Calling migrate_disable costs two entries in the LBR. To save
+ * some entries, we call perf_snapshot_branch_stack before
+ * migrate_disable to save some entries. This is OK because we
+ * care about the branch trace before entering the BPF program.
+ * If migrate happens exactly here, there isn't much we can do to
+ * preserve the data.
+ */
+ if (prog->call_get_branch)
+ static_call(perf_snapshot_branch_stack)(
+ this_cpu_ptr(&bpf_perf_branch_snapshot));
Here the comment is accurate, but if you recall the calling context
requirements of perf_snapshot_branch_stack from the last patch, you'll
see it requires you have at the very least preemption disabled, which
you just violated.
I think you'll find that (on x86 at least) the suggested
preempt_disable_notrace() incurs no additional branches.
Still, there is the next point to consider...
+#endif
rcu_read_lock();
migrate_disable();
preempt_enable_notrace();
Do we want preempt_enable_notrace() after migrate_disable()? It feels a
little weird to me.
if (unlikely(__this_cpu_inc_return(*(prog->active)) != 1)) {
@@ -1863,9 +1892,23 @@ void bpf_put_raw_tracepoint(struct bpf_raw_event_map *btp)
preempt_enable();
}
+DEFINE_PER_CPU(struct perf_branch_snapshot, bpf_perf_branch_snapshot);
+
static __always_inline
void __bpf_trace_run(struct bpf_prog *prog, u64 *args)
{
+#ifdef CONFIG_PERF_EVENTS
+ /* Calling migrate_disable costs two entries in the LBR. To save
+ * some entries, we call perf_snapshot_branch_stack before
+ * migrate_disable to save some entries. This is OK because we
+ * care about the branch trace before entering the BPF program.
+ * If migrate happens exactly here, there isn't much we can do to
+ * preserve the data.
+ */
+ if (prog->call_get_branch)
+ static_call(perf_snapshot_branch_stack)(
+ this_cpu_ptr(&bpf_perf_branch_snapshot));
+#endif
cant_sleep();
In the face of ^^^^^^ the comment makes no sense. Still, what are the
nesting rules for __bpf_trace_run() and __bpf_prog_enter() ? I'm
thinking the trace one can nest inside an occurence of prog, at which
point you have pieces.
I think broken LBR records is something we cannot really avoid in case
of nesting. OTOH, these should be rare cases and will not hurt the results
in most the use cases.
I should probably tighten the rules in verifier to only apply it for
__bpf_prog_enter (where we have the primary use case). We can enable it
for other program types when there are other use cases.
Update about some offline discussion with Alexei and Andrii. We are planning
to move static_call(perf_snapshot_branch_stack) to inside the helper
bpf_get_branch_snapshot. This change has a few benefit:
1. No need for extra check (prog->call_get_branch) before every program (even
when the program doesn't use the helper).
2. No need to duplicate code of different BPF program hook.
3. BPF program always run with migrate_disable(), so it is not necessary to
run add extra preempt_disable_notrace.
It does flushes a few more LBR entries. But the result seems ok:
ID: 0 from intel_pmu_lbr_disable_all+58 to intel_pmu_lbr_disable_all+93
ID: 1 from intel_pmu_lbr_disable_all+54 to intel_pmu_lbr_disable_all+58
ID: 2 from intel_pmu_snapshot_branch_stack+88 to intel_pmu_lbr_disable_all+0
ID: 3 from bpf_get_branch_snapshot+28 to intel_pmu_snapshot_branch_stack+0
ID: 4 from <bpf_tramepoline> to bpf_get_branch_snapshot+0
ID: 5 from <bpf_tramepoline> to <bpf_tramepoline>
ID: 6 from __bpf_prog_enter+34 to <bpf_tramepoline>
ID: 7 from migrate_disable+60 to __bpf_prog_enter+9
ID: 8 from __bpf_prog_enter+4 to migrate_disable+0
If we make migrate_disable 'static inline' it will save these 2 entries.
It's probably worth doing regardless, since it will be immediate
performance benefit for all bpf programs.
ID: 9 from __bpf_prog_enter+4 to __bpf_prog_enter+0
ID: 10 from bpf_fexit_loop_test1+22 to __bpf_prog_enter+0
ID: 11 from bpf_fexit_loop_test1+20 to bpf_fexit_loop_test1+13
ID: 12 from bpf_fexit_loop_test1+20 to bpf_fexit_loop_test1+13
ID: 13 from bpf_fexit_loop_test1+20 to bpf_fexit_loop_test1+13
ID: 14 from bpf_fexit_loop_test1+20 to bpf_fexit_loop_test1+13
ID: 15 from bpf_fexit_loop_test1+20 to bpf_fexit_loop_test1+13
We can save more by inlining intel_pmu_lbr_disable_all(). But it is probably
not necessary at the moment.
Thanks,
Song