On Thu, Jun 9, 2022 at 3:03 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Thu, Jun 9, 2022 at 11:27 AM Andrii Nakryiko
> <andrii.nakryiko@xxxxxxxxx> wrote:
> >
> > On Tue, Jun 7, 2022 at 9:29 PM Alexei Starovoitov
> > <alexei.starovoitov@xxxxxxxxx> wrote:
> > >
> > > On Tue, May 31, 2022 at 4:24 PM Andrii Nakryiko
> > > <andrii.nakryiko@xxxxxxxxx> wrote:
> > > >
> > > > On Wed, May 25, 2022 at 4:40 AM Jiri Olsa <jolsa@xxxxxxxxxx> wrote:
> > > > >
> > > > > hi,
> > > > > Alexei suggested to use prog->active instead global bpf_prog_active
> > > > > for programs attached with kprobe multi [1].
> > > > >
> > > > > AFAICS this will bypass bpf_disable_instrumentation, which seems to be
> > > > > ok for some places like hash map update, but I'm not sure about other
> > > > > places, hence this is RFC post.
> > > > >
> > > > > I'm not sure how are kprobes different to trampolines in this regard,
> > > > > because trampolines use prog->active and it's not a problem.
> > > > >
> > > > > thoughts?
> > > > >
> > > >
> > > > Let's say we have two kernel functions A and B? B can be called from
> > > > BPF program though some BPF helper, ok? Now let's say I have two BPF
> > > > programs kprobeX and kretprobeX, both are attached to A and B. With
> > > > using prog->active instead of per-cpu bpf_prog_active, what would be
> > > > the behavior when A is called somewhere in the kernel.
> > > >
> > > > 1. A is called
> > > > 2. kprobeX is activated for A, calls some helper which eventually calls B
> > > > 3. kprobeX is attempted to be called for B, but is skipped due to prog->active
> > > > 4. B runs
> > > > 5. kretprobeX is activated for B, calls some helper which eventually calls B
> > > > 6. kprobeX is ignored (prog->active > 0)
> > > > 7. B runs
> > > > 8. kretprobeX is ignored (prog->active > 0)
> > > > 9. kretprobeX is activated for A, calls helper which calls B
> > > > 10. kprobeX is activated for B
> > > > 11. kprobeX is ignored (prog->active > 0)
> > >
> > > not correct. kprobeX actually runs.
> > > but the end result is correct.
> >
> > Right, it was a long sequence, but you got the idea :)
> >
> > > > 12. B runs
> > > > 13. kretprobeX is ignored (prog->active > 0)
> > > > 14. B runs
> > > > 15. kretprobeX is ignored (prog->active > 0)
> > > >
> > > > If that's correct, we get:
> > > > 1. kprobeX for A
> > > > 2. kretprobeX for B
> > > > 3. kretprobeX for A
> > > > 4. kprobeX for B
> > >
> > > Here it's correct.
> > >
> > > > It's quite mind-boggling and annoying in practice. I'd very much
> > > > prefer just kprobeX for A followed by kretprobeX for A. That's it.
> > > >
> > > > I'm trying to protect against this in retsnoop with custom per-cpu
> > > > logic in each program, but I so much more prefer bpf_prog_active,
> > > > which basically says "no nested kprobe calls while kprobe program is
> > > > running", which makes a lot of sense in practice.
> > >
> > > It makes sense for retsnoop, but does not make sense in general.
> > >
> > > > Given kprobe already used global bpf_prog_active I'd say multi-kprobe
> > > > should stick to bpf_prog_active as well.
> > >
> > > I strongly disagree.
> > > Both multi kprobe and kprobe should move to per prog counter
> > > plus some other protection
> > > (we cannot just move to per-prog due to syscalls).
> > > It's true that the above order is mind-boggling,
> > > but it's much better than
> > > missing kprobe invocation completely just because
> > > another kprobe is running on the same cpu.
> > > People complained numerous times about this kprobe behavior.
> > > kprobeX attached to A
> > > kprobeY attached to B.
> > > If kprobeX calls B kprobeY is not going to be called.
> > > Means that anything that bpf is using is lost.
> > > spin locks, lists, rcu, etc.
> > > Sleepable uprobes are coming.
> > > iirc Delyan's patch correctly.
> > > We will do migrate_disable and inc bpf_prog_active.
> >
> > This might be a different issue, I'm not sure why uprobes should be
> > protected by the same global bpf_prog_active, you can't trigger uprobe
> > from uprobe program. And especially for sleepable programs it makes no
> > sense to use per-CPU protection (we have bpf_run_ctx for such
> > protections, if needed).
>
> you're forgetting about tracepoints and perf_events.
> 'perf record' will capture everything.
> Whereas bpf powered 'perf record' will not see bpf progs
> attached to [ku]probe, tp, events.

I agree that using *the same bundled together* bpf_prog_active for
kprobes, uprobes, tracepoints and perf_events doesn't make sense. Can we
think about a bit more nuanced approach here? E.g., for perf_events
reentrancy seems unlikely, and they have quite a different "mode of
operation" than kprobes, so per-program protection makes more sense to me
there. Similarly for uprobes, as I mentioned.

But for kprobes it's very common to use pairs of kprobe and kretprobe
together. And the sequence I mentioned above is already very confusing;
if you think about two independent applications that attach
kprobe+kretprobe pairs to the same subset of functions, their interaction
will be just plain weird and bug-like. Specifically for kprobes,
pretending that a kprobe BPF program doesn't call into any kernel
internals and doesn't trigger any nested kprobes seems like a sane thing
to me.

Sure, from the kernel's POV just doing a per-program (and per-CPU!) check
is simple and "elegant" in terms of implementation, but it's just
shifting the burden to all kprobe users. I'm not sure that's the right
trade-off in this case.

I'm less clear about whether tracepoints should share protection with
kprobes, tbh. On one hand they have the same reentrancy problems (though
much less so), but they are also not used in coupled pairs the way
kprobe+kretprobe typically is, so per-program protection for TP might be
ok.

> > > Now random kprobes on that cpu will be lost.
> >
> > It's not random. The rule is you can't kernel functions and
> > tracepoints triggered from BPF kprobes/tracepoints. This prevents
> > nasty reentrance problems and makes sense.
>
> From the user point of view it makes no sense whatsoever.
> bperf is losing info.
> Try explaining that to users.
>

See above, I agree that a perf_event program should not be ignored if it
happens to NMI-interrupt some kprobe BPF program, but I think that's a
bit of a different problem (just like uprobes).

> > Isn't kernel tracing infra
> > is protecting itself similarly, preventing reentrancy and recursion?
>
> Yes and that's what it should be doing.
> It's not the job of the kernel to implement the policy.
> "run one bpf prog per cpu at a time" is a policy
> and very much a broken one.

Plain "one bpf prog per cpu" is bad policy, yes, but an all-or-nothing
policy for kprobes/kretprobes attached to the same kernel function seems
to make a lot of sense to me.
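To make that trade-off concrete, here is a tiny userspace toy model (my
own sketch, not kernel code; the function and program names are
invented). It plays out the "kprobeX attached to A, kprobeY attached to
B" example quoted above under both policies: a single shared counter,
like the per-CPU bpf_prog_active, versus a per-program counter.

/*
 * Toy model only: kprobeX is "attached" to A, kprobeY to B, and
 * kprobeX's body ends up calling B (think: a BPF helper that hits a
 * path instrumented by another kprobe).
 */
#include <stdio.h>

static int cpu_active;            /* models the shared bpf_prog_active */
static int x_active, y_active;    /* models per-program prog->active   */
static int per_prog;              /* which policy the "kernel" applies */

static int enter(int *prog_cnt)
{
	int *cnt = per_prog ? prog_cnt : &cpu_active;

	return (*cnt)++ == 0;     /* 1 -> run the program, 0 -> skip it */
}

static void leave(int *prog_cnt)
{
	int *cnt = per_prog ? prog_cnt : &cpu_active;

	(*cnt)--;
}

static void B(void);

static void kprobeY(void)         /* attached to B */
{
	if (enter(&y_active))
		printf("  kprobeY fired for B\n");
	leave(&y_active);
}

static void kprobeX(void)         /* attached to A */
{
	if (enter(&x_active)) {
		printf("  kprobeX fired for A\n");
		B();              /* helper re-enters instrumented code */
	}
	leave(&x_active);
}

static void B(void) { kprobeY(); /* ... B's body ... */ }
static void A(void) { kprobeX(); /* ... A's body ... */ }

int main(void)
{
	printf("shared counter (current kprobe behavior):\n");
	per_prog = 0;
	A();                      /* kprobeY for B is silently lost */

	printf("per-program counters (prog->active):\n");
	per_prog = 1;
	A();                      /* both programs fire */
	return 0;
}

The dropped kprobeY event in the first run is exactly the complaint
against the current behavior; the price of the second policy is the
confusing kprobe/kretprobe interleaving shown in the numbered sequence
earlier in this thread.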
> > > It's awful. We have to fix it.
> >
> > You can call it "a fix" if you'd like, but it's changing a very
> > user-visible behavior and guarantees on which users relied for a
> > while. So even if we switch to per-prog protection it will have to be
> > some sort of opt-in (flag, new program type, whatever).
>
> No opt-in allowed for fixes and it's a bug fix.
> No one should rely on broken kernel behavior.
> If retsnoop rely on that it's really retsnoop's fault.

No point in arguing if we can't even agree on whether this is a bug or
not, sorry. Getting a kretprobe invocation out of the blue without the
corresponding kprobe invocation first (both of which were successfully
attached) seems like more of a bug to me. But perhaps that's a matter of
subjective opinion.
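P.S. For anyone who wants to see what the "custom per-cpu logic in each
program" mentioned above ends up looking like, below is a rough sketch
of my own (not retsnoop's actual code; the map name, program names and
the do_unlinkat attach point are just placeholders). It is roughly the
kind of guard a kprobe+kretprobe pair has to carry itself if the kernel
only does per-program accounting:

// SPDX-License-Identifier: GPL-2.0
/* Hand-rolled per-CPU recursion guard shared by a kprobe/kretprobe pair. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u32);
} depth SEC(".maps");

static __always_inline __u32 *cur_depth(void)
{
	__u32 zero = 0;

	return bpf_map_lookup_elem(&depth, &zero);
}

SEC("kprobe/do_unlinkat")
int BPF_KPROBE(entry_prog)
{
	__u32 *d = cur_depth();

	if (!d)
		return 0;
	if ((*d)++ > 0)		/* nested hit on this CPU: skip it so the */
		goto out;	/* entry/exit events stay paired          */

	/* ... actual data collection would go here ... */
out:
	(*d)--;
	return 0;
}

SEC("kretprobe/do_unlinkat")
int BPF_KRETPROBE(exit_prog)
{
	__u32 *d = cur_depth();

	if (!d)
		return 0;
	if ((*d)++ > 0)		/* same guard on the return side */
		goto out;

	/* ... emit the paired entry/exit event here ... */
out:
	(*d)--;
	return 0;
}

char LICENSE[] SEC("license") = "GPL";

The guard keeps entry/exit events sane on a given CPU, but every tool has
to reinvent it, and it cannot coordinate with anyone else's probes. That
is the "shifting the burden to all kprobe users" concern from earlier in
this message.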