Re: [RFC bpf-next] bpf: Use prog->active instead of bpf_prog_active for kprobe_multi

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Sat, 11 Jun 2022 13:53:26 -0700

On Fri, Jun 10, 2022 at 10:58:50AM -0700, Andrii Nakryiko wrote:
> On Thu, Jun 9, 2022 at 3:03 PM Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > On Thu, Jun 9, 2022 at 11:27 AM Andrii Nakryiko
> > <andrii.nakryiko@xxxxxxxxx> wrote:
> > >
> > > On Tue, Jun 7, 2022 at 9:29 PM Alexei Starovoitov
> > > <alexei.starovoitov@xxxxxxxxx> wrote:
> > > >
> > > > On Tue, May 31, 2022 at 4:24 PM Andrii Nakryiko
> > > > <andrii.nakryiko@xxxxxxxxx> wrote:
> > > > >
> > > > > On Wed, May 25, 2022 at 4:40 AM Jiri Olsa <jolsa@xxxxxxxxxx> wrote:
> > > > > >
> > > > > > hi,
> > > > > > Alexei suggested to use prog->active instead global bpf_prog_active
> > > > > > for programs attached with kprobe multi [1].
> > > > > >
> > > > > > AFAICS this will bypass bpf_disable_instrumentation, which seems to be
> > > > > > ok for some places like hash map update, but I'm not sure about other
> > > > > > places, hence this is RFC post.
> > > > > >
> > > > > > I'm not sure how are kprobes different to trampolines in this regard,
> > > > > > because trampolines use prog->active and it's not a problem.
> > > > > >
> > > > > > thoughts?
> > > > > >
> > > > >
> > > > > Let's say we have two kernel functions A and B? B can be called from
> > > > > BPF program though some BPF helper, ok? Now let's say I have two BPF
> > > > > programs kprobeX and kretprobeX, both are attached to A and B. With
> > > > > using prog->active instead of per-cpu bpf_prog_active, what would be
> > > > > the behavior when A is called somewhere in the kernel.
> > > > >
> > > > > 1. A is called
> > > > > 2. kprobeX is activated for A, calls some helper which eventually calls B
> > > > >   3. kprobeX is attempted to be called for B, but is skipped due to prog->active
> > > > >   4. B runs
> > > > >   5. kretprobeX is activated for B, calls some helper which eventually calls B
> > > > >     6. kprobeX is ignored (prog->active > 0)
> > > > >     7. B runs
> > > > >     8. kretprobeX is ignored (prog->active > 0)
> > > > > 9. kretprobeX is activated for A, calls helper which calls B
> > > > >   10. kprobeX is activated for B
> > > > >     11. kprobeX is ignored (prog->active > 0)
> > > >
> > > > not correct. kprobeX actually runs.
> > > > but the end result is correct.
> > > >
> > >
> > > Right, it was a long sequence, but you got the idea :)

The above analysis was actually incorrect.
There are three kprobe flavors: int3, opt, ftrace.
while multi-kprobe is based on fprobe.
kretprobe can be traditional and rethook based.
In all of these mechanisms there is at least ftrace_test_recursion_trylock()
and for kprobes there is kprobe_running (per-cpu current_kprobe) filter
that acts as bpf_prog_active.

So this:
1. kprobeX for A
2. kretprobeX for B
3. kretprobeX for A
4. kprobeX for B
doesn't seem possible.
Unless there is reproducer of above behavior there is no point using above
as a design argument.

> > > > It's awful. We have to fix it.
> > >
> > > You can call it "a fix" if you'd like, but it's changing a very
> > > user-visible behavior and guarantees on which users relied for a
> > > while. So even if we switch to per-prog protection it will have to be
> > > some sort of opt-in (flag, new program type, whatever).
> >
> > No opt-in allowed for fixes and it's a bug fix.
> > No one should rely on broken kernel behavior.
> > If retsnoop rely on that it's really retsnoop's fault.
> 
> No point in arguing if we can't even agree on whether this is a bug or
> not, sorry.
> 
> Getting kretprobe invocation out of the blue without getting
> corresponding kprobe invocation first (both of which were successfully
> attached) seems like more of a bug to me. But perhaps that's a matter
> of subjective opinion.

The issue of kprobe/kretprobe mismatch was known for long time.
First maxactive was an issue. It should be solved by rethook now.
Then kprobe/kretprobe attach is not atomic.
bpf prog attaching kprobe and kretprobe to the same func cannot assume
that they will always pair. bcc scripts had to deal with this.

Say, kprobe/kretprobe will become fentry/fexit like with prog->active only.
If retsnoop wants to do its own per-cpu prog_active counter it will
prevent out-of-order fentry/fexit for the case when the same prog
is attached to before-bpf-func and during-bpf-func. Only retsnoop's progs
will miss during-bpf-func events. Such policy decisions is localized to one tool.
All other users will see the events they care about.
kprobe/kretprobe/fprobe run handlers with preemption disabled which makes
these mechanisms unfriendly to RT. Their design shows that they're not suitable
for always-on running. When bpf+kprobe was introduced 7 years ago it wasn't
meant to be 24-7 either. bpf_prog_active is modeled like current_kprobe.
It was addressing the deadlock issue with spinlocks in maps.
Recursion was not an issue.
Sadly kprobe/kretprobe/fprobe look unfixable in this form. Too much work
needs to be done to enable something like:
user A attaches prog A to func X. X runs, prog A runs with migration disabled.
Preemption. Something else starts on this cpu. Another user B attaching prog B
to func Y should see its prog being executed.
With kprobes it looks impossible. While fentry was designed with this use case
in mind. Note it's not about sleepable progs. Normal bpf progs can be preempted.

Back to Jiri's question whether we can remove bpf_prog_active from
trace_call_bpf.  Yes. We can and we should. It will allow bperf to collect
stack traces that include bpf progs. It's an important fix. Incorrect retsnoop
assumptions about kprobes will not be affected.