Re: [PATCH v3 bpf-next 02/18] bpf: Add bpf_arch_text_poke() helper

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Fri, 8 Nov 2019 15:05:25 -0800

On Fri, Nov 08, 2019 at 10:36:24PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 08, 2019 at 11:32:41AM -0800, Alexei Starovoitov wrote:
> > On Fri, Nov 8, 2019 at 5:42 AM Alexei Starovoitov <ast@xxxxxx> wrote:
> > >
> > > On 11/8/19 1:36 AM, Peter Zijlstra wrote:
> > > > On Fri, Nov 08, 2019 at 10:11:56AM +0100, Peter Zijlstra wrote:
> > > >> On Thu, Nov 07, 2019 at 10:40:23PM -0800, Alexei Starovoitov wrote:
> > > >>> Add bpf_arch_text_poke() helper that is used by BPF trampoline logic to patch
> > > >>> nops/calls in kernel text into calls into BPF trampoline and to patch
> > > >>> calls/nops inside BPF programs too.
> > > >>
> > > >> This thing assumes the text is unused, right? That isn't spelled out
> > > >> anywhere. The implementation is very much unsafe vs concurrent execution
> > > >> of the text.
> > > >
> > > > Also, what NOP/CALL instructions will you be hijacking? If you're
> > > > planning on using the fentry nops, then what ensures this and ftrace
> > > > don't trample on one another? Similar for kprobes.
> > > >
> > > > In general, what ensures every instruction only has a single modifier?
> > >
> > > Looks like you didn't bother reading cover letter and missed a month
> 
> I did indeed not. A Changelog should be self sufficient and this one is
> sorely lacking. The cover letter is not preserved and should therefore
> not contain anything of value that is not also covered in the
> Changelogs.
> 
> > > of discussions between me and Steven regarding exactly this topic
> > > though you were directly cc-ed in all threads :(
> 
> I read some of it; it is a sad fact that I cannot read all email in my
> inbox, esp. not if, like in the last week or so, I'm busy hunting a
> regression.
> 
> And what I did remember of the emails I saw left me with the questions
> that were not answered by the changelog.
> 
> > > tldr for kernel fentry nops it will be converted to use
> > > register_ftrace_direct() whenever it's available.
> 
> So why the rush and not wait for that work to complete? It appears to me
> that without due coordination between bpf and ftrace badness could
> happen.
> 
> > > For all other nops, calls, jumps that are inside BPF programs BPF infra
> > > will continue modifying them through this helper.
> > > Daniel's upcoming bpf_tail_call() optimization will use text_poke as well.
> 
> This is probably off topic, but isn't tail-call optimization something
> done at JIT time and therefore not in need of text_poke()?

Not quite. bpf_tail_call() goes through a prog_array, which is an indirect jmp
and therefore suffers from retpoline. The verifier can see that in a lot of
cases the prog_array is used with a constant index instead of a variable. In
such cases the indirect jmp can be replaced with a direct jmp. That is
essentially what Daniel's patches are doing; they build on top of
bpf_arch_text_poke() and the trampoline that I'm introducing in this set.
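
To make the constant-index case concrete, here is a rough userspace model of
the idea. It is only a sketch and not the code from Daniel's patches: the
names prog_fn, prog_array, call_site, call_site_update and the two tail_call_*
functions are all made up for illustration. In plain C the cached pointer is
still an indirect call; in the kernel the update step would instead rewrite
the jmp instruction in the JITed image through bpf_arch_text_poke().

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a BPF program: takes a context, returns a value. */
typedef int (*prog_fn)(int ctx);

#define PROG_ARRAY_SIZE 4

/* Hypothetical model of a prog_array map: slots may be empty (NULL). */
struct prog_array {
	prog_fn progs[PROG_ARRAY_SIZE];
};

int prog_a(int ctx) { return ctx + 1; }
int prog_b(int ctx) { return ctx * 2; }

/*
 * Generic path: the index is only known at run time, so every call has to
 * load the target from the array and jump through it (the indirect jmp that
 * pays the retpoline cost in the kernel).
 */
int tail_call_indirect(const struct prog_array *arr, uint32_t index, int ctx)
{
	if (index >= PROG_ARRAY_SIZE || !arr->progs[index])
		return ctx;	/* empty slot: fall through to the caller */
	return arr->progs[index](ctx);
}

/*
 * Constant-index path: the call site remembers one resolved target.
 * target == NULL models a nop, non-NULL models a patched direct jmp.
 */
struct call_site {
	prog_fn target;
};

/* Models the poke: run when the map slot behind a constant index changes. */
void call_site_update(struct call_site *site, const struct prog_array *arr,
		      uint32_t const_index)
{
	site->target = const_index < PROG_ARRAY_SIZE ?
		       arr->progs[const_index] : NULL;
}

/* Fast path: no bounds check and no array load at call time. */
int tail_call_direct(const struct call_site *site, int ctx)
{
	return site->target ? site->target(ctx) : ctx;
}
```

The point of the split is that the bounds check and the array load move out
of the hot path and into the update step, which only runs when the map slot
changes.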

Another set is being prepared by Bjorn that also builds on top of
bpf_arch_text_poke() and the trampoline. Its purpose is to get rid of the
indirect call when a driver calls into a BPF program for the first time. We've
looked at your static_call and concluded that it doesn't quite cut it for this
use case.

The third framework is being worked on by Martin, who is using the BPF
trampoline for BPF-based TCP extensions. This bit is not related to indirect
call/jmp optimization, but it needs the trampoline.

> > I was thinking more about this.
> > Peter,
> > do you mind we apply your first patch:
> > https://lore.kernel.org/lkml/20191007081944.88332264.2@xxxxxxxxxxxxx/
> > to both tip and bpf-next trees?
> 
> That would indeed be a much better solution. I'll repost much of that on
> Monday, and then we'll work on getting at the very least that one patch
> in a tip/branch we can share.

Awesome! I can certainly wait till next week. I just don't want to miss the
merge window for the work that is ready. More below.

> > Then I can use text_poke_bp() as-is without any additional ugliness
> > on my side that would need to be removed in few weeks.
> 
> This I do _NOT_ understand. Why are you willing to merge a known broken
> patch? What is the rush, why can't you wait for all the prerequisites to
> land?

People have deadlines, and here I'm not talking about fb deadlines. If it were
only up to me I could have waited until your and Steven's patches land in
Linus's tree. Then Dave would pick them up after the merge window into net-next
and the bpf things would be ready for the next release. Which is in 1.5 + 2 + 8
= 11.5 weeks (assuming 1.5 weeks until the merge window, 2 weeks of merge
window, and 8 weeks for the next release cycle).
But most of the bpf things are ready. I have one more follow-up to do for
another feature. The first 4-5 patches of my set will enable Bjorn's, Daniel's,
and Martin's work. So I'm mainly looking for a way to converge three trees
during the merge window with no conflicts.

Just saw that Steven posted his set. That is great. If you land your first part
of text_poke_bp() next week into tip it will enable us to cherry-pick the first
few patches from tip and continue with bpf trampoline in net-next. Then during
the merge window tip, Steven's tree and net-next land in Linus's tree. Then I'll
send a small follow-up to switch to Steven's register_ftrace_direct() in places
that can use it, and the other bits of bpf will keep using your text_poke_bp()
because it's for the code inside generated bpf progs, various generated
trampolines and such. The conversion of some of the bpf bits to
register_ftrace_direct() can be delayed by a release if really necessary, since
the text_poke_bp() approach will work fine, just not as nicely as a full
integration via ftrace.
imo it's the best path for the 3 trees to converge without delaying things for
bpf folks by a full release. In the end the deadlines are met and a bunch of
people are unblocked and happy. I hope that explains the rush.