On Mon, May 11, 2020 at 10:05:25PM +0200, Daniel Borkmann wrote: > Hey Maciej, > > On 5/11/20 4:39 PM, Maciej Fijalkowski wrote: > > Hi! > > > > Today, BPF x86-64 JIT is preserving all of the callee-saved registers > > for each BPF program being JITed, even when none of the R6-R9 registers > > are used by the BPF program. Furthermore the tail call counter is always > > pushed/popped to/from the stack even when there is no tail call usage in > > BPF program being JITed. Optimization can be introduced that would > > detect the usage of R6-R9 and based on that push/pop to/from the stack > > only what is needed. Same goes for tail call counter. > > > > Results look promising for such instruction reduction. Below are the > > numbers for xdp1 sample on FVL 40G NIC receiving traffic from pktgen: > > > > * With optimization: 22.3 Mpps > > * Without: 19.0 mpps > > > > So it's around 15% of performance improvement. Note that xdp1 is not > > using any of callee saved registers, nor the tail call, hence such > > speed-up. > > > > There is one detail that needs to be handled though. > > > > Currently, x86-64 JIT tail call implementation is skipping the prologue > > of target BPF program that has constant size. With the mentioned > > optimization implemented, each particular BPF program that might be > > inserted onto the prog array map and therefore be the target of tail > > call, could have various prologue size. > > > > Let's have some pseudo-code example: > > > > func1: > > pro > > code > > epi > > > > func2: > > pro > > code' > > epi > > > > func3: > > pro > > code'' > > epi > > > > Today, pro and epi are always the same (9/7) instructions. So a tail > > call from func1 to func2 is just a: > > > > jump func2 + sizeof pro in bytes (PROLOGUE_SIZE) > > > > With the optimization: > > > > func1: > > pro > > code > > epi > > > > func2: > > pro' > > code' > > epi' > > > > func3: > > pro'' > > code'' > > epi'' > > > > For making the tail calls up and running with the mentioned optimization > > in place, x86-64 JIT should emit the pop registers instructions > > that were pushed on prologue before the actual jump. Jump offset should > > skip the instructions that are handling rbp/rsp, not the whole prologue. > > > > A tail call within func1 would then need to be: > > epi -> pop what pro pushed, but no leave/ret instructions > > jump func2 + 16 // first push insn of pro'; if no push, then this would > > // a direct jump to code' > > > > Magic value of 16 comes from count of bytes that represent instructions > > that are skipped: > > 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) > > 55 push %rbp > > 48 89 e5 mov %rsp,%rbp > > 48 81 ec 08 00 00 00 sub $0x8,%rsp > > > > which would in many cases add *more* instructions for tailcalls. If none > > of callee-saved registers are used, then there would be no overhead with > > such optimization in place. > > > > I'm not sure how to measure properly the impact on the BPF programs that > > are utilizing tail calls. Any suggestions? > > Right, so far the numbers above (no callee saved registers, no tail calls) > are really the best case scenario. I think programs not using callee saved > registers are probably very limited in what they do, and tail calls are often > used as well (although good enough for AF_XDP, for example). So I wonder how > far we would regress with callee saved registers and tail calls. For Cilium > right now you can roughly assume a worst case tail call depth of ~6 with static > jumps (that we patch to jmp/nop). Only in one case we have a tail call map > index that is non-static. In terms of registers, assume all of them are used > one way or another. If you could check the impact in such setting, that would > be great. > > > Daniel, Alexei, what is your view on this? > > I think performance wise this would be both pro and con depending how tail > calls are used. One upside however, and I think you didn't mention it here > would be that we don't need to clamp used stack space to 512, so we could > actually track how much stack is used (or if any is used at all) and adapt > it between tail calls? Depending on the numbers, if we'd go that route, it > should rather be generalized and tracked via verifier so all JITs can behave > the same (and these workarounds in verifier lifted). But then 15% performance > improvement as you state above is a lot, probably we might regress at least > as much as well in your benchmark. I wonder whether there should be a knob > for it, though it's mainly implementation detail.. I was thinking about knob too, but users are rarely going to touch it, so if we go for opt-in it will mostly be unused except by few folks. So I think it's better to go with this approach unconditionally, but first I'd like to see the performance numbers in how it regresses the common case. AF_XDP's empty prog that does 'return XDP_PASS' is a rare case and imo not worth optimizing for, but I see a lot of value if this approach allows to lift tail_call vs bpf2bpf restriction. It looks to me that bpf2bpf will be able to work. prog_A will call into prog_B and if that prog does any kind of tail_call that tail_call will eventually finish and the execution will return to prog_A as normal. So I'd like to ask for two things: 1. perf numbers for something like progs/bpf_flow.c before and after 2. removal of bpf2bpf vs tail_call restriction and new selftests to prove that it's working. that will include droping 512 stack clamp.