On Wed, Jul 30, 2014 at 9:59 AM, Frederic Weisbecker <fweisbec@xxxxxxxxx> wrote:
> On Tue, Jul 29, 2014 at 04:30:58PM -0700, Andy Lutomirski wrote:
>> On Tue, Jul 29, 2014 at 1:54 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>> > On Jul 29, 2014 12:22 PM, "Oleg Nesterov" <oleg@xxxxxxxxxx> wrote:
>> >>
>> >> Andy, to avoid the confusion: I am not trying to review these
>> >> changes. As you probably know, my understanding of the asm code in
>> >> entry.S is very limited.
>> >>
>> >> Just a couple of questions to ensure I understand this correctly.
>> >>
>> >> On 07/28, Andy Lutomirski wrote:
>> >> >
>> >> > This is both a cleanup and a speedup. It reduces overhead due to
>> >> > installing a trivial seccomp filter by 87%. The speedup comes from
>> >> > avoiding the full syscall tracing mechanism for filters that don't
>> >> > return SECCOMP_RET_TRACE.
>> >>
>> >> And only after looking at 5/5 do I _seem_ to actually understand
>> >> where this speedup comes from.
>> >>
>> >> So. Currently the tracesys: path always leads to "iret" after the
>> >> syscall; with this change we can avoid that if phase_1() returns
>> >> zero, correct?
>> >>
>> >> And this also removes the special TIF_SYSCALL_AUDIT-only case in
>> >> entry.S, cool.
>> >>
>> >> I am wondering if we can do something similar with do_notify_resume()?
>> >>
>> >> Stupid question. To simplify, let's forget that syscall_trace_enter()
>> >> already returns a value. Can't we simplify the asm code if we do not
>> >> export two functions, but instead make syscall_trace_enter() return
>> >> "bool slow_path_is_needed", so that "tracesys:" could do:
>> >>
>> >>	// pseudo code
>> >>
>> >>	tracesys:
>> >>		SAVE_REST
>> >>		FIXUP_TOP_OF_STACK
>> >>
>> >>		call syscall_trace_enter
>> >>
>> >>		if (!slow_path_is_needed) {
>> >>			addq REST_SKIP, %rsp
>> >>			jmp system_call_fastpath
>> >>		}
>> >>
>> >>		...
>> >>
>> >> ?
>> >>
>> >> Once again, I am just curious; it is not that I am actually
>> >> suggesting you consider this option.
>> >
>> > We could, but this would lose a decent amount of the speedup. I could
>> > try it and benchmark it, but I'm guessing that the save and restore
>> > are kind of expensive. This will make audit slower than it currently
>> > is, which may also annoy some people. (Not me.)
>> >
>> > I'm also not convinced that it would be much simpler. My code is
>> > currently:
>> >
>> > tracesys:
>> >	leaq -REST_SKIP(%rsp), %rdi
>> >	movq $AUDIT_ARCH_X86_64, %rsi
>> >	call syscall_trace_enter_phase1
>> >	test %rax, %rax
>> >	jnz tracesys_phase2		/* if needed, run the slow path */
>> >	LOAD_ARGS 0			/* else restore clobbered regs */
>> >	jmp system_call_fastpath	/* and return to the fast path */
>> >
>> > tracesys_phase2:
>> >	SAVE_REST
>> >	FIXUP_TOP_OF_STACK %rdi
>> >	movq %rsp, %rdi
>> >	movq $AUDIT_ARCH_X86_64, %rsi
>> >	movq %rax, %rdx
>> >	call syscall_trace_enter_phase2
>> >
>> >	LOAD_ARGS ARGOFFSET, 1
>> >	RESTORE_REST
>> >
>> >	... slow path here ...
>> >
>> > It would end up looking more like (totally untested):
>> >
>> > tracesys:
>> >	SAVE_REST
>> >	FIXUP_TOP_OF_STACK %rdi
>> >	mov %rsp, %rdi
>> >	movq $AUDIT_ARCH_X86_64, %rsi
>> >	call syscall_trace_enter
>> >	LOAD_ARGS
>> >	RESTORE_REST
>> >	test [whatever condition]
>> >	j[cond] system_call_fastpath
>> >
>> >	... slow path here ...
>> >
>> > So it's a bit simpler. Oddly, the ia32entry code doesn't have this
>> > multiple-syscall-path distinction.
>> >
>> > SAVE_REST is 6 movq instructions and a subq. FIXUP_TOP_OF_STACK is
>> > 7 movqs (and 8 if I ever get my way). RESTORE_TOP_OF_STACK is 4.
>> > RESTORE_REST is 6 movqs and an addq. So we're talking about avoiding
>> > 21 movqs, an addq, and a subq. That may be significant. (And I
>> > suspect that the difference is much larger on platforms like arm64,
>> > but that's a separate issue.)
>>
>> To put some more options on the table: there's an argument to be made
>> that the whole fast-path/slow-path split isn't worth it. We could
>> unconditionally set up a full frame for all syscalls. This means:
>>
>> No FIXUP_TOP_OF_STACK. Instead, the system_call entry sets up RSP,
>> SS, CS, RCX, and EFLAGS right away. That's five stores for all
>> syscalls instead of two loads and five stores for syscalls that need
>> it. But it also gets rid of RESTORE_TOP_OF_STACK completely.
>>
>> No more bugs involving C code that assumes a full stack frame when no
>> such frame exists inside syscall code.
>>
>> We could possibly remove a whole bunch of duplicated code.
>>
>> The upshot would be simpler code, faster slow-path syscalls, and
>> slower fast-path syscalls (but probably not much slower). On the
>> other hand, there's zero chance that this would be ready for 3.17.
>
> Indeed.
>
> If we ever take that direction (i.e. remove the partial-frame
> optimization), there is a common part found in many archs when they
> return from syscalls, exceptions, or interrupts which could be more
> or less consolidated as:
>
> void sysret_check(struct pt_regs *regs)
> {
>	while (test_thread_flag(TIF_ALLWORK_MASK)) {
>		int resched = need_resched();
>
>		local_irq_enable();
>		if (resched)
>			schedule();
>		else
>			do_notify_resume(regs);
>		local_irq_disable();
>	}
> }
>
> But well, probably the syscall fastpath is still worth it.

And yet x86_64 has this code implemented in assembly even in the
slowpath. Go figure.

--Andy
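The thread only shows syscall_trace_enter_phase1/phase2 through the asm
that calls them. As a rough illustration, here is a minimal standalone
C sketch of the two-phase split, assuming hypothetical helper names
(seccomp_filter_wants_trace() and friends are stand-ins, not kernel
APIs) and simplified signatures:

/*
 * Standalone sketch of the two-phase entry logic discussed above.
 * The extern helpers are hypothetical stand-ins for the real per-task
 * checks; only the control flow follows the thread's description.
 */
#include <stdbool.h>

struct pt_regs;				/* opaque register frame */

extern bool seccomp_filter_wants_trace(struct pt_regs *regs);
extern bool ptrace_or_audit_active(void);
extern long run_full_slow_path(struct pt_regs *regs, unsigned long phase1);

/*
 * Phase 1: called from the fast path with only the partial frame
 * saved (hence the "leaq -REST_SKIP(%rsp), %rdi" in the asm above).
 * Returning 0 lets the fast path continue; anything nonzero means a
 * full frame must be built and phase 2 run.  Filters that never
 * return SECCOMP_RET_TRACE are fully resolved here, which is the
 * source of the speedup being discussed.
 */
unsigned long syscall_trace_enter_phase1(struct pt_regs *regs,
					 unsigned int arch)
{
	if (seccomp_filter_wants_trace(regs))
		return 1;		/* SECCOMP_RET_TRACE: slow path */
	if (ptrace_or_audit_active())
		return 1;		/* tracing/audit need a full frame */
	return 0;			/* stay on the fast path */
}

/* Phase 2: called only after SAVE_REST/FIXUP_TOP_OF_STACK ran. */
long syscall_trace_enter_phase2(struct pt_regs *regs, unsigned int arch,
				unsigned long phase1_result)
{
	/* The real design forwards phase-1 state; this sketch just
	 * hands everything to one slow-path helper. */
	return run_full_slow_path(regs, phase1_result);
}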
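Frederic's sysret_check() loop is nearly compilable as written; below
is a self-contained version with mocked helpers, assuming a simplified
boolean work test in place of the real TIF_ALLWORK mask check. The
point of the shape is the ordering: the work flags are re-tested with
interrupts disabled, so a flag set by a late interrupt cannot be
missed on the way back to user mode.

/*
 * Self-contained model of the consolidated exit path sketched above.
 * All extern helpers are stand-ins for the kernel primitives of the
 * same names.
 */
#include <stdbool.h>

struct pt_regs;

extern bool exit_work_pending(void);	/* stand-in for the TIF_ALLWORK test */
extern bool need_resched(void);
extern void schedule(void);
extern void do_notify_resume(struct pt_regs *regs);
extern void local_irq_enable(void);
extern void local_irq_disable(void);

void sysret_check(struct pt_regs *regs)
{
	/*
	 * Loop rather than test once: schedule() and do_notify_resume()
	 * can themselves raise new work.  Each iteration re-checks with
	 * interrupts disabled so nothing set in between is lost.
	 */
	while (exit_work_pending()) {
		bool resched = need_resched();

		local_irq_enable();
		if (resched)
			schedule();
		else
			do_notify_resume(regs);
		local_irq_disable();
	}
}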