On Fri, Feb 10, 2017 at 12:00:43PM -0500, Waiman Long wrote:

> >> +asm(
> >> +".pushsection .text;"
> >> +".global __raw_callee_save___kvm_vcpu_is_preempted;"
> >> +".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
> >> +"__raw_callee_save___kvm_vcpu_is_preempted:"
> >> +FRAME_BEGIN
> >> +"push %rdi;"
> >> +"push %rdx;"
> >> +"movslq %edi, %rdi;"
> >> +"movq $steal_time+16, %rax;"
> >> +"movq __per_cpu_offset(,%rdi,8), %rdx;"
> >> +"cmpb $0, (%rdx,%rax);"

Could we not put the $steal_time+16 displacement as an immediate in the
cmpb and save a whole register here?

That way we'd end up with something like:

asm("push %%rdi;"
    "movslq %%edi, %%rdi;"
    "movq __per_cpu_offset(,%%rdi,8), %%rax;"
    "cmpb $0, %c[offset](%%rax);"
    "setne %%al;"
    "pop %%rdi;"
    : : [offset] "i" (((unsigned long)&steal_time) +
                      offsetof(struct kvm_steal_time, preempted)));

And if we could get rid of the sign extend on edi we could avoid all the
push-pop nonsense, but I'm not sure I see how to do that (then again,
this asm foo isn't my strongest point).

> >> +"setne %al;"
> >> +"pop %rdx;"
> >> +"pop %rdi;"
> >> +FRAME_END
> >> +"ret;"
> >> +".popsection");
> >> +
> >> +#endif
> >> +
> >>  /*
> >>   * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
> >>   */

> > That should work for now. I have done something similar for
> > __pv_queued_spin_unlock. However, this has the problem of creating a
> > dependency on the exact layout of the steal_time structure. Maybe the
> > constant 16 can be passed in as a parameter offsetof(struct
> > kvm_steal_time, preempted) to the asm call.

Yeah, it should be entirely possible to pass that in. But ideally we'd
have GCC grow something like __attribute__((callee_saved)) or somesuch
and it would do all this for us.

> One more thing: that will improve KVM performance, but it won't help Xen.

People still use Xen? ;-) In any case, their implementation looks very
similar and could easily crib this.

> I looked into the assembly code for rwsem_spin_on_owner; it needs to
> save and restore 2 additional registers with my patch. Doing it your
> way will transfer the save and restore overhead to the assembly code.
> However, __kvm_vcpu_is_preempted() is called multiple times per
> invocation of rwsem_spin_on_owner. That function is simple enough that
> making __kvm_vcpu_is_preempted() callee-save won't produce much
> compiler optimization opportunity.

This is because of that noinline, right? Otherwise it would've been
folded and register pressure would be much higher.

> The outer function rwsem_down_write_failed() does appear to be a bit
> bigger (from 866 bytes to 884 bytes) though.

I suspect GCC is being clever and since all this is static it plays
games with the calling convention and pushes these clobbers out.
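
For concreteness, here is the displacement trick reduced to a
standalone userspace toy (x86-64, GCC). All the _sketch names, the
struct layout and the fake __per_cpu_offset table are invented for
illustration, with preempted parked at offset 16 to mirror the
$steal_time+16 above; only the addressing pattern is the point:

#include <stdio.h>
#include <stddef.h>

/* Invented layout; stands in for struct kvm_steal_time. */
struct steal_time_sketch {
	unsigned long steal;		/* offset 0  */
	unsigned int version;		/* offset 8  */
	unsigned int flags;		/* offset 12 */
	unsigned char preempted;	/* offset 16 */
};

#define NR_CPUS_SKETCH 4

/* One copy per "cpu"; stands in for the per-cpu steal_time variable. */
static struct steal_time_sketch steal_time_sketch[NR_CPUS_SKETCH];

/* Stands in for __per_cpu_offset: byte delta to each cpu's copy. */
static unsigned long per_cpu_offset_sketch[NR_CPUS_SKETCH] = {
	0 * sizeof(struct steal_time_sketch),
	1 * sizeof(struct steal_time_sketch),
	2 * sizeof(struct steal_time_sketch),
	3 * sizeof(struct steal_time_sketch),
};

static int vcpu_is_preempted_sketch(int cpu)
{
	int ret;

	/* The field offset rides along as an immediate displacement:
	 * %c prints the "i" operand without the leading '$', so it can
	 * sit in the addressing mode and no register is burned on it. */
	asm("movq (%[pcpu],%[cpu],8), %%rax;"
	    "cmpb $0, %c[offset](%[st],%%rax);"
	    "setne %%al;"
	    "movzbl %%al, %[ret];"
	    : [ret] "=r" (ret)
	    : [cpu] "r" ((long)cpu),
	      [pcpu] "r" (per_cpu_offset_sketch),
	      [st] "r" (steal_time_sketch),
	      [offset] "i" (offsetof(struct steal_time_sketch, preempted))
	    : "rax", "cc", "memory");
	return ret;
}

int main(void)
{
	steal_time_sketch[2].preempted = 1;
	printf("cpu0=%d cpu2=%d\n",
	       vcpu_is_preempted_sketch(0),
	       vcpu_is_preempted_sketch(2));
	return 0;
}

Note the (long)cpu cast: in this toy the sign extend happens in C,
outside the asm. The wrapper in the patch doesn't have that luxury; it
is handed a bare %edi and has to widen it itself, hence the movslq and
the push/pop around it.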
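
On the __attribute__((callee_saved)) wish: Clang, for what it's worth,
has something adjacent in its preserve_most calling convention, which
keeps nearly all registers callee-saved. A hedged aside, not something
this patch can rely on; a hypothetical use would look like:

/* Hypothetical and Clang-only; not what the kernel does here. */
__attribute__((preserve_most))
bool my_vcpu_is_preempted(long cpu);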