On Thu, 2014-06-12 at 18:04 +0200, Alexander Graf wrote: > On 06/02/2014 05:50 PM, Mihai Caraman wrote: > > On book3e, KVM uses load external pid (lwepx) dedicated instruction to read > > guest last instruction on the exit path. lwepx exceptions (DTLB_MISS, DSI > > and LRAT), generated by loading a guest address, needs to be handled by KVM. > > These exceptions are generated in a substituted guest translation context > > (EPLC[EGS] = 1) from host context (MSR[GS] = 0). > > > > Currently, KVM hooks only interrupts generated from guest context (MSR[GS] = 1), > > doing minimal checks on the fast path to avoid host performance degradation. > > lwepx exceptions originate from host state (MSR[GS] = 0) which implies > > additional checks in DO_KVM macro (beside the current MSR[GS] = 1) by looking > > at the Exception Syndrome Register (ESR[EPID]) and the External PID Load Context > > Register (EPLC[EGS]). Doing this on each Data TLB miss exception is obvious > > too intrusive for the host. > > > > Read guest last instruction from kvmppc_load_last_inst() by searching for the > > physical address and kmap it. This address the TODO for TLB eviction and > > execute-but-not-read entries, and allow us to get rid of lwepx until we are > > able to handle failures. > > > > A simple stress benchmark shows a 1% sys performance degradation compared with > > previous approach (lwepx without failure handling): > > > > time for i in `seq 1 10000`; do /bin/echo > /dev/null; done > > > > real 0m 8.85s > > user 0m 4.34s > > sys 0m 4.48s > > > > vs > > > > real 0m 8.84s > > user 0m 4.36s > > sys 0m 4.44s > > > > An alternative solution, to handle lwepx exceptions in KVM, is to temporary > > highjack the interrupt vector from host. Some cores share host IVOR registers > > between hardware threads, which is the case of FSL e6500, which impose additional > > synchronization logic for this solution to work. This optimized solution can > > be developed later on top of this patch. > > > > Signed-off-by: Mihai Caraman <mihai.caraman@xxxxxxxxxxxxx> > > --- > > v3: > > - reworked patch description > > - use unaltered kmap addr for kunmap > > - get last instruction before beeing preempted > > > > v2: > > - reworked patch description > > - used pr_* functions > > - addressed cosmetic feedback > > > > arch/powerpc/kvm/booke.c | 32 ++++++++++++ > > arch/powerpc/kvm/bookehv_interrupts.S | 37 ++++---------- > > arch/powerpc/kvm/e500_mmu_host.c | 93 +++++++++++++++++++++++++++++++++++ > > 3 files changed, 134 insertions(+), 28 deletions(-) > > > > diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c > > index 34a42b9..4ef52a8 100644 > > --- a/arch/powerpc/kvm/booke.c > > +++ b/arch/powerpc/kvm/booke.c > > @@ -880,6 +880,8 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu, > > int r = RESUME_HOST; > > int s; > > int idx; > > + u32 last_inst = KVM_INST_FETCH_FAILED; > > + enum emulation_result emulated = EMULATE_DONE; > > > > /* update before a new last_exit_type is rewritten */ > > kvmppc_update_timing_stats(vcpu); > > @@ -887,6 +889,15 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu, > > /* restart interrupts if they were meant for the host */ > > kvmppc_restart_interrupt(vcpu, exit_nr); > > > > + /* > > + * get last instruction before beeing preempted > > + * TODO: for e6500 check also BOOKE_INTERRUPT_LRAT_ERROR & ESR_DATA > > + */ > > + if (exit_nr == BOOKE_INTERRUPT_DATA_STORAGE || > > + exit_nr == BOOKE_INTERRUPT_DTLB_MISS || > > + exit_nr == BOOKE_INTERRUPT_HV_PRIV) > > Please make this a switch() - that's easier to read. > > > + emulated = kvmppc_get_last_inst(vcpu, false, &last_inst); > > + > > local_irq_enable(); > > > > trace_kvm_exit(exit_nr, vcpu); > > @@ -895,6 +906,26 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu, > > run->exit_reason = KVM_EXIT_UNKNOWN; > > run->ready_for_interrupt_injection = 1; > > > > + switch (emulated) { > > + case EMULATE_AGAIN: > > + r = RESUME_GUEST; > > + goto out; > > + > > + case EMULATE_FAIL: > > + pr_debug("%s: emulation at %lx failed (%08x)\n", > > + __func__, vcpu->arch.pc, last_inst); > > + /* For debugging, encode the failing instruction and > > + * report it to userspace. */ > > + run->hw.hardware_exit_reason = ~0ULL << 32; > > + run->hw.hardware_exit_reason |= last_inst; > > + kvmppc_core_queue_program(vcpu, ESR_PIL); > > + r = RESUME_HOST; > > + goto out; > > + > > + default: > > + break; > > + } > > I think you can just put this into a function. > > Scott, I think the patch overall looks quite good. Can you please check > as well and if you agree give it your reviewed-by? Mike, when Scott > gives you a reviewed-by, please include it for the next version. > > > Alex > > > + > > switch (exit_nr) { > > case BOOKE_INTERRUPT_MACHINE_CHECK: > > printk("MACHINE CHECK: %lx\n", mfspr(SPRN_MCSR)); > > @@ -1184,6 +1215,7 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu, > > BUG(); > > } > > > > +out: > > /* > > * To avoid clobbering exit_reason, only check for signals if we > > * aren't already exiting to userspace for some other reason. > > diff --git a/arch/powerpc/kvm/bookehv_interrupts.S b/arch/powerpc/kvm/bookehv_interrupts.S > > index 6ff4480..e000b39 100644 > > --- a/arch/powerpc/kvm/bookehv_interrupts.S > > +++ b/arch/powerpc/kvm/bookehv_interrupts.S > > @@ -121,38 +121,14 @@ > > 1: > > > > .if \flags & NEED_EMU > > - /* > > - * This assumes you have external PID support. > > - * To support a bookehv CPU without external PID, you'll > > - * need to look up the TLB entry and create a temporary mapping. > > - * > > - * FIXME: we don't currently handle if the lwepx faults. PR-mode > > - * booke doesn't handle it either. Since Linux doesn't use > > - * broadcast tlbivax anymore, the only way this should happen is > > - * if the guest maps its memory execute-but-not-read, or if we > > - * somehow take a TLB miss in the middle of this entry code and > > - * evict the relevant entry. On e500mc, all kernel lowmem is > > - * bolted into TLB1 large page mappings, and we don't use > > - * broadcast invalidates, so we should not take a TLB miss here. > > - * > > - * Later we'll need to deal with faults here. Disallowing guest > > - * mappings that are execute-but-not-read could be an option on > > - * e500mc, but not on chips with an LRAT if it is used. > > - */ > > - > > - mfspr r3, SPRN_EPLC /* will already have correct ELPID and EGS */ > > PPC_STL r15, VCPU_GPR(R15)(r4) > > PPC_STL r16, VCPU_GPR(R16)(r4) > > PPC_STL r17, VCPU_GPR(R17)(r4) > > PPC_STL r18, VCPU_GPR(R18)(r4) > > PPC_STL r19, VCPU_GPR(R19)(r4) > > - mr r8, r3 > > PPC_STL r20, VCPU_GPR(R20)(r4) > > - rlwimi r8, r6, EPC_EAS_SHIFT - MSR_IR_LG, EPC_EAS > > PPC_STL r21, VCPU_GPR(R21)(r4) > > - rlwimi r8, r6, EPC_EPR_SHIFT - MSR_PR_LG, EPC_EPR > > PPC_STL r22, VCPU_GPR(R22)(r4) > > - rlwimi r8, r10, EPC_EPID_SHIFT, EPC_EPID > > PPC_STL r23, VCPU_GPR(R23)(r4) > > PPC_STL r24, VCPU_GPR(R24)(r4) > > PPC_STL r25, VCPU_GPR(R25)(r4) > > @@ -162,10 +138,15 @@ > > PPC_STL r29, VCPU_GPR(R29)(r4) > > PPC_STL r30, VCPU_GPR(R30)(r4) > > PPC_STL r31, VCPU_GPR(R31)(r4) > > - mtspr SPRN_EPLC, r8 > > - isync > > - lwepx r9, 0, r5 > > - mtspr SPRN_EPLC, r3 > > + > > + /* > > + * We don't use external PID support. lwepx faults would need to be > > + * handled by KVM and this implies aditional code in DO_KVM (for > > + * DTB_MISS, DSI and LRAT) to check ESR[EPID] and EPLC[EGS] which > > + * is too intrusive for the host. Get last instuction in > > + * kvmppc_get_last_inst(). > > + */ > > + li r9, KVM_INST_FETCH_FAILED > > stw r9, VCPU_LAST_INST(r4) > > .endif > > > > diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c > > index f692c12..0528fe5 100644 > > --- a/arch/powerpc/kvm/e500_mmu_host.c > > +++ b/arch/powerpc/kvm/e500_mmu_host.c > > @@ -606,10 +606,103 @@ void kvmppc_mmu_map(struct kvm_vcpu *vcpu, u64 eaddr, gpa_t gpaddr, > > } > > } > > > > +#ifdef CONFIG_KVM_BOOKE_HV > > int kvmppc_load_last_inst(struct kvm_vcpu *vcpu, bool prev, u32 *instr) > > { > > + gva_t geaddr; > > + hpa_t addr; > > + hfn_t pfn; > > + hva_t eaddr; > > + u32 mas0, mas1, mas2, mas3; > > + u64 mas7_mas3; > > + struct page *page; > > + unsigned int addr_space, psize_shift; > > + bool pr; > > + unsigned long flags; > > + > > + WARN_ON_ONCE(prev); What are the semantics of "prev"? > > + /* Search TLB for guest pc to get the real address */ > > + geaddr = kvmppc_get_pc(vcpu); > > + > > + addr_space = (vcpu->arch.shared->msr & MSR_IS) >> MSR_IR_LG; > > + > > + local_irq_save(flags); > > + mtspr(SPRN_MAS6, (vcpu->arch.pid << MAS6_SPID_SHIFT) | addr_space); > > + mtspr(SPRN_MAS5, MAS5_SGS | vcpu->kvm->arch.lpid); > > + asm volatile("tlbsx 0, %[geaddr]\n" : : > > + [geaddr] "r" (geaddr)); > > + mtspr(SPRN_MAS5, 0); > > + mtspr(SPRN_MAS8, 0); > > + mas0 = mfspr(SPRN_MAS0); > > + mas1 = mfspr(SPRN_MAS1); > > + mas2 = mfspr(SPRN_MAS2); > > + mas3 = mfspr(SPRN_MAS3); > > + mas7_mas3 = (((u64) mfspr(SPRN_MAS7)) << 32) | mas3; > > + local_irq_restore(flags); Where is mas0 used? On 64-bit you could read directly from SPRN_MAS7_MAS3. > > + /* > > + * If the TLB entry for guest pc was evicted, return to the guest. > > + * There are high chances to find a valid TLB entry next time. > > + */ > > + if (!(mas1 & MAS1_VALID)) > > + return EMULATE_AGAIN; > > + > > + /* > > + * Another thread may rewrite the TLB entry in parallel, don't > > + * execute from the address if the execute permission is not set > > + */ > > + pr = vcpu->arch.shared->msr & MSR_PR; > > + if ((pr && !(mas3 & MAS3_UX)) || (!pr && !(mas3 & MAS3_SX))) { > > + pr_debug("Instuction emulation from a guest page\n" > > + "withot execute permission\n"); Don't split user-visible strings. s/withot/without/ Give more context in the message, such as __func__ and the address. > > + return EMULATE_FAIL; > > + } > > + > > + /* > > + * We will map the real address through a cacheable page, so we will > > + * not support cache-inhibited guest pages. Fortunately emulated > > + * instructions should not live there. > > + */ > > + if (mas2 & MAS2_I) { > > + pr_debug("Instuction emulation from cache-inhibited\n" > > + "guest pages is not supported\n"); > > + return EMULATE_FAIL; > > + } Mismatches in M or W are bad as well, but shouldn't the page_is_ram() protect us? Except when LRAT is used, but there we can't prevent mismatches anyway. > > + /* Get page size */ > > + psize_shift = MAS1_GET_TSIZE(mas1) + 10; > > + > > + /* Map a page and get guest's instruction */ > > + addr = (mas7_mas3 & (~0ULL << psize_shift)) | > > + (geaddr & ((1ULL << psize_shift) - 1ULL)); > > + pfn = addr >> PAGE_SHIFT; > > + > > + /* Guard us against emulation from devices area */ > > + if (unlikely(!page_is_ram(pfn))) { > > + pr_debug("Instruction emulation from non-RAM host\n" > > + "pages is not supported\n"); > > + return EMULATE_FAIL; > > + } > > + > > + if (unlikely(!pfn_valid(pfn))) { > > + pr_debug("Invalid frame number\n"); > > + return EMULATE_FAIL; > > + } Can we ever have page_is_ram(pfn) && !pfn_valid(pfn)? -Scott -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html