We just introduced a new PV interface that screams for documentation. So here it is - a shiny new and awesome text file describing the internal works of the PPC KVM paravirtual interface. Signed-off-by: Alexander Graf <agraf@xxxxxxx> --- v1 -> v2: - clarify guest implementation - clarify that privileged instructions still work - explain safe MSR bits - Fix dsisr patch description - change hypervisor calls to use new register values --- Documentation/kvm/ppc-pv.txt | 185 ++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 185 insertions(+), 0 deletions(-) create mode 100644 Documentation/kvm/ppc-pv.txt diff --git a/Documentation/kvm/ppc-pv.txt b/Documentation/kvm/ppc-pv.txt new file mode 100644 index 0000000..82de6c6 --- /dev/null +++ b/Documentation/kvm/ppc-pv.txt @@ -0,0 +1,185 @@ +The PPC KVM paravirtual interface +================================= + +The basic execution principle by which KVM on PowerPC works is to run all kernel +space code in PR=1 which is user space. This way we trap all privileged +instructions and can emulate them accordingly. + +Unfortunately that is also the downfall. There are quite some privileged +instructions that needlessly return us to the hypervisor even though they +could be handled differently. + +This is what the PPC PV interface helps with. It takes privileged instructions +and transforms them into unprivileged ones with some help from the hypervisor. +This cuts down virtualization costs by about 50% on some of my benchmarks. + +The code for that interface can be found in arch/powerpc/kernel/kvm* + +Querying for existence +====================== + +To find out if we're running on KVM or not, we overlay the PVR register. Usually +the PVR register contains an id that identifies your CPU type. If, however, you +pass KVM_PVR_PARA in the register that you want the PVR result in, the register +still contains KVM_PVR_PARA after the mfpvr call. + + LOAD_REG_IMM(r5, KVM_PVR_PARA) + mfpvr r5 + [r5 still contains KVM_PVR_PARA] + +Once determined to run under a PV capable KVM, you can now use hypercalls as +described below. + +PPC hypercalls +============== + +The only viable ways to reliably get from guest context to host context are: + + 1) Call an invalid instruction + 2) Call the "sc" instruction with a parameter to "sc" + 3) Call the "sc" instruction with parameters in GPRs + +Method 1 is always a bad idea. Invalid instructions can be replaced later on +by valid instructions, rendering the interface broken. + +Method 2 also has downfalls. If the parameter to "sc" is != 0 the spec is +rather unclear if the sc is targeted directly for the hypervisor or the +supervisor. It would also require that we read the syscall issuing instruction +every time a syscall is issued, slowing down guest syscalls. + +Method 3 is what KVM uses. We pass magic constants (KVM_SC_MAGIC_R0 and +KVM_SC_MAGIC_R3) in r0 and r3 respectively. If a syscall instruction with these +magic values arrives from the guest's kernel mode, we take the syscall as a +hypercall. + +The parameters are as follows: + + r0 KVM_SC_MAGIC_R0 + r3 KVM_SC_MAGIC_R3 Return code + r4 Hypercall number + r5 First parameter + r6 Second parameter + r7 Third parameter + r8 Fourth parameter + +Hypercall definitions are shared in generic code, so the same hypercall numbers +apply for x86 and powerpc alike. + +The magic page +============== + +To enable communication between the hypervisor and guest there is a new shared +page that contains parts of supervisor visible register state. The guest can +map this shared page using the KVM hypercall KVM_HC_PPC_MAP_MAGIC_PAGE. + +With this hypercall issued the guest always gets the magic page mapped at the +desired location in effective and physical address space. For now, we always +map the page to -4096. This way we can access it using absolute load and store +functions. The following instruction reads the first field of the magic page: + + ld rX, -4096(0) + +The interface is designed to be extensible should there be need later to add +additional registers to the magic page. If you add fields to the magic page, +also define a new hypercall feature to indicate that the host can give you more +registers. Only if the host supports the additional features, make use of them. + +The magic page has the following layout as described in +arch/powerpc/include/asm/kvm_para.h: + +struct kvm_vcpu_arch_shared { + __u64 scratch1; + __u64 scratch2; + __u64 scratch3; + __u64 critical; /* Guest may not get interrupts if == r1 */ + __u64 sprg0; + __u64 sprg1; + __u64 sprg2; + __u64 sprg3; + __u64 srr0; + __u64 srr1; + __u64 dar; + __u64 msr; + __u32 dsisr; + __u32 int_pending; /* Tells the guest if we have an interrupt */ +}; + +Additions to the page must only occur at the end. Struct fields are always 32 +bit aligned. + +MSR bits +======== + +The MSR contains bits that require hypervisor intervention and bits that do +not require direct hypervisor intervention because they only get interpreted +when entering the guest or don't have any impact on the hypervisor's behavior. + +The following bits are safe to be set inside the guest: + + MSR_EE + MSR_RI + MSR_CR + MSR_ME + +If any other bit changes in the MSR, please still use mtmsr(d). + +Patched instructions +==================== + +The "ld" and "std" instructions are transormed to "lwz" and "stw" instructions +respectively on 32 bit systems with an added offset of 4 to accomodate for big +endianness. + +The following is a list of mapping the Linux kernel performs when running as +guest. Implementing any of those mappings is optional, as the instruction traps +also act on the shared page. So calling privileged instructions still works as +before. + +From To +==== == + +mfmsr rX ld rX, magic_page->msr +mfsprg rX, 0 ld rX, magic_page->sprg0 +mfsprg rX, 1 ld rX, magic_page->sprg1 +mfsprg rX, 2 ld rX, magic_page->sprg2 +mfsprg rX, 3 ld rX, magic_page->sprg3 +mfsrr0 rX ld rX, magic_page->srr0 +mfsrr1 rX ld rX, magic_page->srr1 +mfdar rX ld rX, magic_page->dar +mfdsisr rX lwz rX, magic_page->dsisr + +mtmsr rX std rX, magic_page->msr +mtsprg 0, rX std rX, magic_page->sprg0 +mtsprg 1, rX std rX, magic_page->sprg1 +mtsprg 2, rX std rX, magic_page->sprg2 +mtsprg 3, rX std rX, magic_page->sprg3 +mtsrr0 rX std rX, magic_page->srr0 +mtsrr1 rX std rX, magic_page->srr1 +mtdar rX std rX, magic_page->dar +mtdsisr rX stw rX, magic_page->dsisr + +tlbsync nop + +mtmsrd rX, 0 b <special mtmsr section> +mtmsr b <special mtmsr section> + +mtmsrd rX, 1 b <special mtmsrd section> + +[BookE only] +wrteei [0|1] b <special wrteei section> + + +Some instructions require more logic to determine what's going on than a load +or store instruction can deliver. To enable patching of those, we keep some +RAM around where we can live translate instructions to. What happens is the +following: + + 1) copy emulation code to memory + 2) patch that code to fit the emulated instruction + 3) patch that code to return to the original pc + 4 + 4) patch the original instruction to branch to the new code + +That way we can inject an arbitrary amount of code as replacement for a single +instruction. This allows us to check for pending interrupts when setting EE=1 +for example. + -- 1.6.0.2 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html