On 2012-02-11 16:25, Blue Swirl wrote: > On Fri, Feb 10, 2012 at 18:31, Jan Kiszka <jan.kiszka@xxxxxxxxxxx> wrote: >> This enables acceleration for MMIO-based TPR registers accesses of >> 32-bit Windows guest systems. It is mostly useful with KVM enabled, >> either on older Intel CPUs (without flexpriority feature, can also be >> manually disabled for testing) or any current AMD processor. >> >> The approach introduced here is derived from the original version of >> qemu-kvm. It was refactored, documented, and extended by support for >> user space APIC emulation, both with and without KVM acceleration. The >> VMState format was kept compatible, so was the ABI to the option ROM >> that implements the guest-side para-virtualized driver service. This >> enables seamless migration from qemu-kvm to upstream or, one day, >> between KVM and TCG mode. >> >> The basic concept goes like this: >> - VAPIC PV interface consisting of I/O port 0x7e and (for KVM in-kernel >> irqchip) a vmcall hypercall is registered >> - VAPIC option ROM is loaded into guest >> - option ROM activates TPR MMIO access reporting via port 0x7e >> - TPR accesses are trapped and patched in the guest to call into option >> ROM instead, VAPIC support is enabled >> - option ROM TPR helpers track state in memory and invoke hypercall to >> poll for pending IRQs if required >> >> Signed-off-by: Jan Kiszka <jan.kiszka@xxxxxxxxxxx> > > I must say that I find the approach horrible, patching guests and ROMs > and looking up Windows internals. Taking the same approach to extreme, > we could for example patch Xen guest to become a KVM guest. Not that I > object merging. Yes, this is horrible. But there is no real better way in the absence of hardware assisted virtualization of the TPR. I think MS is recommending this patching approach as well. >> diff --git a/hw/apic.c b/hw/apic.c >> index 086c544..2ebf3ca 100644 >> --- a/hw/apic.c >> +++ b/hw/apic.c >> @@ -35,6 +35,10 @@ >> #define MSI_ADDR_DEST_ID_SHIFT 12 >> #define MSI_ADDR_DEST_ID_MASK 0x00ffff0 >> >> +#define SYNC_FROM_VAPIC 0x1 >> +#define SYNC_TO_VAPIC 0x2 >> +#define SYNC_ISR_IRR_TO_VAPIC 0x4 > > Enum, please. OK. > >> + >> static APICCommonState *local_apics[MAX_APICS + 1]; >> >> static void apic_set_irq(APICCommonState *s, int vector_num, int trigger_mode); >> @@ -78,6 +82,70 @@ static inline int get_bit(uint32_t *tab, int index) >> return !!(tab[i] & mask); >> } >> >> +/* return -1 if no bit is set */ >> +static int get_highest_priority_int(uint32_t *tab) >> +{ >> + int i; >> + for (i = 7; i >= 0; i--) { >> + if (tab[i] != 0) { >> + return i * 32 + fls_bit(tab[i]); >> + } >> + } >> + return -1; >> +} >> + >> +static void apic_sync_vapic(APICCommonState *s, int sync_type) >> +{ >> + VAPICState vapic_state; >> + size_t length; >> + off_t start; >> + int vector; >> + >> + if (!s->vapic_paddr) { >> + return; >> + } >> + if (sync_type & SYNC_FROM_VAPIC) { >> + cpu_physical_memory_rw(s->vapic_paddr, (void *)&vapic_state, >> + sizeof(vapic_state), 0); >> + s->tpr = vapic_state.tpr; >> + } >> + if (sync_type & (SYNC_TO_VAPIC | SYNC_ISR_IRR_TO_VAPIC)) { >> + start = offsetof(VAPICState, isr); >> + length = offsetof(VAPICState, enabled) - offsetof(VAPICState, isr); >> + >> + if (sync_type & SYNC_TO_VAPIC) { >> + assert(qemu_cpu_is_self(s->cpu_env)); >> + >> + vapic_state.tpr = s->tpr; >> + vapic_state.enabled = 1; >> + start = 0; >> + length = sizeof(VAPICState); >> + } >> + >> + vector = get_highest_priority_int(s->isr); >> + if (vector < 0) { >> + vector = 0; >> + } >> + vapic_state.isr = vector & 0xf0; >> + >> + vapic_state.zero = 0; >> + >> + vector = get_highest_priority_int(s->irr); >> + if (vector < 0) { >> + vector = 0; >> + } >> + vapic_state.irr = vector & 0xff; >> + >> + cpu_physical_memory_write_rom(s->vapic_paddr + start, >> + ((void *)&vapic_state) + start, length); > > This assumes that the vapic_state structure matches guest what guest > expect without conversion. Is this true for i386 on x86_64? I didn't > check the structure in question. Yes, the structure in question is a packed one, stable on both guest and host side (the guest side is 32-bit only anyway). >> diff --git a/hw/apic_common.c b/hw/apic_common.c >> index 588531b..1977da7 100644 >> --- a/hw/apic_common.c >> +++ b/hw/apic_common.c >> @@ -20,8 +20,10 @@ >> #include "apic.h" >> #include "apic_internal.h" >> #include "trace.h" >> +#include "kvm.h" >> >> static int apic_irq_delivered; >> +bool apic_report_tpr_access; > > This should go to APICCommonState. Nope, it is a global state, also checked in a place where the APIC is set up, thus have no local clue about it yet and needs to pick up the global view. >> @@ -238,6 +275,7 @@ static int apic_init_common(SysBusDevice *dev) >> { >> APICCommonState *s = APIC_COMMON(dev); >> APICCommonClass *info; >> + static DeviceState *vapic; >> static int apic_no; >> >> if (apic_no >= MAX_APICS) { >> @@ -248,10 +286,29 @@ static int apic_init_common(SysBusDevice *dev) >> info = APIC_COMMON_GET_CLASS(s); >> info->init(s); >> >> - sysbus_init_mmio(&s->busdev, &s->io_memory); >> + sysbus_init_mmio(dev, &s->io_memory); >> + >> + if (!vapic && s->vapic_control & VAPIC_ENABLE_MASK) { >> + vapic = sysbus_create_simple("kvmvapic", -1, NULL); >> + } >> + s->vapic = vapic; >> + if (apic_report_tpr_access && info->enable_tpr_reporting) { > > I think you should not rely on apic_report_tpr_access being in sane > condition during class init. It is mandatory, e.g. for CPU hotplug, as reporting needs to be consistent accross all VCPUs. Therefore it is a static global, set to false initially. However, you are right, we lack proper clearing of the access report feature on reset, not only in this variable. >> diff --git a/hw/kvmvapic.c b/hw/kvmvapic.c >> new file mode 100644 >> index 0000000..0c4d304 >> --- /dev/null >> +++ b/hw/kvmvapic.c >> @@ -0,0 +1,774 @@ >> +/* >> + * TPR optimization for 32-bit Windows guests >> + * >> + * Copyright (C) 2007-2008 Qumranet Technologies >> + * Copyright (C) 2012 Jan Kiszka, Siemens AG >> + * >> + * This work is licensed under the terms of the GNU GPL version 2, or >> + * (at your option) any later version. See the COPYING file in the >> + * top-level directory. >> + */ >> +#include "sysemu.h" >> +#include "cpus.h" >> +#include "kvm.h" >> +#include "apic_internal.h" >> + >> +#define APIC_DEFAULT_ADDRESS 0xfee00000 >> + >> +#define VAPIC_IO_PORT 0x7e >> + >> +#define VAPIC_INACTIVE 0 >> +#define VAPIC_ACTIVE 1 >> +#define VAPIC_STANDBY 2 > > Enums, please. OK. > >> + >> +#define VAPIC_CPU_SHIFT 7 >> + >> +#define ROM_BLOCK_SIZE 512 >> +#define ROM_BLOCK_MASK (~(ROM_BLOCK_SIZE - 1)) >> + >> +typedef struct VAPICHandlers { >> + uint32_t set_tpr; >> + uint32_t set_tpr_eax; >> + uint32_t get_tpr[8]; >> + uint32_t get_tpr_stack; >> +} QEMU_PACKED VAPICHandlers; >> + >> +typedef struct GuestROMState { >> + char signature[8]; >> + uint32_t vaddr; > > This does not look 64 bit clean. It's packed. > >> + uint32_t fixup_start; >> + uint32_t fixup_end; >> + uint32_t vapic_vaddr; >> + uint32_t vapic_size; >> + uint32_t vcpu_shift; >> + uint32_t real_tpr_addr; >> + VAPICHandlers up; >> + VAPICHandlers mp; >> +} QEMU_PACKED GuestROMState; > > Why packed, is this passed to guest directly? It is a data field in the option ROM, see vapic_base in kvmvapic.S. > >> + >> +typedef struct VAPICROMState { >> + SysBusDevice busdev; >> + MemoryRegion io; >> + MemoryRegion rom; >> + bool rom_mapped_writable; > > I'd put this later to avoid a structure hole. Moving it after rom_state may save us a few precious bytes. Well, ok. :) > >> + uint32_t state; >> + uint32_t rom_state_paddr; >> + uint32_t rom_state_vaddr; >> + uint32_t vapic_paddr; >> + uint32_t real_tpr_addr; >> + GuestROMState rom_state; >> + size_t rom_size; >> +} VAPICROMState; >> + >> +#define TPR_INSTR_IS_WRITE 0x1 >> +#define TPR_INSTR_ABS_MODRM 0x2 >> +#define TPR_INSTR_MATCH_MODRM_REG 0x4 >> + >> +typedef struct TPRInstruction { >> + uint8_t opcode; >> + uint8_t modrm_reg; >> + unsigned int flags; >> + size_t length; >> + off_t addr_offset; >> +} TPRInstruction; > > Also here the order is pessimized. Don't see the gain here, though. >> +static int find_real_tpr_addr(VAPICROMState *s, CPUState *env) >> +{ >> + target_phys_addr_t paddr; >> + target_ulong addr; >> + >> + if (s->state == VAPIC_ACTIVE) { >> + return 0; >> + } >> + for (addr = 0xfffff000; addr >= 0x80000000; addr -= TARGET_PAGE_SIZE) { >> + paddr = cpu_get_phys_page_debug(env, addr); >> + if (paddr != APIC_DEFAULT_ADDRESS) { >> + continue; >> + } >> + s->real_tpr_addr = addr + 0x80; >> + update_guest_rom_state(s); >> + return 0; >> + } > > This loop looks odd, what should it do, probe for unused address? Seems to deserve a comment: We have to scan for the guest's mapping of the APIC as we enter here without a hint from an TPR accessing instruction. So we probe the potential range, trying to find the page that maps to that known physical address (known in the sense that Windows does not remap the APIC physically - nor does QEMU support that so far). >> +static int evaluate_tpr_instruction(VAPICROMState *s, CPUState *env, >> + target_ulong *pip, int access) >> +{ >> + const TPRInstruction *instr; >> + target_ulong ip = *pip; >> + uint8_t opcode[2]; >> + uint32_t real_tpr_addr; >> + int i; >> + >> + if ((ip & 0xf0000000) != 0x80000000 && (ip & 0xf0000000) != 0xe0000000) { > > The constants should be using ULL suffix because target_ulong could be > 64 bit, though maybe this is more optimal. target_ulong is 64-bit unconditionally on x86. I'll add this. >> + >> +/* >> + * Tries to read the unique processor number from the Kernel Processor Control >> + * Region (KPCR) of 32-bit Windows. Returns -1 if the KPCR cannot be accessed >> + * or is considered invalid. >> + */ > > Horrible hack. Is guest OS type or version checked somewhere? This is all about hacking Windows 32-bit. And this check encodes that even stronger. The other important binding is the expected virtual address of the ROM mapping under Windows. I would have preferred checking the version directly, but no one has a complete list of supported guests and their codes. > >> +static int get_kpcr_number(CPUState *env) >> +{ >> + struct kpcr { >> + uint8_t fill1[0x1c]; >> + uint32_t self; >> + uint8_t fill2[0x31]; >> + uint8_t number; >> + } QEMU_PACKED kpcr; > > KPCR. Pointers to Windows documentation would be nice. Oops, yes. Unfortunately, this is only an internal structure, not officially documented by MS. However, all supported OS versions a legacy by now, no longer changing its structure. >> + >> +static int patch_hypercalls(VAPICROMState *s) >> +{ >> + target_phys_addr_t rom_paddr = s->rom_state_paddr & ROM_BLOCK_MASK; >> + static uint8_t vmcall_pattern[] = { > > const > >> + 0xb8, 0x1, 0, 0, 0, 0xf, 0x1, 0xc1 >> + }; >> + static uint8_t outl_pattern[] = { > > const Yep. >> +static void vapic_write(void *opaque, target_phys_addr_t addr, uint64_t data, >> + unsigned int size) >> +{ >> + CPUState *env = cpu_single_env; >> + target_phys_addr_t rom_paddr; >> + VAPICROMState *s = opaque; >> + >> + cpu_synchronize_state(env); >> + >> + /* >> + * The VAPIC supports two PIO-based hypercalls, both via port 0x7E. >> + * o 16-bit write access: >> + * Reports the option ROM initialization to the hypervisor. Written >> + * value is the offset of the state structure in the ROM. >> + * o 8-bit write access: >> + * Reactivates the VAPIC after a guest hibernation, i.e. after the >> + * option ROM content has been re-initialized by a guest power cycle. >> + * o 32-bit write access: >> + * Poll for pending IRQs, considering the current VAPIC state. >> + */ > > Different operation depending on size? Interesting. Originally not my idea, just added the third case. :) > >> + switch (size) { >> + case 2: >> + if (s->state != VAPIC_INACTIVE) { >> + patch_hypercalls(s); >> + break; >> + } >> + >> + rom_paddr = (env->segs[R_CS].base + env->eip) & ROM_BLOCK_MASK; >> + s->rom_state_paddr = rom_paddr + data; >> + >> + if (vapic_prepare(s) < 0) { >> + break; >> + } >> + s->state = VAPIC_STANDBY; >> + break; >> + case 1: >> + if (kvm_enabled()) { >> + /* >> + * Disable triggering instruction in ROM by writing a NOP. >> + * >> + * We cannot do this in TCG mode as the reported IP is not >> + * reliable. > > Given the hack level of the whole, it would not be impossible to find > the IP using search PC. Is there a specific pre-existing service you have in mind? Otherwise, the complexity might not be worth the gain. Thanks for having a look, Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html