[Android-virt] Partial review of kvm arm git patches.

cdall at cs.columbia.edu (Christoffer Dall) · Thu, 2 Feb 2012 09:48:22 -0500

On Thu, Feb 2, 2012 at 12:04 AM, Rusty Russell <rusty at rustcorp.com.au> wrote:
> Hi all,
>
> ? ? ? ?Started reading through the git tree at
> git://github.com/virtualopensystems/linux-kvm-arm.git (kvm-a15-v6-stage
> branch), and noticed some things. ?I'm learning ARM as I go, so
> apologies in advance for any dumb questions.
>
> Firstly, I want to say that reading this code was a pleasant surprise!
> Many of my comments below are nit-picking...

thanks.

I am a little confused though, did you not already send a lot of these
comments yesterday?

Also, I appreciate very much the review, but remember that I am not
done with this patch series - which is why it has not been sent out
yet but merely pushed to a staging branch.

>
> Psuedo-quoted below:
>
> 38049977d26ac3ca35cf5e18e45d0d54224749af wrote:
>> diff --git a/arch/arm/mach-vexpress/Kconfig b/arch/arm/mach-vexpress/Kconfig
>> index 0aa8542..1d42907 100644
>> --- a/arch/arm/mach-vexpress/Kconfig
>> +++ b/arch/arm/mach-vexpress/Kconfig
>> @@ -39,6 +39,7 @@ config ARCH_VEXPRESS_CA15X4
>> ? ? ? ? depends on VEXPRESS_EXTENDED_MEMORY_MAP
>> ? ? ? ? select CPU_V7
>> ? ? ? ? select ARM_ARCH_TIMER
>> + ? ? ? select ARM_VIRT_EXT
>> ? ? ? ? select ARM_GIC
>> ? ? ? ? select ARM_GIC_VPPI
>> ? ? ? ? select HAVE_SMP
>> diff --git a/arch/arm/mm/Kconfig b/arch/arm/mm/Kconfig
>> index 1a3ca24..5467b28 100644
>> --- a/arch/arm/mm/Kconfig
>> +++ b/arch/arm/mm/Kconfig
>> @@ -640,6 +640,14 @@ config ARM_LPAE
>>
>> ? ? ? ? ? If unsure, say N.
>>
>> +config ARM_VIRT_EXT
>> + ? ? ? bool "Support for ARM Virtualization Extensions"
>> + ? ? ? depends on ARM_LPAE
>> + ? ? ? help
>> + ? ? ? ? Say Y if you have an ARMv7 processor supporting the ARM hardware
>> + ? ? ? ? Virtualization extensions. KVM depends on this feature and will
>> + ? ? ? ? not run without it being selected.
>> +
>
> It's usually a bad idea to SELECT an option which has a prompt, or one
> which has dependencies. ?In this case, ARCH_VEXPRESS_CA15X4 will set
> ARM_VIRT_EXT without ARM_LPAE. ?You need to either select ARM_LPAE in
> ARCH_VEXPRESS_CA15X4, or not select ARM_VIRT_EXT and make that depend on
> ARM_LPAE and ARCH_VEXPRESS_CA15X4, or just make KVM depends on ARM_LPAE.
>
>> diff --git a/arch/arm/include/asm/kvm_para.h b/arch/arm/include/asm/kvm_para.h
>> new file mode 100644
>> index 0000000..7ce5f1c
>> --- /dev/null
>> +++ b/arch/arm/include/asm/kvm_para.h
>> @@ -0,0 +1,9 @@
>> +#ifndef _ASM_X86_KVM_PARA_H
>> +#define _ASM_X86_KVM_PARA_H
>
> I think you mean _ASM_ARM_ here. ?I know you only did this to see who
> was reading carefully :)
>
>> +static inline bool vcpu_mode_priv(struct kvm_vcpu *vcpu)
>> +{
>> + ? ? ? return ((vcpu_mode(vcpu)) == MODE_USR) ? 0 : 1;
>> +}
>
> Why not return vcpu_mode(vcpu) != MODE_USR ? And should MODE_UND be
> privileged?
>
>> +#ifndef __ARM_KVM_TRACE_H__
>> +#define __ARM_KVM_TRACE_H__
>> +
>> +#include <linux/types.h>
>> +#include <linux/kvm_types.h>
>> +#include <linux/kvm_host.h>
>> +
>> +void __kvm_print_msg(char *_fmt, ...);
>> +
>> +#define kvm_err(err, fmt, args...) do { ? ? ? ? ? ? ? ? ? ? ? ?\
>> + ? ? ? __kvm_print_msg(KERN_ERR "KVM error [%s:%d]: (%d) ", \
>> + ? ? ? ? ? ? ? ? ? ? ? __func__, __LINE__, err); \
>> + ? ? ? __kvm_print_msg(fmt "\n", ##args); \
>> +} while (0)
>> +
>> +#define __kvm_msg(fmt, args...) do { ? ? ? ? ? ? ? ? ? \
>> + ? ? ? __kvm_print_msg(KERN_ERR "KVM [%s:%d]: ", __func__, __LINE__); \
>> + ? ? ? __kvm_print_msg(fmt, ##args); \
>> +} while (0)
>> +
>> +#define kvm_msg(__fmt, __args...) __kvm_msg(__fmt "\n", ##__args)
>> +
>> +
>> +#define KVMARM_NOT_IMPLEMENTED() \
>> +{ \
>> + ? ? ? printk(KERN_ERR "KVM not implemented [%s:%d] in %s\n", \
>> + ? ? ? ? ? ? ? ? ? ? ? __FILE__, __LINE__, __func__); \
>> +}
>
> kvm_host.h already has:
>
> #define pr_unimpl(vcpu, fmt, ...) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? \
> ? ? ? ?pr_err_ratelimited("kvm: %i: cpu%i " fmt, ? ? ? ? ? ? ? ? ? ? ? \
> ? ? ? ? ? ? ? ? ? ? ? ? ? current->tgid, (vcpu)->vcpu_id , ## __VA_ARGS__)
>
> #define kvm_printf(kvm, fmt ...) printk(KERN_DEBUG fmt)
> #define vcpu_printf(vcpu, fmt...) kvm_printf(vcpu->kvm, fmt)
>
> But that should probably be converted to pr_debug(), which gives you
> #ifdef DEBUG and dynamic debug for free.
>
>> +static unsigned long vcpu_reg_offsets[MODE_SYS + 1][16] = {
>> + ? ? ? /* FIQ Registers */
>> + ? ? ? {
>> +
>
> const is preferred these days where possible, so we can put stuff in the
> r/o section.
>
>> +/*
>> + * Return a pointer to the register number valid in the specified mode of
>> + * the virtual CPU.
>> + */
>> +u32 *kvm_vcpu_reg(struct kvm_vcpu *vcpu, u8 reg_num, u32 mode)
>> +{
>> + ? ? ? BUG_ON(reg_num > 15);
>> + ? ? ? BUG_ON(mode > MODE_SYS);
>> +
>> + ? ? ? return (u32 *)((void *)&vcpu->arch + vcpu_reg_offsets[mode][reg_num]);
>> +}
>
> This is pretty neat! ?With only three different cases (?) I might have
> been tempted to use a switch, but this is definitely nicer.
>
>> +#define VM_STAT(x) (offsetof(struct kvm, stat.x), KVM_STAT_VM)
>> +#define VCPU_STAT(x) (offsetof(struct kvm_vcpu, stat.x), KVM_STAT_VCPU)
>> +
>> +struct kvm_stats_debugfs_item debugfs_entries[] = {
>> + ? ? ? { NULL }
>> +};
>
> That's weird. ?I see it's used in x86, not here though.
>
>> +int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
>> +{
>> + ? ? ? struct kvm_vcpu_regs *vcpu_regs = &vcpu->arch.regs;
>> +
>> + ? ? ? /*
>> + ? ? ? ?* GPRs and PSRs
>> + ? ? ? ?*/
>> + ? ? ? memcpy(regs->regs0_7, &(vcpu_regs->usr_regs[0]), sizeof(u32) * 8);
>> + ? ? ? memcpy(regs->usr_regs8_12, &(vcpu_regs->usr_regs[8]), sizeof(u32) * 5);
>> + ? ? ? memcpy(regs->fiq_regs8_12, &(vcpu_regs->fiq_regs[0]), sizeof(u32) * 5);
>> + ? ? ? regs->reg13[MODE_FIQ] = vcpu_regs->fiq_regs[5];
>> + ? ? ? regs->reg14[MODE_FIQ] = vcpu_regs->fiq_regs[6];
>> + ? ? ? regs->reg13[MODE_IRQ] = vcpu_regs->irq_regs[0];
>> + ? ? ? regs->reg14[MODE_IRQ] = vcpu_regs->irq_regs[1];
>> + ? ? ? regs->reg13[MODE_SVC] = vcpu_regs->svc_regs[0];
>> + ? ? ? regs->reg14[MODE_SVC] = vcpu_regs->svc_regs[1];
>> + ? ? ? regs->reg13[MODE_ABT] = vcpu_regs->abt_regs[0];
>> + ? ? ? regs->reg14[MODE_ABT] = vcpu_regs->abt_regs[1];
>> + ? ? ? regs->reg13[MODE_UND] = vcpu_regs->und_regs[0];
>> + ? ? ? regs->reg14[MODE_UND] = vcpu_regs->und_regs[1];
>> + ? ? ? regs->reg13[MODE_USR] = vcpu_regs->usr_regs[0];
>> + ? ? ? regs->reg14[MODE_USR] = vcpu_regs->usr_regs[1];
>
> Can we use the vcpu_reg_offsets[] logic here somehow, rather than
> open-coding the mapping again? ?Maybe not worth it...
>
> In 5cbffd9ca63ece23593e11eb0cdb1a937d398a0c:
>> -static void identity_mapping_add(pgd_t *pgd, unsigned long addr, unsigned long
>> +static void __identity_mapping_add(pgd_t *pgd, unsigned long addr,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?unsigned long end, bool hyp_mapping)
>> ?{
>> ? ? ? ? unsigned long prot, next;
>>
>> ? ? ? ? prot = PMD_TYPE_SECT | PMD_SECT_AP_WRITE | PMD_SECT_AF;
>> +
>> +#ifdef CONFIG_ARM_LPAE
>> + ? ? ? if (hyp_mapping)
>> + ? ? ? ? ? ? ? prot |= PMD_SECT_AP1;
>> +#endif
>> +
>> ? ? ? ? if (cpu_architecture() <= CPU_ARCH_ARMv5TEJ && !cpu_is_xscale())
>> ? ? ? ? ? ? ? ? prot |= PMD_BIT4;
>>
>> @@ -75,6 +83,11 @@ static void identity_mapping_add(pgd_t *pgd, unsigned long ad
>>
>> ?extern char ?__idmap_text_start[], __idmap_text_end[];
>>
>> +static void identity_mapping_add(pgd_t *pgd, unsigned long addr, unsigned long
>> +{
>> + ? ? ? __identity_mapping_add(pgd, addr, end, false);
>> +}
>> +
>> ?static int __init init_static_idmap(void)
>> ?{
>> ? ? ? ? phys_addr_t idmap_start, idmap_end;
>
> Since this only has one, internal caller, introducing this indirection
> is a bit weird. ?How about just changing the prototype of
> identity_mapping_add and the one caller?
>
>> +/*
>> + * This version actually frees the underlying pmds for all pgds in range and
>> + * clear the pgds themselves afterwards.
>> + */
>> +void hyp_identity_mapping_del(pgd_t *pgd, unsigned long addr, unsigned long end
>
> s/del/free/ maybe? ?Even truncate the names to hyp_idmap_add / hyp_idmap_free?
>
> In 23c09018329da88b36cad96a197420a08c9542f2:
>> ? ? By setting the HVBAR the kernel can execute code in Hyp-mode with
>> ? ? the MMU disabled. The HVBAR initially points to initialization code,
>> ? ? which initializes other Hyp-mode registers and enables the MMU
>> ? ? for Hyp-mode. Afterwards, the HVBAR is changed to point to KVM
>> ? ? Hyp vectors used to catch guest faults and to switch to Hyp mode
>> ? ? to perform a world-switch into a KVM guest.
>
> This overview really needs to be in the code somewhere, not (just) the
> commit message.
>

yeah, but this is subject to change, so....

>> + ? ? ? /*
>> + ? ? ? ?* Allocate stack pages for Hypervisor-mode
>> + ? ? ? ?*/
>> + ? ? ? for_each_possible_cpu(cpu) {
>> + ? ? ? ? ? ? ? void *stack_page;
>> +
>> + ? ? ? ? ? ? ? stack_page = (void *)__get_free_page(GFP_KERNEL);
>> + ? ? ? ? ? ? ? if (!stack_page) {
>> + ? ? ? ? ? ? ? ? ? ? ? err = -ENOMEM;
>> + ? ? ? ? ? ? ? ? ? ? ? goto out_free_pgd;
>> + ? ? ? ? ? ? ? }
>
> This leaks memory on error. ?You need to free the pages of already-done
> cpus. ?Since free_page() handles 0 address as noop, this is pretty easy
> to fix though.
>
>> +static void cpu_set_vector(void *vector)
>> +{
>> + ? ? register unsigned long vector_ptr asm("r0");
>> + ? ? register unsigned long smc_hyp_nr asm("r7");
>> +
>> + ? ? vector_ptr = (unsigned long)vector;
>> + ? ? smc_hyp_nr = SMCHYP_HVBAR_W;
>> +
>> + ? ? /*
>> + ? ? ?* Set the HVBAR
>> + ? ? ?*/
>> + ? ? asm volatile (
>> + ? ? ? ? ? ? "smc ? ?#0\n\t" : :
>> + ? ? ? ? ? ? [vector_ptr] "r" (vector_ptr),
>> + ? ? ? ? ? ? [smc_hyp_nr] "r" (smc_hyp_nr));
>> +}
>
> Ah, right, the bootloader sets a Secure Mode stub to allow us to change
> HVBAR. ?And this stub only has the SMCHYP_HVBAR_W function so far. ?Is
> there some standard here, or is it just made up? ?AFAICT it doesn't
> indicate failture if you hand it a different function, does it?
>
> What are the alternatives? ?Can we put the monitor vector somewhere the
> OS can change it? ?That would be nice, because then we can do
> *anything*...
>
>> + ? ? cpu_set_vector(vector);
>> +
>> + ? ? barrier();
>> +
>> + ? ? pgd_ptr = virt_to_phys(kvm_hyp_pgd);
>
> What's the barrier() for?
>

nothing - it's a staging branch.

I was debugging a boot issue, and it turned out that I forgot to save
r0 in the init code.

>> + ? ? asm volatile (
>> + ? ? ? ? ? ? "hvc ? ?#0\n\t" : :
>> + ? ? ? ? ? ? [pgd_ptr] "r" (pgd_ptr),
>> + ? ? ? ? ? ? [stack_ptr] "r" (hyp_stack_ptr));
>
> This bouncing into PL2 all the time seems a bit awkward. ?I assume Stuff
> Breaks if we try to stay in PL2 all the time?
>
>> ?void kvm_arch_exit(void)
>> ?{
>> + ? ? if (kvm_hyp_pgd) {
>> + ? ? ? ? ? ? free_hyp_pmds(kvm_hyp_pgd);
>> + ? ? ? ? ? ? kfree(kvm_hyp_pgd);
>> + ? ? ? ? ? ? kvm_hyp_pgd = NULL;
>> + ? ? }
>> ?}
>
> If we made it through init, kvm_hyp_pgd must be non-NULL.
>

ok

>> + ? ? addr = PAGE_OFFSET;
>> + ? ? end = ~0;
>> + ? ? do {
>> + ? ? ? ? ? ? next = pgd_addr_end(addr, end);
>> + ? ? ? ? ? ? pgd = hyp_pgd + pgd_index(addr);
>> + ? ? ? ? ? ? pud = pud_offset(pgd, addr);
>> +
>> + ? ? ? ? ? ? BUG_ON(pud_bad(*pud));
>> +
>> + ? ? ? ? ? ? if (pud_none(*pud))
>> + ? ? ? ? ? ? ? ? ? ? continue;
>> +
>> + ? ? ? ? ? ? pmd = pmd_offset(pud, addr);
>> + ? ? ? ? ? ? free_ptes(pmd, addr);
>> + ? ? ? ? ? ? pmd_free(NULL, pmd);
>> + ? ? } while (addr = next, addr != end);
>
> All your loops are like this, but I think a for () is sufficient.
> I don't think any of them are called start == end anyway, and it
> wouldn't matter if they were...
>

I don't understand this point. Why is a for() better. This scheme is
used elsewhere in the kernel as well.

> In 690593d1a6a98acba7d672dd3717f329e0e81bc9:
>
>> This commit introduces the framework for guest memory management
>> through the use of 2nd stage translation. Each VM has a pointer
>> to a level-1 tabled (the pgd field in struct kvm_arch) which is
>
> s/tabled/table/.
>

ack

>> ? ? Note: The custom trace_kvm_irq_line is used despite a generic definition of
>> ? ? trace_kvm_set_irq, since the trace-Kvm_set_irq depends on the x86-specific
>> ? ? define of __HAVE_IOAPIC. Either the trace event should be created
>> ? ? regardless of this define or it should depend on another ifdef clause,
>> ? ? common for both x86 and ARM. However, since the arguments don't really
>> ? ? match those used in ARM, I am yet to be convinced why this is necessary.
>
> Not true in this patch any more?
>

nope, not true anymore - staging branch :)

>> +#define VMID_FIRST_GENERATION ? ? ? ?(1 << VMID_BITS)
> ...
>> + ? ? next_vmid = VMID_FIRST_GENERATION;
>
> This looks weird; will revisit when I see how it's used.
>
> Ah, OK. ?Split counter, top 24 bits is the generation counter.
> Generation 0 is never used, so a 0 vmid (as it's initialized) is always
> considered obsolete and forces a new one to be allocated.
>
> A clue here might help?
>

sure

> In 6faedb12c0f4eec1370f0bfa6d2a651da2a30cf6:
>> This is documented in Documentation/kvm/api.txt.
>
> Now Documentation/virtual/kvm/api.txt.
>
>> +ARM uses two types of interrupt lines per CPU: IRQ and FIQ. When not using
>> +in-kernel GIC emulation, it is required to create an IRQ chip with chip_id=0.
>
> Idiomatic: s/it is required to/you must/. ?Or "creating an IRQ chip with
> chip_id=0 is required".
>

ok

> And I assume there is no in-kernel GIC emulation yet?
>

correct.

>> +
>> + ? ? /* In-kernel GIC-model: 0=none */
>> + ? ? int ? ?gic;
>> };
>
> And importantly, -1 = undefined, AFAICT.
>

yes, true

>> @@ -99,8 +102,7 @@ struct kvm_vcpu_arch {
>> ? ? ? u32 mmio_rd;
>>
>> ? ? ? /* Interrupt related fields */
>> - ? ? atomic_t irq_lines; ? ? /* IRQ and FIQ levels */
>> - ? ? u32 virt_irq; ? ? ? ? ? /* HCR exception mask */
>> + ? ? u32 irq_lines; ? ? ? ? ?/* IRQ and FIQ levels */
>> ? ? ? u32 wait_for_interrupts;
>> ?};
>
> This needs folding into a previous patch?
>

possibly.

>> + ? ? ? ? ? ? struct kvm_irqchip *chip = kmalloc(sizeof(*chip), GFP_KERNEL);
>> +
>> + ? ? ? ? ? ? r = -ENOMEM;
>> + ? ? ? ? ? ? if (!chip)
>> + ? ? ? ? ? ? ? ? ? ? goto out_get_irqchip;
>> + ? ? ? ? ? ? r = -ENXIO;
>> + ? ? ? ? ? ? if (kvm->arch.gic < 0)
>> + ? ? ? ? ? ? ? ? ? ? goto out_get_irqchip;
>> + ? ? ? ? ? ? memset(chip, 0, sizeof(*chip));
>
> kzalloc instead of kmalloc & memset?
>

yeah

>> diff --git a/arch/arm/kvm/trace.h b/arch/arm/kvm/trace.h
>> index f8869c1..f797dfd 100644
>> --- a/arch/arm/kvm/trace.h
>> +++ b/arch/arm/kvm/trace.h
>> @@ -39,8 +39,6 @@ TRACE_EVENT(kvm_exit,
>> ? ? ? TP_printk("PC: 0x%08lx", __entry->vcpu_pc)
>> ?);
>>
>> -
>> -
>> ?#endif /* _TRACE_KVM_H */
>>
>> ?#undef TRACE_INCLUDE_PATH
>
> Stray whitespace diff...
>

whoops, I'm going to play my staging branch card once more...

>> diff --git a/arch/x86/include/asm/kvm.h b/arch/x86/include/asm/kvm.h
>> index 4d8dcbd..05e3466 100644
>> --- a/arch/x86/include/asm/kvm.h
>> +++ b/arch/x86/include/asm/kvm.h
>> @@ -12,6 +12,7 @@
>> ?/* Select x86 specific features in <linux/kvm.h> */
>> ?#define __KVM_HAVE_PIT
>> ?#define __KVM_HAVE_IOAPIC
>> +#define __KVM_HAVE_IRQ_LINE
>> ?#define __KVM_HAVE_DEVICE_ASSIGNMENT
>> ?#define __KVM_HAVE_MSI
>> ?#define __KVM_HAVE_USER_NMI
>> diff --git a/include/linux/kvm.h b/include/linux/kvm.h
>> index 68e67e5..bdbc6b8 100644
>> --- a/include/linux/kvm.h
>> +++ b/include/linux/kvm.h
>> @@ -111,6 +111,7 @@ struct kvm_irq_level {
>> ? ? ? ?* ACPI gsi notion of irq.
>> ? ? ? ?* For IA-64 (APIC model) IOAPIC0: irq 0-23; IOAPIC1: irq 24-47..
>> ? ? ? ?* For X86 (standard AT mode) PIC0/1: irq 0-15. IOAPIC0: 0-23..
>> + ? ? ?* For ARM: IRQ: irq = (2*vcpu_index). FIQ: irq = (2*vcpu_indx + 1).
>> ? ? ? ?*/
>> ? ? ? union {
>> ? ? ? ? ? ? ? __u32 irq;
>> diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
>> index 46e3cd8..ae9779d 100644
>> --- a/include/trace/events/kvm.h
>> +++ b/include/trace/events/kvm.h
>> @@ -36,7 +36,7 @@ TRACE_EVENT(kvm_userspace_exit,
>> ? ? ? ? ? ? ? ? __entry->errno < 0 ? -__entry->errno : __entry->reason)
>> ?);
>>
>> -#if defined(__KVM_HAVE_IOAPIC)
>> +#if defined(__KVM_HAVE_IRQ_LINE)
>> ?TRACE_EVENT(kvm_set_irq,
>> ? ? ? TP_PROTO(unsigned int gsi, int level, int irq_source_id),
>> ? ? ? TP_ARGS(gsi, level, irq_source_id),
>> @@ -56,7 +56,9 @@ TRACE_EVENT(kvm_set_irq,
>> ? ? ? TP_printk("gsi %u level %d source %d",
>> ? ? ? ? ? ? ? ? __entry->gsi, __entry->level, __entry->irq_source_id)
>> ?);
>> +#endif
>>
>> +#if defined(__KVM_HAVE_IOAPIC)
>> ?#define kvm_deliver_mode ? ? ? ? ? ? \
>> ? ? ? {0x0, "Fixed"}, ? ? ? ? ? ? ? ? \
>> ? ? ? {0x1, "LowPrio"}, ? ? ? ? ? ? ? \
>
> This deserves its own patch, I think.
>
> In 397e0cbc06b8283561e7ac693732e93fe1aeb667:
>> ? ? Provides complete world-switch implementation to switch to other guests
>> ? ? runinng in non-secure modes. Includes Hyp exception handlers that
>
> s/runinng/running/.
>

ack

>> ? ? captures necessary exception information and stores the
>> ? ? information on
>
> s/captures/capture/
>

ack

>> ? ? Switching to Hyp mode is done through a simple HVC instructions. The
>> ? ? exception vector code will check that the HVC comes from VMID==0 and if
>> ? ? so will store the necessary state on the Hyp stack, which will look like
>> ? ? this (see hyp_hvc):
>> ? ? ? ...
>> ? ? ? Hyp_Sp + 4: lr_usr
>> ? ? ? Hyp_Sp ? ?: spsr (Host-SVC cpsr)
>>
>> ? ? When returning from Hyp mode to SVC mode, another HVC instruction is
>> ? ? executed from Hyp mode, which is taken in the Hyp_Svc handler. The Hyp
>> ? ? stack pointer should be where it was left from the above initial call,
>> ? ? since the values on the stack will be used to restore state (see
>> ? ? hyp_svc).
>
> Again, this information really belongs somewhere in the code.
>

I agree

>> diff --git a/arch/arm/kernel/asm-offsets.c b/arch/arm/kernel/asm-offsets.c
>> index 1429d89..50bf48d 100644
>> --- a/arch/arm/kernel/asm-offsets.c
>> +++ b/arch/arm/kernel/asm-offsets.c
>> @@ -16,6 +16,7 @@
>> ?#include <asm/cacheflush.h>
>> ?#include <asm/glue-df.h>
>> ?#include <asm/glue-pf.h>
>> +#include <linux/kvm_host.h>
>> ?#include <asm/mach/arch.h>
>> ?#include <asm/thread_info.h>
>> ?#include <asm/memory.h>
>
> Hmm, perhaps put this up next to the other linux/ includes?
>

sure, why not.

>> + ? ? /*
>> + ? ? ?* ?Check that the VMID is still valid.
>> + ? ? ?* ?(The hardware supports only 256 values with the value zero
>> + ? ? ?* ? reserved for the host, so we check if an assigned value has rolled
>> + ? ? ?* ? over a sequence, which requires us to assign a new value and flush
>> + ? ? ?* ? necessary caches and TLBs on all CPUs.)
>> + ? ? ?*/
>> + ? ? if (unlikely((kvm->arch.vmid ^ next_vmid) >> VMID_BITS))
>> + ? ? {
>
> Brace position, but otherwise OK....
>

ah

>> + ? ? ? ? ? ? if (unlikely((next_vmid & VMID_MASK) == 0)) {
>
> So, this covers the 8-bit wrap case...
>
>> + ? ? ? ? ? ? ? ? ? ? if (unlikely((next_vmid >> VMID_BITS) == 0))
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? next_vmid = VMID_FIRST_GENERATION;
>
> And this covers the 32 bit wrap case. ?It's really a complicated way of
> saying if (next_vmid == 0), right?
>

it covers the 32 bit wrap case, and yeah, it's like checking for zero.
I don't think it's really that complicated, but I don't feel strongly
about it, so I can change.

> But I think you have to do something more explicit here: 4bn context
> switches isn't that many, and a malicious guest could wait that out,
> couldn't it? ?And with two malicious guests, it might be feasible to hit
> the collision. ?A 64 bit counter would fix this. ?Or reset all the
> guests' vmids to 0 when we wrap?
>

that's true. It's not exactly 4bn context switches, but rather you
need more than 255 VMs that need to context switch in order for it to
be 4bn. Also, if the VMs stay in the guest, ie. the malicious guests,
then they will be flushed. So it's malicious users on the host.

Nevertheless, it should be fixed. Since this is in the context switch
path I would avoid the 64 bit counter, and rather go and reset the
vmid to 0 on the VMs with non-active vcpu's.

>> +@ This must be called from Hyp mode!
>> +@ Arguments:
>> +@ ?r0: pointer to vcpu struct
>> ?ENTRY(__kvm_vcpu_run)
>> + ? ? hvc ? ? #0 ? ? ? ? ? ? ? ? ? ? ?@ Change to Hyp-mode
>
> "This must be called from Hyp mode!" is wrong...
>

indeed. nice catch.

>> + ? ? hvc ? ? #0
>> +
>> + ? ? cmp ? ? r0, #ARM_EXCEPTION_IRQ
>> + ? ? bne ? ? return_to_ioctl
>
> I only know what that hvc #0 does because of the commit msg.
> How about a hint:
>
> ? ? ? ?hvc ? ? #0 ? ? ?@ Return from Hyp-mode, see hvc_svc
>

yeah, people have actually been getting this wrong so you're right.

> The rest was really well commented.
>

thanks

> In 67ba416aa6a0316b60ff1be30b87635a75ae0acc:
>> + ? ? ? ? ? ? if (signal_pending(current)) {
>> + ? ? ? ? ? ? ? ? ? ? if (!run->exit_reason) {
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ret = -EINTR;
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? run->exit_reason = KVM_EXIT_INTR;
>
> The !run->exit_reason becomes clearer later, but it'd be neater
> as an explicit "if (run->exit_reason == KVM_EXIT_UNKNOWN)".
>

ok.

>> + ? ? if (exception_index != ARM_EXCEPTION_HVC) {
>> + ? ? ? ? ? ? kvm_err(-EINVAL, "Unsupported exception type");
>> + ? ? ? ? ? ? return -EINVAL;
>> + ? ? }
>
> You really want to print out exception_index here. ?pr_unimpl?
>

that's right.

>> +static inline void print_cp_instr(struct coproc_params *p)
>> +{
>
> As a general rule, I prefer not to see "inline" in C files. ?If the
> function ever becomes unused, it's nice if gcc tells you. ?Though maybe
> here that's a feature?
>

just a hint to the compiler to inline. It will probably do that
anyway, and I like your argument, so sure.

Is the inline hint completely ignored by the compiler in terms of
actually inlining or not?

>> + ? ? if ((p->CRm == 0 || p->CRm == 1 || p->CRm == 4 || p->CRm == 8) &&
>> + ? ? ? ? (p->Op2 <= 7)) {
>> + ? ? ? ? ? ? /* TLB Lockdown operations - ignored */
>> + ? ? ? ? ? ? return 0;
>> + ? ? }
>
> It would be nice to emulate this accurately: AFAICT some of these could
> be r/o registers, where we should trap.
>

eh, I don't know. I don't think we want the guest to be able to
actually lock anything down in the TLB. That would be for some funny
guest RT support perhaps, but not right now.

Yes, they could trap, but I doubt sane guests rely on this feature in
any way, so this gets really low priority on my list.

> And isn't (p->op2 <= 7) always true?
>

yes in the instruction, but the field could have been set incorrectly
by some other code, and then I'd rather catch that error.

>> + ? ? if (p->CRm == 2 && p->Op2 == 0) {
>> + ? ? ? ? ? ? cp15_op(vcpu, p, c10_PRRR);
>> + ? ? ? ? ? ? return 0;
>> + ? ? }
>
> p->Op1 must be 0 as well, right?
>

yes, nice catch

>> + ? ? if (p->CRm == 2 && p->Op2 == 1) {
>> + ? ? ? ? ? ? cp15_op(vcpu, p, c10_NMRR);
>> + ? ? ? ? ? ? return 0;
>> + ? ? }
>
> Here too... ?And both these assume the guest is !LPAE. ?Fine for the
> moment, but it'd be nice to put the condition in here already.

well, we should check for LPAE actually.

>
> The accretion of loose emulation leads to QEMU-style emulation, where
> you can't distinguish sloppiness from deliberate hacks, and you're
> terrified to clean up anything because some weird setup will break. ?I'd
> really rather see us emulating as tightly as possible.
>

Ok. There should be no deliberate hacks in there. So anything
imprecise is sloppiness or at least something incomplete.

So I am not sure what you are suggesting. Something completely
different or just fixing up the emulation code?

No one has really looked at the emulation code before now except than
me, and my implementation has been based on booting a guest with
severely limited test cycles, so...

> (And unit testing our emulation code, but that's another topic).
>

If you want to come up with a nice framework for this, be my guest ;)

> In a8ba44212b056eb174beeb3ac9d99205ec7c4184:
>> @@ -228,9 +251,11 @@ static int emulate_cp15_c10_access(struct kvm_vcpu *vcpu,
>> ? * @vcpu: The VCPU pointer
>> ? * @p: ? ?The coprocessor parameters struct pointer holding trap inst. details
>> ? *
>> - * The CP15 c15 register is implementation defined, but some guest kernels
>> - * attempt to read/write a diagnostics register here. We always return 0 and
>> - * ignore writes and hope for the best. This may need to be refined.
>> + * The CP15 c15 register is architecturally implementation defined, but some
>> + * guest kernels attempt to read/write a diagnostics register here. We always
>> + * return 0 and ignore writes and hope for the best.
>> + *
>> + * This may need to be refined.
>> ? */
>> ?static int emulate_cp15_c15_access(struct kvm_vcpu *vcpu,
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct coproc_params *p)
>
> This hunk belongs in a previous patch.
>

yeah, staging branch, again again.

>> + ? ? if (kvm_read_guest(vcpu->kvm, pc_ipa, &instr, sizeof(instr))) {
>> + ? ? ? ? ? ? kvm_err(-EFAULT, "Could not copy guest instruction");
>> + ? ? ? ? ? ? return -EFAULT;
>> + ? ? }
>> +
>> + ? ? if (vcpu->arch.regs.cpsr & PSR_T_BIT) {
>> + ? ? ? ? ? ? /* Need to decode thumb instructions as well */
>> + ? ? ? ? ? ? KVMARM_NOT_IMPLEMENTED();
>> + ? ? ? ? ? ? return -EINVAL;
>> + ? ? }
>> +
>> + ? ? return kvm_emulate_mmio_ls(vcpu, fault_ipa, instr);
>
> We could punt this nasty instruction decoding to userspace with a magic
> len=0 value or something, right? ?Or does it happen often? ?Or will we
> have to do it for GIC emulation later or something?
>

It happens more often than I'd hoped so I would like to keep this
in-kernel. It has been thoroughly tested actually, and adding
thumb-support is a small thing, so I think it should stay.

Let me fix thumb-mode here and let it try its way through a couple of
more patch reviews.

>> +static int kvm_ls_length(struct kvm_vcpu *vcpu, u32 instr)
>> +{
>> + ? ? int index;
>> +
>> + ? ? index = get_arm_ls_instr_index(instr);
>> + ? ? BUG_ON(index == INSTR_NONE);
>
> Technically not a bug (we checked the get_arm_ls_instr_index() in the
> caller), but we don't use index in this function, and it would be
> clearer to pass it in anyway.
>

no, let's remove the check here completely.

> We wouldn't want this assumption to move around: an SMP guest can
> certainly feed in bogus instructions, as can a malicious kvm process.
>

hmmm, nice. Yeah, let's not let them crash the kernel in this case,
but rather destroy themselves.

>> +/* Architecturally implementation defined CP15 register access */
>
> Belongs in previous patch, too.
>

ahem.

> In a8ba44212b056eb174beeb3ac9d99205ec7c4184:
>> @@ -449,8 +449,14 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
>> ? ? ? int ret;
>>
>> ? ? ? for (;;) {
>> - ? ? ? ? ? ? trace_kvm_entry(vcpu->arch.regs.pc);
>> + ? ? ? ? ? ? if (run->exit_reason == KVM_EXIT_MMIO) {
>> + ? ? ? ? ? ? ? ? ? ? ret = kvm_handle_mmio_return(vcpu, vcpu->run);
>> + ? ? ? ? ? ? ? ? ? ? if (ret)
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? break;
>> + ? ? ? ? ? ? }
>> + ? ? ? ? ? ? run->exit_reason = KVM_EXIT_UNKNOWN;
>>
>> + ? ? ? ? ? ? trace_kvm_entry(vcpu->arch.regs.pc);
>
> This works, but only because we always break out of the loop if
> ->exit_reason == KVM_EXIT_UNKNOWN. ?It would be clearer to pull the test
> out of the loop:

eh, we don't break out of the loop if it's KVM_EXIT_UNKNOWN. Oh, you
mean EXIT_MMIO.

yes, indeed that can go out of the loop.

>
> ? ? ? ?int ret;
>
> ? ? ? ?if (run->exit_reason == KVM_EXIT_MMIO) {
> ? ? ? ? ? ? ? ?ret = kvm_handle_mmio_return(vcpu, vcpu->run);
> ? ? ? ? ? ? ? ?if (ret)
> ? ? ? ? ? ? ? ? ? ? ? ?return ret;
> ? ? ? ?}
>
> ? ? ? ?for (;;) {
>
>

yeah, there you go.

> In c6632f26d712b62f7639c416540fc46dd220d260:
>> + ? ? ? ? ? ? if (vcpu->arch.wait_for_interrupts)
>> + ? ? ? ? ? ? ? ? ? ? goto wait_for_interrupts;
>> +
> ...
>> -
>> +wait_for_interrupts:
>> ? ? ? ? ? ? ? if (signal_pending(current)) {
>> ? ? ? ? ? ? ? ? ? ? ? if (!run->exit_reason) {
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ret = -EINTR;
>> @@ -491,6 +502,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
>> ? ? ? ? ? ? ? ? ? ? ? }
>> ? ? ? ? ? ? ? ? ? ? ? break;
>> ? ? ? ? ? ? ? }
>> +
>> + ? ? ? ? ? ? if (vcpu->arch.wait_for_interrupts)
>> + ? ? ? ? ? ? ? ? ? ? kvm_vcpu_block(vcpu);
>> ? ? ? }
>
> How about doing all the checking at the top of the loop, with a continue?
>
> ? ? ? ?/* Things which we check before running. */
> ? ? ? ?if (need_resched())
> ? ? ? ? ? ? ? ?kvm_resched(vcpu);
>
> ? ? ? ?if (signal_pending(current)) {
> ? ? ? ? ? ? ? ?...
> ? ? ? ?}
>
> ? ? ? ?if (vcpu->arch.wait_for_interrupts) {
> ? ? ? ? ? ? ? ?kvm_vcpu_block(vcpu);
> ? ? ? ? ? ? ? ?continue;
> ? ? ? ?}
>
> ? ? ? ?/* Now nothing can stop us! */
> ? ? ? ?trace_kvm_entry(vcpu->arch.regs.pc);
> ? ? ? ?...
>
> This has the added bonus that the for(;;) can now be "do { ... } while
> (run->exit_reason == KVM_EXIT_MMIO);" which reflects the non-error path
> neatly.

I don't think it's clearer, but it's personal taste of course. The
thing that I prefer with the current implementation is that you have a
feel for what you want to check as a result of the vcpu having run for
a while - as opposed to something you wish to check directly on entry
to the ioctl - these cases will almost always happen in at least the
second iteration. Also, it looks more like the implementation of the
other archs, which I somehow like.

>
> I look forward to your responses, esp. where you explain what I got
> wrong!
>

thanks for the comprehensive review.

Looking forward for more to come. I will try to release a v6 series
real soon, so you don't have to spend time on stuff that is merged in
the wrong patch etc.

-Christoffer