On Sun, 2022-03-13 at 21:53 +0800, Chao Gao wrote: > On Sun, Mar 13, 2022 at 12:59:36PM +0200, Maxim Levitsky wrote: > > On Sun, 2022-03-13 at 11:19 +0200, Maxim Levitsky wrote: > > > On Fri, 2022-03-11 at 21:28 +0800, Zeng Guang wrote: > > > > On 3/11/2022 12:26 PM, Sean Christopherson wrote: > > > > > On Wed, Mar 09, 2022, Maxim Levitsky wrote: > > > > > > On Wed, 2022-03-09 at 06:01 +0000, Sean Christopherson wrote: > > > > > > > > Could you share the links? > > > > > > > > > > > > > > Doh, sorry (they're both in this one). > > > > > > > > > > > > > > https://lore.kernel.org/all/20220301135526.136554-5-mlevitsk@xxxxxxxxxx > > > > > > > > > > > > > > > > > > > > > > > > > > My opinion on this subject is very simple: we need to draw the line somewhere. > > > > > > > > > > ... > > > > > > > > > > > > > > > Since the goal is to simplify KVM, can we try the inhibit route and see what the > > > > > code looks like before making a decision? I think it might actually yield a less > > > > > awful KVM than the readonly approach, especially if the inhibit is "sticky", i.e. > > > > > we don't try to remove the inhibit on subsequent changes. > > > > > > > > > > Killing the VM, as proposed, is very user unfriendly as the user will have no idea > > > > > why the VM was killed. WARN is out of the question because this is user triggerable. > > > > > Returning an emulation error would be ideal, but getting that result up through > > > > > apic_mmio_write() could be annoying and end up being more complex. > > > > > > > > > > The touchpoints will all be the same, unless I'm missing something the difference > > > > > should only be a call to set an inhibit instead killing the VM. > > > > > > > > Introduce an inhibition - APICV_INHIBIT_REASON_APICID_CHG to deactivate > > > > APICv once KVM guest would try to change APIC ID in xapic mode, and same > > > > sanity check in KVM_{SET,GET}_LAPIC for live migration. KVM will keep > > > > alive but obviously lose benefit from hardware acceleration in this way. > > > > > > > > So how do you think the proposal like this ? > > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > > > > index 6dcccb304775..30d825c069be 100644 > > > > --- a/arch/x86/include/asm/kvm_host.h > > > > +++ b/arch/x86/include/asm/kvm_host.h > > > > @@ -1046,6 +1046,7 @@ struct kvm_x86_msr_filter { > > > > #define APICV_INHIBIT_REASON_X2APIC 5 > > > > #define APICV_INHIBIT_REASON_BLOCKIRQ 6 > > > > #define APICV_INHIBIT_REASON_ABSENT 7 > > > > +#define APICV_INHIBIT_REASON_APICID_CHG 8 > > > > > > > > struct kvm_arch { > > > > unsigned long n_used_mmu_pages; > > > > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c > > > > index 22929b5b3f9b..66cd54fa4515 100644 > > > > --- a/arch/x86/kvm/lapic.c > > > > +++ b/arch/x86/kvm/lapic.c > > > > @@ -2044,10 +2044,19 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val) > > > > > > > > switch (reg) { > > > > case APIC_ID: /* Local APIC ID */ > > > > - if (!apic_x2apic_mode(apic)) > > > > - kvm_apic_set_xapic_id(apic, val >> 24); > > > > - else > > > > + if (apic_x2apic_mode(apic)) { > > > > ret = 1; > > > > + break; > > > > + } > > > > + /* > > > > + * If changing APIC ID with any APIC acceleration enabled, > > > > + * deactivate APICv to avoid unexpected issues. > > > > + */ > > > > + if (enable_apicv && (val >> 24) != apic->vcpu->vcpu_id) > > > > + kvm_request_apicv_update(apic->vcpu->kvm, > > > > + false, APICV_INHIBIT_REASON_APICID_CHG); > > > > + > > > > + kvm_apic_set_xapic_id(apic, val >> 24); > > > > break; > > > > > > > > case APIC_TASKPRI: > > > > @@ -2628,11 +2637,19 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu) > > > > static int kvm_apic_state_fixup(struct kvm_vcpu *vcpu, > > > > struct kvm_lapic_state *s, bool set) > > > > { > > > > - if (apic_x2apic_mode(vcpu->arch.apic)) { > > > > - u32 *id = (u32 *)(s->regs + APIC_ID); > > > > - u32 *ldr = (u32 *)(s->regs + APIC_LDR); > > > > - u64 icr; > > > > + u32 *id = (u32 *)(s->regs + APIC_ID); > > > > + u32 *ldr = (u32 *)(s->regs + APIC_LDR); > > > > + u64 icr; > > > > + if (!apic_x2apic_mode(vcpu->arch.apic)) { > > > > + /* > > > > + * If APIC ID changed with any APIC acceleration enabled, > > > > + * deactivate APICv to avoid unexpected issues. > > > > + */ > > > > + if (enable_apicv && (*id >> 24) != vcpu->vcpu_id) > > > > + kvm_request_apicv_update(vcpu->kvm, > > > > + false, APICV_INHIBIT_REASON_APICID_CHG); > > > > + } else { > > > > if (vcpu->kvm->arch.x2apic_format) { > > > > if (*id != vcpu->vcpu_id) > > > > return -EINVAL; > > > > diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c > > > > index 82d56f8055de..f78754bdc1d0 100644 > > > > --- a/arch/x86/kvm/svm/avic.c > > > > +++ b/arch/x86/kvm/svm/avic.c > > > > @@ -931,7 +931,8 @@ bool svm_check_apicv_inhibit_reasons(ulong bit) > > > > BIT(APICV_INHIBIT_REASON_IRQWIN) | > > > > BIT(APICV_INHIBIT_REASON_PIT_REINJ) | > > > > BIT(APICV_INHIBIT_REASON_X2APIC) | > > > > - BIT(APICV_INHIBIT_REASON_BLOCKIRQ); > > > > + BIT(APICV_INHIBIT_REASON_BLOCKIRQ) | > > > > + BIT(APICV_INHIBIT_REASON_APICID_CHG); > > > > > > > > return supported & BIT(bit); > > > > } > > > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > > > > index 7beba7a9f247..91265f0784bd 100644 > > > > --- a/arch/x86/kvm/vmx/vmx.c > > > > +++ b/arch/x86/kvm/vmx/vmx.c > > > > @@ -7751,7 +7751,8 @@ static bool vmx_check_apicv_inhibit_reasons(ulong bit) > > > > ulong supported = BIT(APICV_INHIBIT_REASON_DISABLE) | > > > > BIT(APICV_INHIBIT_REASON_ABSENT) | > > > > BIT(APICV_INHIBIT_REASON_HYPERV) | > > > > - BIT(APICV_INHIBIT_REASON_BLOCKIRQ); > > > > + BIT(APICV_INHIBIT_REASON_BLOCKIRQ) | > > > > + BIT(APICV_INHIBIT_REASON_APICID_CHG); > > > > > > > > return supported & BIT(bit); > > > > } > > > > > > > > > > > > > > > > > > This won't work with nested AVIC - we can't just inhibit a nested guest using its own AVIC, > > > because migration happens. > > > > I mean because host decided to change its apic id, which it can in theory do any time, > > even after the nested guest has started. Seriously, the only reason guest has to change apic id, > > is to try to exploit some security hole. > > Hi > > Thanks for the information. > > IIUC, you mean KVM applies APICv inhibition only to L1 VM, leaving APICv > enabled for L2 VM. Shouldn't KVM disable APICv for L2 VM in this case? > It looks like a generic issue in dynamically toggling APICv scheme, > e.g., qemu can set KVM_GUESTDBG_BLOCKIRQ after nested guest has started. > That is the problem - you can't disable it for L2, unless you are willing to emulate it in software. Or in other words, when nested guest uses a hardware feature, you can't at some point say to it: sorry buddy - hardware feature disappeared. It is *currently* not a problem for APICv because it doesn't do IPI virtualization, and even with these patches, it doesn't do this for nesting. It does become when you allow nested guest to use this which I did in the nested AVIC code. and writable apic ids do pose a large problem, since nested AVIC, will target L1's apic ids, and when they can change under you without any notice, and even worse be duplicate, it is just nightmare. About KVM_GUESTDBG_BLOCKIRQ - yes, but that is a best effort hack anyway, which not supposed to 100% work - running gdbstub with nested is broken in many ways anyway. Best regards, Maxim Levitsky