2017-06-21 0:12 GMT+08:00 Radim Krčmář <rkrcmar@xxxxxxxxxx>:
> 2017-06-20 05:47+0800, Wanpeng Li:
>> 2017-06-19 22:51 GMT+08:00 Radim Krčmář <rkrcmar@xxxxxxxxxx>:
>> > 2017-06-17 13:52+0800, Wanpeng Li:
>> >> 2017-06-16 23:38 GMT+08:00 Radim Krčmář <rkrcmar@xxxxxxxxxx>:
>> >> > 2017-06-16 22:24+0800, Wanpeng Li:
>> >> >> 2017-06-16 21:37 GMT+08:00 Radim Krčmář <rkrcmar@xxxxxxxxxx>:
>> >> >> > 2017-06-14 19:26-0700, Wanpeng Li:
>> >> >> >> From: Wanpeng Li <wanpeng.li@xxxxxxxxxxx>
>> >> >> >>
>> >> >> >> Add an async_page_fault field to vcpu->arch.exception to identify an async
>> >> >> >> page fault, and construct the expected VM-exit information fields. Force
>> >> >> >> a nested VM exit from nested_vmx_check_exception() if the injected #PF
>> >> >> >> is an async page fault.
>> >> >> >>
>> >> >> >> Cc: Paolo Bonzini <pbonzini@xxxxxxxxxx>
>> >> >> >> Cc: Radim Krčmář <rkrcmar@xxxxxxxxxx>
>> >> >> >> Signed-off-by: Wanpeng Li <wanpeng.li@xxxxxxxxxxx>
>> >> >> >> ---
>> >> >> >> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> >> >> >> @@ -452,7 +452,11 @@ EXPORT_SYMBOL_GPL(kvm_complete_insn_gp);
>> >> >> >>  void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault)
>> >> >> >>  {
>> >> >> >>  	++vcpu->stat.pf_guest;
>> >> >> >> -	vcpu->arch.cr2 = fault->address;
>> >> >> >> +	vcpu->arch.exception.async_page_fault = fault->async_page_fault;
>> >> >> >
>> >> >> > I think we need to act as if arch.exception.async_page_fault was not
>> >> >> > pending in kvm_vcpu_ioctl_x86_get_vcpu_events(). Otherwise, if we
>> >> >> > migrate with a pending async_page_fault exception, we'd inject it as a
>> >> >> > normal #PF, which could confuse/kill the nested guest.
>> >> >> >
>> >> >> > And kvm_vcpu_ioctl_x86_set_vcpu_events() should clear the flag for
>> >> >> > sanity as well.
>> >> >>
>> >> >> Do you mean we should add a field like async_page_fault to
>> >> >> kvm_vcpu_events::exception, then save arch.exception.async_page_fault
>> >> >> to events->exception.async_page_fault through KVM_GET_VCPU_EVENTS and
>> >> >> restore events->exception.async_page_fault to
>> >> >> arch.exception.async_page_fault through KVM_SET_VCPU_EVENTS?
>> >> >
>> >> > No, I thought we could get away with a disgusting hack of hiding the
>> >> > exception from userspace, which would work for migration, but not if
>> >> > local userspace did KVM_GET_VCPU_EVENTS and KVM_SET_VCPU_EVENTS ...
>> >> >
>> >> > Extending the userspace interface would work, but I'd do it as a last
>> >> > resort, after all conservative solutions have failed.
>> >> > async_pf migration is very crude, so exposing the exception is just an
>> >> > ugly workaround for the local case. Adding the flag would also require
>> >> > userspace configuration of async_pf features for the guest to keep
>> >> > compatibility.
>> >> >
>> >> > I see two options that might be simpler than adding the userspace flag:
>> >> >
>> >> > 1) do the nested VM exit sooner, at the place where we now queue the #PF,
>> >> > 2) queue the #PF later, save the async_pf in some intermediate
>> >> >    structure and consume it at the place where you proposed the nested
>> >> >    VM exit.
>> >>
>> >> How about something like this: don't report exception events when
>> >> "is_guest_mode(vcpu) && vcpu->arch.exception.nr == PF_VECTOR &&
>> >> vcpu->arch.exception.async_page_fault" holds, since losing a
>> >> reschedule optimization is not that important in L1.
>> >>
>> >> @@ -3072,13 +3074,16 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
>> >>  					       struct kvm_vcpu_events *events)
>> >>  {
>> >>  	process_nmi(vcpu);
>> >> -	events->exception.injected =
>> >> -		vcpu->arch.exception.pending &&
>> >> -		!kvm_exception_is_soft(vcpu->arch.exception.nr);
>> >> -	events->exception.nr = vcpu->arch.exception.nr;
>> >> -	events->exception.has_error_code = vcpu->arch.exception.has_error_code;
>> >> -	events->exception.pad = 0;
>> >> -	events->exception.error_code = vcpu->arch.exception.error_code;
>> >> +	if (!(is_guest_mode(vcpu) && vcpu->arch.exception.nr == PF_VECTOR &&
>> >> +	      vcpu->arch.exception.async_page_fault)) {
>> >> +		events->exception.injected =
>> >> +			vcpu->arch.exception.pending &&
>> >> +			!kvm_exception_is_soft(vcpu->arch.exception.nr);
>> >> +		events->exception.nr = vcpu->arch.exception.nr;
>> >> +		events->exception.has_error_code = vcpu->arch.exception.has_error_code;
>> >> +		events->exception.pad = 0;
>> >> +		events->exception.error_code = vcpu->arch.exception.error_code;
>> >> +	}
>> >
>> > This adds a bug when userspace does KVM_GET_VCPU_EVENTS and
>> > KVM_SET_VCPU_EVENTS without migration -- KVM would drop the async_pf
>> > and an L1 process would get stuck as a result.
>> >
>> > We'd need to add a similar condition to
>> > kvm_vcpu_ioctl_x86_set_vcpu_events(), so userspace SET doesn't drop it,
>> > but that is far beyond the realm of acceptable code.
>>
>> Do you mean the current state of the patchset v2 can be accepted?
>> Otherwise, what should be done next?
>
> No, sorry, that one has the migration bug (the async_page_fault gets
> dropped on the destination).
>
> You proposed to add the flag to the userspace interface, which is a
> sound solution. I was asking to look for a different one, because the
> flag is a work-around for an implementation detail, which isn't a good
> thing to put into a userspace interface ...
>
> Still, I looked at the early VM exit (1) and it doesn't fit well into
> SVM's model of a single nested VM exit location, so it's out.
>
> The remaining contender is to add a paravirtualized event for apf and
> only convert it into a nested VM exit or #PF in inject_pending_event().
> The end result would likely be a slightly better version of the
> exception flag ...
>
> I guess that doing a prototype of the userspace interface extension is
> a good follow-up.

Yeah, I just did this in patch 3/4 v3 and a companion qemu patch. Please
have a look. :)

Regards,
Wanpeng Li
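
---

A minimal sketch of the "remaining contender" Radim describes above: keep
the async_pf as a paravirtualized event on the queued exception and only
decide between a nested VM exit and an ordinary #PF at injection time.
The async_page_fault flag is the one proposed in this thread, and
nested_apf_vmexit() is a hypothetical stand-in for the vendor-specific
nested-exit path; this illustrates the idea, it is not the actual
patch 3/4 v3.

/*
 * Sketch only.  Assumes vcpu->arch.exception carries the
 * async_page_fault flag discussed in this thread; nested_apf_vmexit()
 * is a hypothetical helper that builds the expected VM-exit
 * information fields and performs the nested exit.
 */
static void deliver_queued_exception(struct kvm_vcpu *vcpu)
{
	struct kvm_queued_exception *ex = &vcpu->arch.exception;

	if (is_guest_mode(vcpu) && ex->nr == PF_VECTOR &&
	    ex->async_page_fault) {
		/*
		 * Consume the pv event here as a nested VM exit, so a
		 * pending async_pf never becomes visible through
		 * KVM_GET_VCPU_EVENTS as a plain #PF and cannot be
		 * dropped or misinjected across migration.
		 */
		ex->pending = false;
		nested_apf_vmexit(vcpu, ex->error_code);
		return;
	}

	/* Normal path: inject the queued exception unchanged. */
	kvm_x86_ops->queue_exception(vcpu, ex->nr, ex->has_error_code,
				     ex->error_code, ex->reinject);
}

Converting at a single injection point also keeps SVM's single nested
VM exit location intact, which is why the early-exit option (1) was
ruled out above.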