Re: [PATCH 0/4] KVM: nVMX: Fix pending event injection on nested scenarios

On 07/11/17 16:25, Paolo Bonzini wrote:
> On 05/11/2017 15:07, Liran Alon wrote:
>> Hi,
>>
>> This series of patches aims to fix multiple related issues with how
>> pending event injection works on nVMX.
>>
>> The first patch fixes a simple error in the return-value of
>> vmx_check_nested_events(). The next patch relies on it to correctly
>> determine when an immediate-exit is required from L2 to L1.
>>
>> The second patch fixes a critical bug which caused L1 to miss an IPI
>> in case it was sent when the destination CPU exited from L2 to L0 due
>> to event-delivery.
>>
>> The third patch removes a now redundant signaling of KVM_REQ_EVENT.
>> This actually masked the issue fixed in the previous patch.
>>
>> The fourth patch fixes an issue of not always syncing PIR to the L1
>> Virtual-APIC Page when KVM_REQ_EVENT is signaled. This patch relies
>> on vmx_check_nested_events() always being called when KVM_REQ_EVENT is
>> set, which is true since the second patch.

> With all the discussions on the other series, I didn't reply here.  I
> haven't commented yet because I want to see first of all whether we have
> coverage in kvm-unit-tests of the issue that the first two patches fix
> (i.e., does something break in kvm-unit-tests if I only apply patch 3).
I don't think there is a kvm-unit-test which covers this: looking in x86/unittests.cfg at all tests which have an "smp" parameter, it seems none of them relate to nVMX. Because the nature of the bug requires at least 2 L1 CPUs, I think it is not covered. I actually have a couple of additional patches that relate to race conditions in nVMX. I will send them soon.

I thought about writing a unit-test for this issue, but it is a complex race condition and every test I came up with would be probabilistic: you need an L1 IPI to be sent to the dest CPU exactly at the moment it exits from L2 to L0 due to event-delivery, and not for a reason L1 cares about (you want the VMExit not to be forwarded from L0 to L1 but rather to resume back into L2; for example, by making sure that nEPT has the L2 IDT descriptor unmapped while it is mapped correctly in the L1 EPT).

I would be happy for ideas on how to make such a unit-test non-probabilistic (besides repeating it for a large number of iterations in the hope that the chance of not reproducing is very small).

The unit-test I thought about goes something like this (a rough model of the CPU A / CPU B handshake is sketched after the steps below):
1. CPU A sends X L1 IPIs to CPU B. The IPI handler on CPU B will increment a variable which records the number of times the IPI handler actually ran on the dest CPU. Before sending the next IPI, CPU A will busy-loop until it sees the variable incremented (an L1 IPI ack).
2. CPU B will run a loop of Y iterations of the following:
a) Set up an L1 EPT that maps memory 1-to-1, including the memory that will be used for the L2 IDT.
b) Before entering L2, execute INVEPT on the L2 IDT memory such that the L0 nEPT removes its mapping for it (as is done for the shadow MMU when the "invlpg" instruction runs).
c) Enter L2 and execute an "int $x" instruction. This should exit from L2 to L0 due to event-delivery. After L2's interrupt handler returns, exit to L1 (by using vmcall) and repeat from step a.
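
To make the structure concrete, here is a minimal, runnable C11 model of just the CPU A / CPU B handshake, using two threads and an atomic flag in place of real vCPUs and APIC IPIs. Treat it as a sketch under stated assumptions: run_one_l2_iteration(), cpu_a(), cpu_b() and the counters are all names invented for this model, not kvm-unit-tests code.

/*
 * Toy model only: threads stand in for L1 CPUs, an atomic flag stands in
 * for the APIC IPI, and run_one_l2_iteration() is an empty placeholder
 * for steps a-c above.  This is NOT nVMX or kvm-unit-tests code.
 */
#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

#define NUM_IPIS    1000        /* X: IPIs sent by "CPU A"          */
#define NUM_L2_RUNS 1000        /* Y: L2 iterations run by "CPU B"  */

static atomic_int ipi_pending;  /* models an IPI in flight to CPU B */
static atomic_int ipi_acks;     /* incremented by the "IPI handler" */

/* Placeholder for: set up 1:1 L1 EPT, INVEPT the L2 IDT pages,
 * enter L2, "int $x", vmcall back to L1. */
static void run_one_l2_iteration(void) { }

static int cpu_a(void *arg)
{
    (void)arg;
    for (int i = 0; i < NUM_IPIS; i++) {
        atomic_store(&ipi_pending, 1);        /* "send IPI"   */
        while (atomic_load(&ipi_acks) <= i)   /* wait for ack */
            thrd_yield();
    }
    return 0;
}

static int cpu_b(void *arg)
{
    (void)arg;
    for (int i = 0; i < NUM_L2_RUNS; i++) {
        run_one_l2_iteration();
        if (atomic_exchange(&ipi_pending, 0)) /* "IPI handler" */
            atomic_fetch_add(&ipi_acks, 1);
    }
    /* Keep acking until CPU A has sent all of its IPIs. */
    while (atomic_load(&ipi_acks) < NUM_IPIS)
        if (atomic_exchange(&ipi_pending, 0))
            atomic_fetch_add(&ipi_acks, 1);
    return 0;
}

int main(void)
{
    thrd_t a, b;

    thrd_create(&a, cpu_a, NULL);
    thrd_create(&b, cpu_b, NULL);
    thrd_join(a, NULL);
    thrd_join(b, NULL);
    printf("ipi acks: %d\n", atomic_load(&ipi_acks));
    return 0;
}

In an actual unit-test the atomic flag would be a real APIC IPI plus its handler, and run_one_l2_iteration() would be the EPT/INVEPT/VMLAUNCH/"int $x" sequence from steps a-c; the probabilistic part is whether an IPI happens to land exactly during the event-delivery exit.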

> Also, I had some incomplete work that eliminates
> vmx_inject_page_fault_nested (which IMO is only papering over bugs,
> because anything it does should be covered in theory by
> nested_vmx_inject_exception_vmexit).  I'm curious if patches 1-2 help
> there too.
I'm probably missing something but I fail to see how patches 1-2 could help there. Can you explain?

> However, all this is going to be work for 4.16.
I am sorry for the long response, but I think a bit of context here will help the discussion a lot.

Oracle Ravello (which I work at) is a cloud provider running on top of other cloud providers. When we run on top of the Oracle OCI cloud, we run customer VMs using KVM nested virtualization. The issue at hand here was discovered from *real production* use-cases and was rare until we managed to find a simple single use-case which reproduced it ~100% of the time (I described the reproducing setup in the commit message in the hope that others can reproduce it as well). After the issue was reproduced, we were able to pinpoint the root-cause and verify that the fix works (as it reproduced ~100%). It was also clear from the KVM trace-pipe that this is indeed the issue.

In addition, in inject_pending_event() one can see a comment referring to https://lkml.org/lkml/2014/7/2/60. Following that thread, one can see that an attempt was made to set KVM_REQ_EVENT only when an L1 event was actually blocked, instead of whenever nested_run_pending=true. However, Bandan Das then noted that this change caused L1 to be stuck on smp_call_function_many(). In addition, looking at the bug the thread claims to fix (https://bugzilla.kernel.org/show_bug.cgi?id=72381), one can see in the attached L1 serial log the exact stack-traces that we saw in the issue I fixed: all L1 CPUs attempting to grab mmu_lock while one CPU is holding mmu_lock and is stuck waiting for an ACK on some IPI.

At the end of that patch discussion, it was decided to fix the issue (correctly, I must say) by adding an extra call to check_nested_events() in inject_pending_event() (a toy model of that ordering is sketched below). However, the original setting of KVM_REQ_EVENT on nested_run_pending=true remained, as no-one in the thread was able to pinpoint the root-cause of that bug.

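For readers less familiar with that code path, here is a small, self-contained C model of just the ordering being discussed: consult the nested-event check before injecting anything, and back off while the nested run is still pending. It is a toy sketch, not the actual arch/x86/kvm/x86.c logic; apart from inject_pending_event, check_nested_events and nested_run_pending (names taken from the discussion above), every identifier here is invented for illustration.

/* Toy model of the injection ordering discussed above; NOT KVM code. */
#include <stdbool.h>
#include <stdio.h>

#define EBUSY_RETRY (-1)            /* stands in for -EBUSY */

struct toy_vcpu {
    bool nested_run_pending;        /* L2 entry requested, not completed */
    bool l1_ipi_pending;            /* an event L1 wants to see          */
    bool in_guest_mode;             /* currently running L2              */
};

/*
 * Model of check_nested_events(): if an L1 event is pending while we are
 * in L2, synthesize an L2->L1 exit so L1 can handle it; if the nested run
 * is still pending, report "busy" so the caller retries later.
 */
static int check_nested_events(struct toy_vcpu *v)
{
    if (v->nested_run_pending)
        return EBUSY_RETRY;
    if (v->in_guest_mode && v->l1_ipi_pending)
        v->in_guest_mode = false;   /* emulated L2->L1 exit */
    return 0;
}

/* Model of inject_pending_event(): give L1 a chance first, then inject. */
static int inject_pending_event(struct toy_vcpu *v)
{
    int r = check_nested_events(v);
    if (r == EBUSY_RETRY)
        return r;                   /* caller retries after VM entry */
    if (!v->in_guest_mode && v->l1_ipi_pending) {
        v->l1_ipi_pending = false;  /* "inject" the IPI into L1 */
        printf("IPI delivered to L1\n");
    }
    return 0;
}

int main(void)
{
    struct toy_vcpu v = { .nested_run_pending = false,
                          .l1_ipi_pending = true,
                          .in_guest_mode = true };
    inject_pending_event(&v);       /* exits L2 to L1, delivers the IPI */
    return 0;
}
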
I believe that my commit solves exactly the bug mentioned there by Bandan, and therefore the third patch in the series now removes the setting of KVM_REQ_EVENT on nested_run_pending=true from the code.

Adding to CC the guys who worked on the above-mentioned patch. They probably have smart things to say :)
(+Wanpeng Li, +Bandan Das)

> Paolo
