Re: [PATCH 0/4] KVM: nVMX: Fix pending event injection on nested scenarios

On 07/11/17 16:25, Paolo Bonzini wrote:
> On 05/11/2017 15:07, Liran Alon wrote:
>> Hi,
>>
>> This series of patches aims to fix multiple related issues with how
>> pending event injection works on nVMX.
>>
>> The first patch fixes a simple error in the return-value of
>> vmx_check_nested_events(). The next patch relies on it to correctly
>> determine when an immediate-exit is required from L2 to L1.
>>
>> The second patch fixes a critical bug which caused L1 to miss an IPI
>> in case it was sent when the destination CPU exited from L2 to L0 due
>> to event-delivery.
>>
>> The third patch removes a now redundant signaling of KVM_REQ_EVENT.
>> This actually masked the issue fixed in the previous patch.
>>
>> The fourth patch fixes an issue of not always syncing PIR to the L1
>> Virtual-APIC Page when KVM_REQ_EVENT is signaled. This patch relies
>> on vmx_check_nested_events() always being called when KVM_REQ_EVENT is
>> set, which is true since the second patch.

> With all the discussions on the other series, I didn't reply here.  I
> haven't commented yet because I want to see first of all whether we have
> coverage in kvm-unit-tests of the issue that the first two patches fix
> (i.e., does something break in kvm-unit-tests if I only apply patch 3).
I don't think there is a kvm-unit-test which covers this: looking in x86/unittests.cfg at all tests which have an "smp" parameter, it seems none of them relate to nVMX. Because the nature of the bug requires at least 2 L1 CPUs, I think it is not covered. I actually have a couple of additional patches that relate to race conditions in nVMX. I will send them soon.

I thought about writing a unit-test for this issue, but it is a complex race condition and every test I came up with would be probabilistic: you need an L1 IPI to be sent to the dest CPU exactly at the moment it exits from L2 to L0 due to event-delivery, and not for a reason L1 cares about (you want the VMExit not to be forwarded from L0 to L1 but rather to resume back into L2; for example, by making sure that nEPT has the L2 IDT descriptor unmapped while it is mapped correctly in the L1 EPT).

I would be happy for ideas on how to make such a unit-test non-probabilistic (besides repeating it for a large number of iterations in the hope that the chance of not reproducing is very small).

The unit-test I thought about goes something like this (a rough model of the CPU A / CPU B handshake is sketched after the steps below):
1. CPU A sends X L1 IPIs to CPU B. The IPI handler on CPU B will increment a variable which records the number of times the IPI handler actually ran on the dest CPU. Before sending the next IPI, CPU A will busy-loop until it sees the variable incremented (an L1 IPI ack).
2. CPU B will run a loop of Y iterations of the following:
a) Set up an L1 EPT that maps memory 1-to-1, including the memory that will be used for the L2 IDT.
b) Before entering L2, execute INVEPT on the L2 IDT memory such that the L0 nEPT removes its mapping for it (as is done for the shadow MMU when the "invlpg" instruction runs).
c) Enter L2 and execute an "int $x" instruction. This should exit from L2 to L0 due to event-delivery. After L2's interrupt handler returns, exit to L1 (by using vmcall) and repeat from step a.
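
To make the structure concrete, here is a minimal, runnable C11 model of just the CPU A / CPU B handshake, using two threads and an atomic flag in place of real vCPUs and APIC IPIs. Treat it as a sketch under stated assumptions: run_one_l2_iteration(), cpu_a(), cpu_b() and the counters are all names invented for this model, not kvm-unit-tests code.

/*
 * Toy model only: threads stand in for L1 CPUs, an atomic flag stands in
 * for the APIC IPI, and run_one_l2_iteration() is an empty placeholder
 * for steps a-c above.  This is NOT nVMX or kvm-unit-tests code.
 */
#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

#define NUM_IPIS    1000        /* X: IPIs sent by "CPU A"          */
#define NUM_L2_RUNS 1000        /* Y: L2 iterations run by "CPU B"  */

static atomic_int ipi_pending;  /* models an IPI in flight to CPU B */
static atomic_int ipi_acks;     /* incremented by the "IPI handler" */

/* Placeholder for: set up 1:1 L1 EPT, INVEPT the L2 IDT pages,
 * enter L2, "int $x", vmcall back to L1. */
static void run_one_l2_iteration(void) { }

static int cpu_a(void *arg)
{
    (void)arg;
    for (int i = 0; i < NUM_IPIS; i++) {
        atomic_store(&ipi_pending, 1);        /* "send IPI"   */
        while (atomic_load(&ipi_acks) <= i)   /* wait for ack */
            thrd_yield();
    }
    return 0;
}

static int cpu_b(void *arg)
{
    (void)arg;
    for (int i = 0; i < NUM_L2_RUNS; i++) {
        run_one_l2_iteration();
        if (atomic_exchange(&ipi_pending, 0)) /* "IPI handler" */
            atomic_fetch_add(&ipi_acks, 1);
    }
    /* Keep acking until CPU A has sent all of its IPIs. */
    while (atomic_load(&ipi_acks) < NUM_IPIS)
        if (atomic_exchange(&ipi_pending, 0))
            atomic_fetch_add(&ipi_acks, 1);
    return 0;
}

int main(void)
{
    thrd_t a, b;

    thrd_create(&a, cpu_a, NULL);
    thrd_create(&b, cpu_b, NULL);
    thrd_join(a, NULL);
    thrd_join(b, NULL);
    printf("ipi acks: %d\n", atomic_load(&ipi_acks));
    return 0;
}

In an actual unit-test the atomic flag would be a real APIC IPI plus its handler, and run_one_l2_iteration() would be the EPT/INVEPT/VMLAUNCH/"int $x" sequence from steps a-c; the probabilistic part is whether an IPI happens to land exactly during the event-delivery exit.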

> Also, I had some incomplete work that eliminates
> vmx_inject_page_fault_nested (which IMO is only papering over bugs,
> because anything it does should be covered in theory by
> nested_vmx_inject_exception_vmexit).  I'm curious if patches 1-2 help
> there too.
I'm probably missing something but I fail to see how patches 1-2 could help there. Can you explain?

> However, all this is going to be work for 4.16.
I am sorry for the long response, but I think a bit of context here will help the discussion a lot.

Oracle Ravello (which I work at) is a cloud provider running on top of other cloud providers. When we run on top of the Oracle OCI cloud, we run customer VMs using KVM nested virtualization. The issue at hand here was discovered from *real production* use-cases and was rare until we managed to find a simple single use-case which reproduced it ~100% of the time (I described the reproducing setup in the commit message in the hope that others can reproduce it as well). After the issue was reproduced, we were able to pinpoint the root-cause and verify that the fix works (as it reproduced ~100%). It was also clear from the KVM trace-pipe that this is indeed the issue.

In addition, in inject_pending_event() one can see a comment referring to https://lkml.org/lkml/2014/7/2/60. Following that thread, one can see that an attempt was made to set KVM_REQ_EVENT only when an L1 event was actually blocked, instead of whenever nested_run_pending=true. However, Bandan Das then noted that this change caused L1 to be stuck on smp_call_function_many(). In addition, looking at the bug the thread claims to fix (https://bugzilla.kernel.org/show_bug.cgi?id=72381), one can see in the attached L1 serial log the exact stack-traces that we saw in the issue I fixed: all L1 CPUs attempting to grab mmu_lock while one CPU is holding mmu_lock and is stuck waiting for an ACK on some IPI.

At the end of that patch discussion, it was decided to fix the issue (correctly, I must say) by adding an extra call to check_nested_events() in inject_pending_event() (a toy model of that ordering is sketched below). However, the original setting of KVM_REQ_EVENT on nested_run_pending=true remained, as no-one in the thread was able to pinpoint the root-cause of that bug.

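For readers less familiar with that code path, here is a small, self-contained C model of just the ordering being discussed: consult the nested-event check before injecting anything, and back off while the nested run is still pending. It is a toy sketch, not the actual arch/x86/kvm/x86.c logic; apart from inject_pending_event, check_nested_events and nested_run_pending (names taken from the discussion above), every identifier here is invented for illustration.

/* Toy model of the injection ordering discussed above; NOT KVM code. */
#include <stdbool.h>
#include <stdio.h>

#define EBUSY_RETRY (-1)            /* stands in for -EBUSY */

struct toy_vcpu {
    bool nested_run_pending;        /* L2 entry requested, not completed */
    bool l1_ipi_pending;            /* an event L1 wants to see          */
    bool in_guest_mode;             /* currently running L2              */
};

/*
 * Model of check_nested_events(): if an L1 event is pending while we are
 * in L2, synthesize an L2->L1 exit so L1 can handle it; if the nested run
 * is still pending, report "busy" so the caller retries later.
 */
static int check_nested_events(struct toy_vcpu *v)
{
    if (v->nested_run_pending)
        return EBUSY_RETRY;
    if (v->in_guest_mode && v->l1_ipi_pending)
        v->in_guest_mode = false;   /* emulated L2->L1 exit */
    return 0;
}

/* Model of inject_pending_event(): give L1 a chance first, then inject. */
static int inject_pending_event(struct toy_vcpu *v)
{
    int r = check_nested_events(v);
    if (r == EBUSY_RETRY)
        return r;                   /* caller retries after VM entry */
    if (!v->in_guest_mode && v->l1_ipi_pending) {
        v->l1_ipi_pending = false;  /* "inject" the IPI into L1 */
        printf("IPI delivered to L1\n");
    }
    return 0;
}

int main(void)
{
    struct toy_vcpu v = { .nested_run_pending = false,
                          .l1_ipi_pending = true,
                          .in_guest_mode = true };
    inject_pending_event(&v);       /* exits L2 to L1, delivers the IPI */
    return 0;
}
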
I believe that my commit solves exactly the bug mentioned there by Bandan, and therefore the third patch in the series now removes the setting of KVM_REQ_EVENT on nested_run_pending=true from the code.

Adding to CC the guys who worked on the above-mentioned patch. They probably have smart things to say :)
(+Wanpeng Li, +Bandan Das)

> Paolo
