On 07/11/17 16:25, Paolo Bonzini wrote:
On 05/11/2017 15:07, Liran Alon wrote:
Hi,
This series of patches aim to fix multiple related issues with how
pending event injection works on nVMX.
The first patch fixes a simple error in the return-value of
vmx_check_nested_events(). Next patch relies on it to correctly
determine when an immediate-exit is required from L2 to L1.
The second patch fixes a critical bug which caused L1 to miss an IPI
in case it was sent when destination CPU exited from L2 to L0 due to
event-delivery.
The third patch removes a now reduntant signaling of KVM_REQ_EVENT.
This actually masked the issue fixed in the previous patch.
The fourth patch fixes an issue of not always syncing PIR to L1
Virtual-APIC-Page when KVM_REQ_EVENT is signaled. This patch relies
on vmx_check_nested_events() always being called when KVM_REQ_EVENT is
set which is true since the second patch.
With all the discussions on the other series, I didn't reply here. I
haven't commented yet because I want to see first of all whether we have
coverage in kvm-unit-tests of the issue that the first two patches fix
(i.e., does something break in kvm-unit-tests if I only apply patch 3).
I don't think there is a kvm-unit-test which should cover this as just
by looking at x86/unittests.cfg at all tests which have a "smp"
parameter, it seems none of them relates to nVMX. Because the bug nature
requires at least 2 L1 CPUs, I think this is not covered.
I actually have a couple of additional patches that relates to
race-conditions in nVMX. I will send them soon.
I thought about writing a unit-test for this issue but it is a complex
race-condition and every test I thought about would be probabilistic. As
you will need a L1 IPI to be sent to dest CPU that is exactly at the
moment of exiting from L2 to L0 due to event-delivery but not because of
L1 (you want to avoid VMExit to be forwarded from L0 to L1 but rather
resume again to L2. For example, making sure that nEPT will have L2
IDT-descriptor unmapped but is mapped correctly in L1 EPT.)
I would be happy for ideas on how to make such a unit-test not
probabilistic (Besides repeating it large amount of iterations in hope
that chance not to reproduce is very small).
The unit-test I thought about goes something like this:
1. CPU A sends X L1 IPIs to CPU B. The IPI handler in CPU B will
increment a var which documents amount of times the IPI handler actually
got to dest CPU. Before each time CPU A send an IPI, it will busy-loop
to see var indeed incremented (L1 IPI Ack).
2. CPU B will run a loop of Y iterations of the following:
a) Setup L1 EPT that maps memory 1-to-1. Including memory that will be
used for L2 IDT.
b) Before entering L2, execute INVEPT on L2 IDT memory such that L0 nEPT
will remove it's mapping for it. (As done for Shadow MMU when running
"invlpg" instruction)
c) Enter L2 and perform an "int $x" instruction. This should exit from
L2 to L0 due to event-delivery. After L2's interrupt-handler returns,
exit to L1 (by using vmcall) and repeat from step a.
Also, I had some incomplete work that eliminates
vmx_inject_page_fault_nested (which IMO is only papering over bugs,
because anything it does should be covered in theory by
nested_vmx_inject_exception_vmexit). I'm curious if patches 1-2 help
there too.
I'm probably missing something but I fail to see how patches 1-2 could
help there. Can you explain?
However, all this is going to be work for 4.16.
I am sorry for the long respond but I think a bit of context here will
help discussion a lot.
Oracle Ravello (which I work at) is a cloud provider running on top of
cloud providers. When we run on top of Oracle OCI cloud, we run customer
VMs using KVM nested virtualization.
The issue at hand here was discovered from *real production* use-cases
and was rare until we managed to find a simple single use-case which
reproduced ~100% (I described the setup which reproduce on commit
message with hope that others can reproduce it as-well).
After issue was reproduced, we were able to pin-point the root-cause and
verify the fix works (as it reproduced ~100%). Also it was clear from
KVM trace-pipe that this is indeed the issue.
In addition, in inject_pending_event() one can see a comment referring
to: https://lkml.org/lkml/2014/7/2/60
Following the thread, one can see that an attempt was made to only set
KVM_REQ_EVENT when a L1 event was actually blocked instead of when
nested_run_pending=true. However, Bandan Das then notes this change
caused L1 to be stuck on smp_call_function_many(). In addition, looking
at the bug discussion the thread claims to fix
(https://bugzilla.kernel.org/show_bug.cgi?id=72381), one can see at the
attached L1 serial the exact stack-traces that we saw in the issue I
fixed: All L1 CPUs attempting to grab mmu_lock while one CPU is holding
the mmu_lock and is currently stuck waiting for ACK on some IPI.
At the end of patch-discussion, it was decided to fix the issue
(correctly I must say) by adding an extra call to check_nested_events()
in inject_pending_event(). However, the original setting of
KVM_REQ_EVENT on nested_run_pending=true remained as no-one in thread
was able to pin-point root-cause of that bug.
I believe that my commit solves exactly the bug mentioned there by
Bandan and therefore the third patch in series now removes the setting
of KVM_REQ_EVENT on nested_run_pending=true from code.
Adding to CC the guys who worked on the above mentioned patch. They
probably have smart things to say :)
(+Wanpeng Li, +Bandan Das)
Paolo