On 15/05/20 22:43, Sean Christopherson wrote:
> On Fri, May 15, 2020 at 09:18:07PM +0200, Paolo Bonzini wrote:
>> On 15/05/20 20:46, Sean Christopherson wrote:
>>> Why even bother using 'struct kvm_vcpu_pv_apf_data' for the #PF case? VMX
>>> only requires error_code[31:16]==0 and SVM doesn't vet it at all, i.e. we
>>> can (ab)use the error code to indicate an async #PF by setting it to an
>>> impossible value, e.g. 0xaaaa (a is for async!). That particular error code
>>> is even enforced by the SDM, which states:
>>
>> Possibly, but it's water under the bridge now.
>
> Why is that? I thought we were redoing the entire thing because the current
> ABI is unfixably broken? In other words, since the guest needs to change,
> why are we keeping any of the current async #PF pieces? E.g. why keep using
> #PF instead of usurping something like #NP?

Because that would be 3 ABIs to support instead of 2. The #PF solution
is only broken as long as you allow async PF from ring 0 (which wasn't
even true except for preemptable kernels) _and_ have NMIs that can
generate page faults. We also have the #PF vmexit part for nested
virtualization. This adds up and makes a quick fix for 'page not ready'
notifications not that quick.

However, interrupts for 'page ready' do have a bunch of advantages (more
control on what can be preempted by the notification, a saner check for
new page faults which is effectively a bug fix) so it makes sense to get
them in more quickly (probably 5.9 at this point due to the massive
cleanups that are being done around interrupt vectors).

>> And the #PF mechanism also has the problem with NMIs that happen before the
>> error code is read and page faults happening in the handler.
>
> Hrm, I think there's no unfixable problem except for a pathological
> #PF->NMI->#DB->#PF scenario. But it is a problem :-(

Yeah, that made no sense. But still I'm not sure the x86 maintainers
would like it.

The only possible issue with #VE is the re-entrancy at the end.
Andy proposed re-enabling it from an interrupt, but here is one solution
that can be done almost entirely from C.

The idea is to split the IST in two halves, and flip between them in the
TSS with an XOR operation on entry to the interrupt handler. This is
possible because there won't ever be more than two handlers active at
the same time. Unlike if you used SUB/ADD, with XOR you don't have to
restore the IST on exit: the two halves will take turns as the current
IST and there's no problematic window between the ADD and the IRET.

The pseudocode would be:

- on #VE entry
    xor 512 with the IST address in the TSS
    check if saved RSP comes from the IST
    if so:
      overwrite the saved flags/CS/SS in the "other" IST half with the
        current value of the registers
      overwrite the saved RSP in the "other" IST half with the address
        of the top of the IST itself
      overwrite the saved RIP in the "other" IST half with the address
        of a trampoline (see below)
    else:
      save the top 5 words of the IST somewhere
      do normal stuff

- the trampoline restores the 5 words at the top of the IST with five
  push instructions, and jumps back to the first instruction of the
  handler

Everything except the first step can even be done in C.

Here is an example. Assuming that on entry to the outer #VE the IST is
the "even" half, the outer #VE moves the IST to the "odd" half and the
return flags/CS/SS/RSP/RIP are saved.

After the reentrancy flag has been cleared, a nested #VE arrives and
runs within the "odd" half of the IST. The IST is moved back to the
"even" half and the return flags/CS/SS/RSP/RIP in the "even" half are
patched to point to the trampoline.

When we get back to the outer handler the reentrancy flag is not zero,
so even though the IST points to the current stack, reentrancy is
impossible and we go just fine through the few final instructions of
the handler.

On outer #VE exit, the IRET instruction jumps to the trampoline, with
RSP pointing at the top of the "even" half.
The return flags/CS/SS/RSP/RIP are restored, and everything restarts
from the beginning: the outer #VE moves the IST to the "odd" half, the
return flags/CS/SS/RSP/RIP are saved, the data for the nested #VE is
fished out of the virtualization exception area and so on.

Paolo
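
For illustration only, here is a minimal user-space sketch of the first
step of the pseudocode above (the XOR flip). It is not the real entry
code: struct fake_tss, ve_flip_ist() and the addresses are hypothetical
names and values, and it assumes the two stack-top values are laid out
so that they differ only in bit 9 (512). It just shows why no restore
is needed on exit: each entry toggles the pointer, so two nested
entries take turns on the two halves.

```c
#include <assert.h>
#include <stdint.h>

#define IST_HALF_SIZE 512u	/* the "xor 512" from the pseudocode */

/* Hypothetical stand-in for the #VE IST slot in the TSS. */
struct fake_tss {
	uint64_t ve_ist;	/* stack top the CPU would use on the next #VE */
};

/*
 * First step on #VE entry: point the *next* #VE at the other half.
 * Assumption: the two stack tops differ only in bit 9, so a single
 * XOR toggles between them and nothing has to be undone on exit.
 */
static inline void ve_flip_ist(struct fake_tss *tss)
{
	tss->ve_ist ^= IST_HALF_SIZE;
}

/* Outer entry followed by a nested entry: the halves alternate. */
static inline void ve_flip_selfcheck(void)
{
	struct fake_tss tss = { .ve_ist = 0x10000 + IST_HALF_SIZE };
	uint64_t outer_top = tss.ve_ist;	/* stack the outer #VE runs on */

	ve_flip_ist(&tss);			/* outer #VE entry */
	uint64_t nested_top = tss.ve_ist;	/* stack a nested #VE would get */
	assert(nested_top == (outer_top ^ IST_HALF_SIZE));

	ve_flip_ist(&tss);			/* nested #VE entry */
	assert(tss.ve_ist == outer_top);	/* back to the first half */
}
```

In a real kernel this step would have to happen in the entry assembly
before any C code runs, which is why the pseudocode singles it out as
the one step that cannot be done in C.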