Andy Lutomirski <luto@xxxxxxxxxx> writes:
>> On Mar 6, 2020, at 4:12 PM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>> Aside of that the actual code is a convoluted one-fits-it-all swiss army
>> knife. It is invoked from different places with different RCU
>> constraints:
>>
>>    1) Host side:
>>
>>       vcpu_enter_guest()
>>         kvm_x86_ops->handle_exit()
>>           kvm_handle_page_fault()
>>             kvm_async_pf_task_wait()
>>
>>       The invocation happens from fully preemptible context.
>
> I’m a bit baffled as to why the host uses this code at all instead of
> just sleeping the old fashioned way when the guest takes a (host) page
> fault. Oh well.

If I can trust the crystal ball which I used to decode this maze then it
actually makes sense. Async faults are faults which cannot be handled by
the guest, i.e. the host either pulled a page away under the guest or
did not populate it in the first place.

So the reasoning is that if this happens the guest might be in a
situation where it can schedule other tasks instead of being stopped
completely by the host until the page arrives.

Now you could argue that this mostly makes sense for CPL 0 faults, but
there is definitely code in the kernel where it makes sense as well,
e.g. exec. But of course, as this is designed without a proper
handshake, there is no way for the hypervisor to figure out whether it
makes sense or not.

If the async fault cannot be delivered to the guest (async PF disabled,
async PF only enabled for CPL 0, IF == 0) then the host utilizes the
same data structure and wait mechanism. That really makes sense.

The part which does not make sense in the current implementation is the
kvm_async_pf_task_wait() trainwreck. A clear upfront separation of
schedulable and non-schedulable wait mechanisms would have avoided all
the RCU duct tape nonsense and also spared me the brain damage caused by
reverse engineering this completely undocumented mess.

>> +static void kvm_async_pf_task_wait_halt(u32 token)
>> +{
>> +	struct kvm_task_sleep_node n;
>> +
>> +	if (!kvm_async_pf_queue_task(token, true, &n))
>> +		return;
>>
>>  	for (;;) {
>> -		if (!n.halted)
>> -			prepare_to_swait_exclusive(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
>>  		if (hlist_unhashed(&n.link))
>>  			break;
>> +		/*
>> +		 * No point in doing anything about RCU here. Any RCU read
>> +		 * side critical section or RCU watching section can be
>> +		 * interrupted by VMEXITs and the host is free to keep the
>> +		 * vCPU scheduled out as long as it sees fit. This is not
>> +		 * any different just because of the halt induced voluntary
>> +		 * VMEXIT.
>> +		 *
>> +		 * Also the async page fault could have interrupted any RCU
>> +		 * watching context, so invoking rcu_irq_exit()/enter()
>> +		 * around this is not gaining anything.
>> +		 */
>> +		native_safe_halt();
>> +		local_irq_disable();
>
> What’s the local_irq_disable() here for? I would believe a
> lockdep_assert_irqs_disabled() somewhere in here would make sense.
> (Yes, I see you copied this from the old code. It’s still nonsense.)

native_safe_halt() does:

	STI
	HLT

So the irq disable is required because the loop should exit with
interrupts disabled.
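To spell out the interrupt flag discipline, the halt based wait boils
down to something like the sketch below. page_is_ready() is just a
made-up placeholder for the hlist_unhashed() check on the sleep node;
this is an illustration, not the actual patch.

	static void async_pf_halt_wait_sketch(void)
	{
		/* The async #PF handler runs with interrupts disabled. */
		lockdep_assert_irqs_disabled();

		for (;;) {
			if (page_is_ready())
				break;
			/*
			 * STI; HLT - wait with interrupts enabled so the
			 * host can deliver the wake event or any other
			 * interrupt.
			 */
			native_safe_halt();
			/*
			 * Disable interrupts again so the loop condition
			 * is re-evaluated and the function returns with
			 * the same interrupt state it was entered with.
			 */
			local_irq_disable();
		}
	}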
> I also find it truly bizarre that hlt actually works in this context.
> Does KVM in fact wake a HLTed guest that HLTed with IRQs off when a
> pending async pf is satisfied? This would make sense if the wake
> event were an interrupt, but it’s not according to Paolo.

See above. safe halt enables interrupts, which means IF == 1 and the
host sanity check for IF == 1 is satisfied.

In fact, if e.g. some regular interrupt arrives before the page becomes
available and the guest entered the halt loop because the fault happened
inside a RCU read side critical section with preemption enabled, then
the interrupt might wake up another task, set need_resched and this
other task can run. At some point the halt waiting task gets back into
the loop and either finds the page ready or goes back into halt. If the
fault hit a preempt disabled region then the interrupt and any resulting
soft interrupts can still be handled.

Both scenarios are correct once you actually manage to mentally
disconnect regular #PF and async #PF completely. Of course the
overloading of regular #PF, the utter lack of documentation and the
crappy and duct taped implementation make this a really mind boggling
exercise.

> All this being said, the only remotely sane case is when regs->flags
> has IF==1. Perhaps this code should actually do:
>
>	WARN_ON(!(regs->flags & X86_EFLAGS_IF));

Yes, that wants to be somewhere early and also cover the async wake
case. Neither wake nor wait can be injected when IF == 0.

>	while (the page isn’t ready) {
>		local_irq_enable();
>		native_safe_halt();
>		local_irq_disable();
>	}
>
> with some provision to survive the case where the warning fires so we
> at least get logs.

I don't think that any attempt to survive an async #PF injection into an
interrupt disabled region makes sense aside of looking smart and being
incomprehensible voodoo. If this ever happens then the host side is
completely buggered and all we can do is warn and pray or warn and die
hard. My personal preference is to warn and die hard.

> In any event, I just sent a patch to disable async page faults that
> happen in kernel mode.

I don't think it's justified. The host side really makes sure that the
guest does have IF == 1 before injecting anything which is not an NMI.
If the guest enables interrupts at the wrong place then this is really
not the host's problem.

Having a warning in that async PF entry for the IF == 0 injection case
is good enough IMO.

Thanks,

	tglx
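P.S.: For illustration only, the kind of early sanity check discussed
above could look roughly like the sketch below. The function name is
made up and the hard-die reaction merely reflects my preference; this is
not the actual patch.

	/*
	 * Early sanity check in the async #PF entry path. The host must
	 * never inject a wait or wake event while the guest has
	 * interrupts disabled. If it does anyway, the host side is
	 * terminally broken, so log a splat and die hard.
	 */
	static void async_pf_check_if_sketch(struct pt_regs *regs)
	{
		if (WARN_ON_ONCE(!(regs->flags & X86_EFLAGS_IF)))
			panic("Async #PF injected with IF == 0");
	}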