On Sat, Mar 7, 2020 at 2:01 AM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>
> Andy Lutomirski <luto@xxxxxxxxxx> writes:
> >> On Mar 6, 2020, at 4:12 PM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> >> Aside of that the actual code is a convoluted one-fits-it-all swiss army
> >> knife. It is invoked from different places with different RCU constraints:
> >>
> >>  1) Host side:
> >>
> >>     vcpu_enter_guest()
> >>       kvm_x86_ops->handle_exit()
> >>         kvm_handle_page_fault()
> >>           kvm_async_pf_task_wait()
> >>
> >>     The invocation happens from fully preemptible context.
> >
> > I’m a bit baffled as to why the host uses this code at all instead of
> > just sleeping the old fashioned way when the guest takes a (host) page
> > fault. Oh well.
>
> If I can trust the crystal ball which I used to decode this maze then it
> actually makes sense.
>
> Async faults are faults which cannot be handled by the guest, i.e. the
> host either pulled a page away under the guest or did not populate it in
> the first place.
>
> So the reasoning is that if this happens the guest might be in a
> situation where it can schedule other tasks instead of being stopped
> completely by the host until the page arrives.
>
> Now you could argue that this mostly makes sense for CPL 0 faults, but
> there is definitely code in the kernel where it makes sense as well,
> e.g. exec. But of course as this is designed without a proper handshake
> there is no way for the hypervisor to figure out whether it makes sense
> or not.
>
> If the async fault cannot be delivered to the guest (async PF disabled,
> async PF only enabled for CPL 0, IF == 0) then the host utilizes the
> same data structure and wait mechanism. That really makes sense.
>
> The part which does not make sense in the current implementation is the
> kvm_async_pf_task_wait() trainwreck. A clear upfront separation of
> schedulable and non-schedulable wait mechanisms would have avoided all
> the RCU duct tape nonsense and also spared me the brain damage caused by
> reverse engineering this completely undocumented mess.
>
> >> +static void kvm_async_pf_task_wait_halt(u32 token)
> >> +{
> >> +        struct kvm_task_sleep_node n;
> >> +
> >> +        if (!kvm_async_pf_queue_task(token, true, &n))
> >> +                return;
> >>
> >>          for (;;) {
> >> -                if (!n.halted)
> >> -                        prepare_to_swait_exclusive(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
> >>                  if (hlist_unhashed(&n.link))
> >>                          break;
> >> +                /*
> >> +                 * No point in doing anything about RCU here. Any RCU read
> >> +                 * side critical section or RCU watching section can be
> >> +                 * interrupted by VMEXITs and the host is free to keep the
> >> +                 * vCPU scheduled out as long as it sees fit. This is not
> >> +                 * any different just because of the halt induced voluntary
> >> +                 * VMEXIT.
> >> +                 *
> >> +                 * Also the async page fault could have interrupted any RCU
> >> +                 * watching context, so invoking rcu_irq_exit()/enter()
> >> +                 * around this is not gaining anything.
> >> +                 */
> >> +                native_safe_halt();
> >> +                local_irq_disable();
> >
> > What’s the local_irq_disable() here for? I would believe a
> > lockdep_assert_irqs_disabled() somewhere in here would make sense.
> > (Yes, I see you copied this from the old code. It’s still nonsense.)
>
> native_safe_halt() does:
>
>         STI
>         HLT
>
> So the irq disable is required as the loop should exit with interrupts
> disabled.

Oops, should have looked at what native_safe_halt() does.
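For reference, a minimal sketch of what native_safe_halt() boils down to,
going purely by the STI/HLT description above (illustrative only, not the
verbatim kernel source; the helper name here is made up):

        static inline void sketch_safe_halt(void)
        {
                /*
                 * STI keeps interrupts masked for one more instruction
                 * (the STI shadow), so nothing can sneak in between STI
                 * and HLT; a pending interrupt instead wakes the HLT.
                 * Execution resumes here with IF == 1, which is why the
                 * wait loop above has to call local_irq_disable() before
                 * re-checking its exit condition.
                 */
                asm volatile("sti; hlt" : : : "memory");
        }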
> > I also find it truly bizarre that hlt actually works in this context.
> > Does KVM in fact wake a HLTed guest that HLTed with IRQs off when a
> > pending async pf is satisfied? This would make sense if the wake
> > event were an interrupt, but it’s not according to Paolo.
>
> See above. safe halt enables interrupts, which means IF == 1 and the
> host sanity check for IF == 1 is satisfied.
>
> In fact, if e.g. some regular interrupt arrives before the page becomes
> available and the guest entered the halt loop because the fault happened
> inside a RCU read side critical section with preemption enabled, then
> the interrupt might wake up another task, set need resched and this
> other task can run.

Now I'm confused again. Your patch is very careful not to schedule if
we're in an RCU read-side critical section, but the regular preemption
code (preempt_schedule_irq, etc) seems to be willing to schedule inside
an RCU read-side critical section. Why is the latter okay but not the
async pf case?

Ignoring that, this still seems racy:

        STI
        nested #PF telling us to wake up
        #PF returns
        HLT

Doesn't this result in putting the CPU asleep for no good reason until
the next interrupt hits?

> > All this being said, the only remotely sane case is when regs->flags
> > has IF==1. Perhaps this code should actually do:
> >
> >         WARN_ON(!(regs->flags & X86_EFLAGS_IF));
>
> Yes, that wants to be somewhere early and also cover the async wake
> case. Neither wake nor wait can be injected when IF == 0.

Sadly, wrmsr to turn off async pf will inject wakeups even if IF == 0.
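To make the suggested check concrete, a sketch of what such an early sanity
check could look like, covering both the wait and the wake path (helper name
and placement are hypothetical, kernel context assumed; per the last point
above, the wakeups injected by the wrmsr that disables async PF would still
need to be special-cased so they don't trip the warning):

        /*
         * Hypothetical helper: both the 'page not present' (wait) and the
         * 'page ready' (wake) notifications are only legitimate when the
         * interrupted guest context had IF set.
         */
        static bool kvm_async_pf_flags_sane(struct pt_regs *regs)
        {
                return !WARN_ON_ONCE(!(regs->flags & X86_EFLAGS_IF));
        }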