Andy Lutomirski <luto@xxxxxxxxxx> writes: > On Sat, Mar 7, 2020 at 2:01 AM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote: >> > What’s the local_irq_disable() here for? I would believe a >> > lockdep_assert_irqs_disabled() somewhere in here would make sense. >> > (Yes, I see you copied this from the old code. It’s still nonsense.) >> >> native_safe_halt() does: >> >> STI >> HLT >> >> So the irq disable is required as the loop should exit with interrupts >> disabled. > > Oops, should have looked at what native_safe_halt() does. > >> >> > I also find it truly bizarre that hlt actually works in this context. >> > Does KVM in fact wake a HLTed guest that HLTed with IRQs off when a >> > pending async pf is satisfied? This would make sense if the wake >> > event were an interrupt, but it’s not according to Paolo. >> >> See above. safe halt enables interrupts, which means IF == 1 and the >> host sanity check for IF == 1 is satisfied. >> >> In fact, if e.g. some regular interrupt arrives before the page becomes >> available and the guest entered the halt loop because the fault happened >> inside a RCU read side critical section with preemption enabled, then >> the interrupt might wake up another task, set need resched and this >> other task can run. > > Now I'm confused again. Your patch is very careful not to schedule if > we're in an RCU read-side critical section, but the regular preemption > code (preempt_schedule_irq, etc) seems to be willing to schedule > inside an RCU read-side critical section. Why is the latter okay but > not the async pf case? Preemption is fine, but voluntary schedule not. voluntary schedule might end up in idle if this is the last runnable task. > Ignoring that, this still seems racy: > > STI > nested #PF telling us to wake up > #PF returns > HLT You will say Ooops, should have looked .... when I tell you that the above cannot happen. From the SDM: If IF = 0, maskable hardware interrupts remain inhibited on the instruction boundary following an execution of STI. Otherwise safe_halt would not work at all :) > doesn't this result in putting the CPU asleep for no good reason until > the next interrupt hits? No :) > >> > All this being said, the only remotely sane case is when regs->flags >> > has IF==1. Perhaps this code should actually do: >> > >> > WARN_ON(!(regs->flags & X86_EFLAGS_IF)); >> >> Yes, that want's to be somewhere early and also cover the async wake >> case. Neither wake nor wait can be injected when IF == 0. > > Sadly, wrmsr to turn off async pf will inject wakeups even if IF == 0. WHAT? That's fundamentally broken. Can you point me to the code in question? Thanks, tglx