> > > > > > > But what about NMIs, afaict this is all horribly broken for NMIs. > > > > > > > So the whole VMX thing latches the NMI (which stops NMI > > > > > > > recursion), right? > > > > > > > > > > > > > > But then you drop out of noinstr code, which means any > > > > > > > random exception can happen (kprobes #BP, hw_breakpoint #DB, > > > > > > > or even #PF due to random nonsense like *SAN). This > > > > > > > exception will do IRET and clear the NMI latch, all before > > > > > > > you get to run any of the NMI code. > > > > > > > > > > > > What you said here implies that we have this problem in the existing > code. > > > > > > Because a fake iret stack is created to call the NMI handler > > > > > > in the IDT NMI descriptor, which lastly executes the IRET instruction. > > > > > > > > > > I can't follow; of course the IDT handler terminates with IRET, it has to > no? > > > > > > > > > > And yes, the current code appears to suffer the same defect. > > That defect isn't going to be fixed simply by changing how KVM forwards NMIs > though. IIUC, _everything_ between VM-Exit and the invocation of the NMI > handler needs to be noinstr. On VM-Exit due to NMI, NMIs are blocked. If a > #BP/#DB/#PF occurs before KVM gets to kvm_x86_handle_exit_irqoff(), the > subsequent IRET will unblock NMIs before the original NMI is serviced, i.e. a > second NMI could come in at anytime regardless of how KVM forwards the NMI > to the kernel. > > Is there any way to solve this without tagging everything noinstr? There is a > metric shit ton of code between VM-Exit and the handling of NMIs, and much > of that code is common helpers. It might be possible to hoist NMI handler > much earlier, though we'd need to do a super thorough audit to ensure all > necessary host state is restored. As NMI is the only vector with this potential issue, it sounds a good idea to only promote its handling. > > > > With FRED, ERETS/ERETU replace IRET, and use bit 28 of the popped > > > > CS field to control whether to unblock NMI. If bit 28 of the field > > > > (above the selector) is 1, ERETS/ERETU unblocks NMIs. > > Side topic, there's a bug in the ISE docs. Section "9.4.2 NMI Blocking" states > that bit 16 holds the "unblock NMI" magic, which I'm guessing is a holdover > from an earlier revision of FRED. Good catch, the latest 3.0 draft spec changed it to bit 28, but section 9.4.2 didn't get a proper update. > > As specified in Section 6.1.3 and Section 6.2.3, ERETS and ERETU each > unblocks NMIs > if bit 16 of the popped CS field is 1. The following items detail how this > behavior may be > changed in VMX non-root operation, depending on the settings of certain VM- > execution > controls: > > > > Yes, I know that. It is one of the many improvements FRED brings. > > > Ideally the IBT WAIT-FOR-ENDBR state also gets squirreled away in > > > the hardware exception frame, but that's still up in the air I > > > believe :/ > > > > > > Anyway.. given there is interrupt priority and NMI is pretty much on > > > top of everything else the reinject crap *should* run NMI first. > > > That way NMI runs with the latch disabled and whatever other pending > interrupts will run later. > > > > > > But that all is still broken because afaict the current code also > > > leaves noinstr -- and once you leave noinstr (or use a static_key, > > > static_call or anything else that > > > *can* change at runtime) you can't guarantee nothing. > > > > For NMI, HPA asked me to use "int $2", as it switches to the NMI IST > > stack to execute the NMI handler, essentially like how HW deals with a > > NMI in host. and I tested it with NMI watchdog, it looks working fine. > > Heh, well yeah, because that's how KVM used to handle NMIs back before I > reworked NMI handling to use the direct call method. Ironically, that original > change was done in part to try and make it _easier_ to deal with FRED (back > before FRED was publicly disclosed). > > If KVM reverts to INTn, the fix to route KVM=>NMI through the non-IST entry > can be reverted too. > > a217a6593cec ("KVM/VMX: Invoke NMI non-IST entry instead of IST entry") > 1a5488ef0dcf ("KVM: VMX: Invoke NMI handler via indirect call instead of > INTn") Sure, I'm just thinking to put asm("int $2") in a new function exc_raise_nmi() defined in arch/x86/kernel/traps.c. Thus we will need to change it only whenever we have any better facility, and no KVM VMX change required.