On Fri, 2021-12-17 at 13:21 +0000, Mark Rutland wrote:
> On Fri, Dec 17, 2021 at 12:51:57PM +0100, Nicolas Saenz Julienne wrote:
> > Hi All,
>
> Hi,
>
> > arm64's guest entry code does the following:
> >
> > int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> > {
> >         [...]
> >
> >         guest_enter_irqoff();
> >
> >         ret = kvm_call_hyp_ret(__kvm_vcpu_run, vcpu);
> >
> >         [...]
> >
> >         local_irq_enable();
> >
> >         /*
> >          * We do local_irq_enable() before calling guest_exit() so
> >          * that if a timer interrupt hits while running the guest we
> >          * account that tick as being spent in the guest. We enable
> >          * preemption after calling guest_exit() so that if we get
> >          * preempted we make sure ticks after that is not counted as
> >          * guest time.
> >          */
> >         guest_exit();
> >         [...]
> > }
> >
> >
> > On a nohz-full CPU, guest_{enter,exit}() delimit an RCU extended quiescent
> > state (EQS). Any interrupt happening between local_irq_enable() and
> > guest_exit() should disable that EQS. Now, AFAICT all el0 interrupt handlers
> > do the right thing if triggered in this context, but el1's won't. Is it
> > possible to hit an el1 handler (for example __el1_irq()) there?
>
> I think you're right that the EL1 handlers can trigger here and won't exit the
> EQS.
>
> I'm not immediately sure what we *should* do here. What does x86 do for an IRQ
> taken from a guest mode? I couldn't spot any handling of that case, but I'm not
> familiar enough with the x86 exception model to know if I'm looking in the
> right place.

Well, x86 has its own private KVM guest context exit function,
'kvm_guest_exit_irqoff()', which allows it to do the right thing (simplifying
things):

        local_irq_disable();
        kvm_guest_enter_irqoff() // Inform CT, enter EQS
        __vmx_kvm_run()
        kvm_guest_exit_irqoff() // Inform CT, exit EQS, task still marked with PF_VCPU

        /*
         * Consume any pending interrupts, including the possible source of
         * VM-Exit on SVM and any ticks that occur between VM-Exit and now.
         * An instruction is required after local_irq_enable() to fully unblock
         * interrupts on processors that implement an interrupt shadow, the
         * stat.exits increment will do nicely.
         */
        local_irq_enable();
        ++vcpu->stat.exits;
        local_irq_disable();

        /*
         * Wait until after servicing IRQs to account guest time so that any
         * ticks that occurred while running the guest are properly accounted
         * to the guest. Waiting until IRQs are enabled degrades the accuracy
         * of accounting via context tracking, but the loss of accuracy is
         * acceptable for all known use cases.
         */
        vtime_account_guest_exit(); // current->flags &= ~PF_VCPU

So I guess we should convert to x86's scheme, and maybe create another generic
guest_{enter,exit}() flavor for virtualization schemes that run with interrupts
disabled.

> Note that the EL0 handlers *cannot* trigger for an exception taken from a
> guest. We use separate vectors while running a guest (for both VHE and nVHE
> modes), and from the main kernel's PoV we return from kvm_call_hyp_ret(). We
> can only take an IRQ from EL1 *after* that returns.
>
> We *might* need to audit the KVM vector handlers to make sure they're not
> dependent on RCU protection (I assume they're not, but it's possible something
> has leaked into the VHE code).

IIUC, in the window between local_irq_enable() and guest_exit() any driver
interrupt might trigger, right?

Regards,

-- 
Nicolás Sáenz
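
P.S. To make the ordering argument above concrete, here is a toy user-space
model of the two sequences. It is only a sketch, not kernel code: the struct,
its fields and every model_*() function are hypothetical names made up for
illustration, and the "IRQ" is just a function call standing in for a pending
timer tick that fires once interrupts are unmasked. It assumes, as discussed
above, that an IRQ taken at EL1 does not exit the EQS on its own.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy per-CPU state; hypothetical, not a kernel structure. */
struct cpu_model {
    bool in_eqs;      /* true while RCU is not watching this CPU */
    bool pf_vcpu;     /* true while ticks are accounted as guest time */
    int irqs_in_eqs;  /* IRQs handled while still inside the EQS (bad) */
    int guest_ticks;  /* ticks accounted to the guest */
    int host_ticks;   /* ticks accounted to the host */
};

/* Stand-in for a timer IRQ handler that does not exit the EQS itself. */
static void model_timer_irq(struct cpu_model *c)
{
    if (c->in_eqs)
        c->irqs_in_eqs++;
    if (c->pf_vcpu)
        c->guest_ticks++;
    else
        c->host_ticks++;
}

/* Stand-in for guest entry: mark the task as a vCPU and enter the EQS. */
static void model_enter_guest(struct cpu_model *c)
{
    assert(!c->in_eqs);
    c->pf_vcpu = true;
    c->in_eqs = true;
}

/* Ordering of the quoted arm64 code: unmask IRQs, then guest_exit(). */
struct cpu_model model_arm64_order(void)
{
    struct cpu_model c = {0};

    model_enter_guest(&c);
    model_timer_irq(&c);   /* pending tick fires at local_irq_enable() */
    c.in_eqs = false;      /* guest_exit(): leave the EQS ... */
    c.pf_vcpu = false;     /* ... and stop guest time accounting */
    return c;
}

/* Ordering of the quoted x86 code: leave the EQS first, account last. */
struct cpu_model model_x86_order(void)
{
    struct cpu_model c = {0};

    model_enter_guest(&c);
    c.in_eqs = false;      /* context tracking exit, PF_VCPU still set */
    model_timer_irq(&c);   /* pending tick fires in the IRQ window */
    c.pf_vcpu = false;     /* guest time accounting ends here */
    return c;
}
```

In this model both orderings charge the tick to the guest, but only the
arm64-style ordering handles the IRQ while still inside the EQS, which is the
problem described above.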