On Wed, 2014-06-25 at 00:41 +0200, Alexander Graf wrote: > On 24.06.14 20:53, Scott Wood wrote: > > On Sun, 2014-06-22 at 23:23 +0200, Alexander Graf wrote: > >> Howdy, > >> > >> Ben reminded me a while back that we have a nasty race in our KVM PV code. > >> > >> We replace a few instructions with longer streams of instructions to check > >> whether it's necessary to trap out from it (like mtmsr, no need to trap if > >> we only disable interrupts). During those replacement chunks we must not get > >> any interrupts, because they might overwrite scratch space that we already > >> used to save otherwise clobbered register state into. > >> > >> So we have a thing called "critical sections" which allows us to atomically > >> get in and out of "interrupt disabled" modes without touching MSR. When we > >> are supposed to deliver an interrupt into the guest while we are in a critical > >> section, we just don't inject the interrupt yet, but leave it be until the > >> next trap. > >> > >> However, we never really know when the next trap would be. For all we know it > >> could be never. At this point we created a race that is a potential source > >> for interrupt loss or at least deferral. > >> > >> This patch set aims at solving the race. Instead of merely deferring an > >> interrupt when we see such a situation, we go into a special instruction > >> interpretation mode. In this mode, we interpret all PPC assembler instructions > >> that happen until we are out of the critical section again, at which point > >> we can now inject the interrupt. > >> > >> This bug only affects KVM implementations that make use of the magic page, so > >> e500v2, book3s_32 and book3s_64 PR KVM. > > Would it be possible to single step through the critical section > > instead? Or set a high res timer to expire very quickly? > > There are a few other alternatives to this implementation: > > 1) Unmap the magic page, emulate all memory access to it while in > critical and irq pending > 2) Trigger a timer that sends a request to the vcpu to wake it from > potential sleep and inject the irq > 3) Single step until we're beyond the critical section > 4) Probably more that I can't think of right now :) > > Each has their good and bad sides. Unmapping the magic page adds > complexity to the MMU mapping code, since we need to make sure we don't > map it back in on demand and treat faults to it specially. > > The timer interrupt works, but I'm not fully convinced that it's a good > idea for things like MC events which we also block during critical > sections on e500v2. Are you concerned about the guest seeing machine checks that are (more) asynchronous with the error condition? e500v2 machine checks are always asynchronous. From the core manual: Machine check interrupts are typically caused by a hardware or memory subsystem failure or by an attempt to access an invalid address. They may be caused indirectly by execution of an instruction, but may not be recognized or reported until long after the processor has executed past the instruction that caused the machine check. As such, machine check interrupts are not thought of as synchronous or asynchronous nor as precise or imprecise. I don't think the lag would be a problem, and certainly it's better than the current situation. > Single stepping is hard enough to get right on interaction between QEMU, > KVM and the guest. I didn't really want to make that stuff any more > complicated. I'm not sure that it would add much complexity. We'd just need to check whether any source other than the magic page turned wants DCBR0_IC on, to determine whether to exit to userspace or not. > This approach is really just one out of many - and it's one that's > nicely self-contained and shouldn't have any impact at all on > implementations that don't care about it ;). "Nicely self-contained" is not a phrase I'd associate with 33 patches, including a bunch of new emulation that probably isn't getting great test coverage. -Scott -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html