On Fri, Jan 11, 2013 at 05:45:28PM +0200, Gleb Natapov wrote: > On Thu, Jan 10, 2013 at 11:40:07PM -0700, Matthew Ogilvie wrote: > > On Tue, Jan 08, 2013 at 09:31:19AM +0200, Gleb Natapov wrote: > > > On Mon, Jan 07, 2013 at 06:17:22PM -0600, mmogilvi@xxxxxxxxxxxx wrote: > > > > On Mon, 7 Jan 2013 11:39:18 +0200, Gleb Natapov <gleb@xxxxxxxxxx> wrote: > > > > > On Wed, Dec 26, 2012 at 10:39:53PM -0700, Matthew Ogilvie wrote: > > > > >> Reading the spec, it is clear that most modes normally leave the IRQ > > > > >> output line high, and only pulse it low to generate a leading edge. > > > > >> Especially the most commonly used mode 2. > > > > >> > > > > >> The KVM i8254 model does not try to emulate the duration of the pulse at > > > > >> all, so just swap the high/low settings it to leave it high most of > > > > >> the time. > > > > >> > > > > >> This fix is a prerequisite to improving the i8259 model to handle > > > > >> the trailing edge of an interupt request as indicated in its spec: > > > > >> If it gets a trailing edge of an IRQ line before it starts to service > > > > >> the interrupt, the request should be canceled. > > > > >> > > > > >> See http://bochs.sourceforge.net/techspec/intel-82c54-timer.pdf.gz > > > > >> or search the net for 23124406.pdf. > > > > >> > > > > >> Risks: > > > > >> > > > > >> There is a risk that migrating a running guest between versions > > > > >> with and without this patch will lose or gain a single timer > > > > >> interrupt during the migration process. The only case where > > > > > Can you elaborate on how exactly this can happen? Do not see it. > > > > > > > > > > > > > KVM 8254: In the corrected model, when the count expires, the model > > > > briefly pulses output low and then high again, with the low to high > > > > transition being what triggers the interrupt. In the old model, > > > > when the count expires, the model expects the output line > > > > to already be low, and briefly pulses it high (triggering the > > > > interrupt) and then low again. But if the line was already > > > > high (because it migrated from the corrected model), > > > > this won't generate a new leading edge (low to high) and won't > > > > trigger a new interrupt (the first post-back-migration pulse turns > > > > into a simple trailing edge instead of a pulse). > > > > > > > > Unless there is something I'm missing? > > > > > > > No, I missed that pic->last_irr/ioapic->irr will be migrated as 1. But > > > this means that next interrupt after migration from new to old will > > > always be lost. What about clearing pit bit from last_irr/irr before > > > migration? Should not affect new->new migration and should fix new->old > > > one. The only problem is that we may need to consult irq routing table > > > to know how pit is connected to ioapic. > > > > We should not clear both last_irr and irr. That > > cancels the interrupt early if the CPU hasn't started servicing > > it yet: If the guest CPU has interrupts disabled > > and hasn't begun to service the interrupt, a new->new migration could > > lose the interrupt much earlier in the countdown cycle than it otherwise > > might be lost. > > > I am talking about last_irr in pic and irr in ioapic. Of course we > shouldn't clear pic->irr on migration. ioapic uses irr to detect edge. I probably need to look into the details of how the ioapic is supposed to work. Sigh. > > > Potentially we could clear the last_irr bit only. This effectively > > makes migration behave like the old unfixed code. But if this is > > considered acceptable, I would be more inclined to not fix IRQ0 at all, > If we do not fix IRQ0 next timer interrupt after migration will always be > lost. Obviously true if the trailing edge consumption logic for the IRQ0 line is corrected in the 8259. But if it isn't: I probably could have been clearer that we could leave the 8254 unchanged, and adjust the 8259 fix to leave IRQ0 as-is (with the incorrect handling of a trailing edge - only other IRQ lines would be fixed), and then timer interrupts would just work exactly as they do now. This minimal fix would would probably be the lowest risk of breaking something that currently works, but I don't know if people would go for a patch that intentionally leaves in known breakage in IRQ0... This is option 1 below. ------ If we cleared last_irr in the qemu model during migration, we risk getting an EXTRA interrupt when migrating mode 4 (misremembered as mode 2 in an earlier email below) from old to new models. (If the sequence goes: line is asserted, guest migrates, interrupt is handled [all in less than a 8254 clock tick (approx 1 MHz)], then the off-by-one edge in the new model triggers another leading edge...) In contrast, this might not be a risk in the KVM model as currently implemented. Maybe this objection is not important, and you like option 4 listed below. ------- So I'm still not sure what overall fix strategy the main qemu and kvm maintainers would like the best. There are several possibilities, but they all seem to have notable downsides. Some possible fix strategies: 1. Only fix the IRQ2 (cascade) line in 8259. Leave trailing edge logic for other lines as-is, and don't touch the 8254. (Or similar: only fix lines that don't risk breaking anything [everything except IRQ0?]) * For: This has no risk of breaking anything that currently works, and fixes my own problem. * Against: It somehow feels wrong to intentionally leave incorrect models for IRQ0 (both generation and consumption). It also feels wrong to add more assumptions about PC-style slave/master PIC and PIT wiring in the device model code. 2. Just fix the 8254 and 8259 directly, with nothing fancy for compatibility. My v8 (Dec 16) patch series does this. * For: simple * For: A single lost interrupt during migration between versions is probably not serious. * Against: Unless it looses a one-shot mode interrupt, like with a "tickless" Linux kernel. (Although if the guest gets some other interrupt "soon", and it resets the 8254 as part of handling that interrupt, then maybe this isn't serious either.) * For: But on the other hand, I get the impression that anything new enough to do such a "tickless" thing is likely to prefer using some other timer besides the 8254. But maybe this impression is wrong? 3. Fix it in stages, with years between stages, where it can safely migrate between adjacent stages, but not as safely migrate between skipped stages. My v7 (Nov 25) patch series was attempting one way of doing this, although there are probably other possibilities. 4. Fix the 8254 and 8259, but add in some migration logic to force IRQ0's pic->last_irr bit to 0 when migrating. * For: Avoid lost interrupts. * Against: This basically sets the migration state as if we did not fix the IRQ0 line trailing edge handling consumption logic in the 8259 at all, which is subtly different from normal operation. I would tend prefer to avoid making post-migration state any different from normal operation state. * For: But maybe this could be the first stage of a multi-stage approach, where we can eventually eliminate the special logic in the migration someday in the future (when migrating to versions with the old models is no longer something we care about). So the ugly inconsistency would only be in the code for a few years. (See also strategy 3 above.) * Against: In the qemu model (but not the kvm model), clearing last_irr may occasionally cause an EXTRA interrupt for mode 4 during part of the output cycle. [Sequence: Old version raises IRQ0 one clock tick early, guest migrates, interrupt is handled, then new version raises IRQ0 at the correct time...] * Question: How to handle the migration hack for the KVM model? It may needs to do this correction both in the kernel and in qemu, which feels ugly. Or potentially add new ioctl()'s (and fallback code if the ioctl is missing) to support both the "compatible" state and the "actual" state, but then that is drifting towards option 5 below: 5. Make the IRQ0 generation/consumption behavior user-selectable with a command line option (the cross-version migration "promise" is allowed to be broken if qemu is invoked with different options). Sometime in the future, perhaps change the default, and/or eliminate the old option. * For: By defaulting to current model behavior, we could avoid breaking anything that currently depends on current behavior in hard-to-forsee ways. * Against: More complicated code, and depending on details may either be slightly slower (more conditional branches in common code paths), or even more complicated code (another level of abstraction and more C-style virtual methods...). * Against: The non-default case probably wont be tested much, and risks bit-rotting. * Question: What should the command line option look like? Anyone have any preference for any of these? Or other alternatives? ------------------- Note that my tentative assignment of requirement priorities is as described below. High priority goals: 1. Handle the "trailing edge" of IRQ2 in the qemu 8259 model correctly, specifically to fix the case of a guest (Microport UNIX) that dynamically manipulated the IMR (mask register) of the slave while an IRQ line is asserted and unhandled. 2. Do not break any common use cases, at least not badly (perhaps a single lost or gained interrupt would be OK as long as it doesn't cause longer term breakage in a guest). * all device models should still work * basic operation of modern/common guest operating systems * cross version migration when KVM is in use Moderate priority goals (I wouldn't mind if these weren't met, but I would like to try to meet them): 3. Support cross version migration with plain qemu (not KVM). (or is this high priority like KVM migration?) 4. Completely fix all IRQ lines in both 8259 models. (I only really need the cascade line (IRQ2) to be fixed.) 5. Update the KVM models to continue to match the qemu models, at least to the same extent they are similar now. (I don't actually care about the KVM 825* models much, but I would like to avoid making them any less consistent. [see also 7 below]) 6. Improve the output logic of 8254 models to match the spec better, instead of depending on the broken trailing edge logic in the 8259 models in order to work. And as long as it is changing, fix some other small related details, like the exact clock tick on which edges occur, some GATE stuff, etc. Low priority (non)-goals (I'm not inclined to worry about these): 7. Enhance the KVM 8254 model to be able to model the leading and trailing edges of the output line at distinct times (similar to the qemu 8254 model). Perhaps it only does this when it isn't trying to "catch up" with missed interrupts. It could still do zero-time pulses when catching up. 8. Accurately model the 8254's GATE line. PC hardware hardwires it in the main channel 0, so modeling changes to it isn't particularly important. Currently the models don't have a good way to accurately model a paused countdown when GATE is low. * (exception): However, there is already some imperfect handling of GATE in the 8254 models, and the speaker channel has a software-controlled GATE. So I don't want to make GATE any worse than it already is, and if I can make partial improvements by changing just a couple of lines, I'm inclined to do so as part of item 6 above. - Matthew Ogilvie > > > rather than make IRQ0 work subtly differently normally vs during > > migration. One of my earlier patch series (qemu v7 dated Nov 25) > > attempts to use basically this strategy for the qemu model, at > > least in the short term (and then potentially fix it properly > > in the longer term), although the details of series might > > be tweaked. > > > > Or the minimal-risk strategy is to ONLY fix the cascade IRQ2's > > trailing edge [the original i825* problem that spawned this whole > > thing], leaving all other IRQs as-is. > > > > > > > > Still do not see how can we gain one interrupt. > > > > In most cases I don't think we get an extra interrupt from > > a direct fix. But some kinds of attempts to work around missing > > interrupts during migration can cause cause extra interrupts in other > > cases. In the old qemu model (but perhaps not kvm), the mode 2 leading > > edge occurs one clock tick earlier than it ought to. So if you > > try to be tricky with adjusting levels during migration, you > > may introduce possible cases where it gets an interrupt in > > the old model, then migrates and gets another interrupt one tick > > later in the new model... > > > > Also, it occurs to me that maybe there might be an off-by-one issue in > > the kvm model of mode 2 that ought to be fixed as well? Although the > > zero length pulse in kvm suggests that one-off issues in counters do > > not matter. In the qemu model, the leading edge of OUT in mode 2 > > moves by one tick... > > > > > > > > > The qemu 8254 model actually models each edge at independent > > > > clock ticks instead of combining both into a very brief pulse at one time. > > > > I've found it handy to draw out old and new timing diagrams on paper > > > > (for each mode), and then carefully think about what happens with respect > > > > to levels and edges when you transition back and forth between old and > > > > new models at various points in the timing cycle. (Note I've spent more > > > > time examining the qemu models rather than the kvm models.) > > > > > > > Yes, drawing it definitely helps :) > > > > > > > >> this is likely to be serious is probably losing a single-shot (mode 4) > > > > >> interrupt, but if my understanding of how things work is good, then > > > > >> that should only be possible if a whole slew of conditions are > > > > >> all met: > > > > >> > > > > >> 1. The guest is configured to run in a "tickless" mode (like > > > > >> modern Linux). > > > > >> 2. The guest is for some reason still using the i8254 rather > > > > >> than something more modern like an HPET. (The combination > > > > >> of 1 and 2 should be rare.) > > > > > This is not so rare. For performance reason it is better to not have > > > > > HPET at all. In fact -no-hpet is how I would advice anyone to run qemu. > > > > > > > > In a later email you mention that Linux prefers a timer in the APIC. > > > > I don't know much about the APIC (advanced interrupt controller), and > > > > wasn't > > > > even aware had it's own timer. > > > > > > > > The big question is if we can safely just fix the i825* models, or if > > > > we need something more subtle to avoid breaking commonly used guests > > > > like modern Linux (support both corrected and old models, > > > > or only fix IRQ2 instead of all IRQs, or similar subtlety). > > > Migration may happen while guest is running firmaware. Who knows what > > > those are doing. If the fix is as easy as I described above we should go > > > for it. > > > > > > > > > > > > > > > > >> 3. The migration is going from a fixed version back to the > > > > >> old version. (Not sure how common this is, but it should > > > > >> be rarer than migrating from old to new.) > > > > >> 4. There are not going to be any "timely" events/interrupts > > > > >> (keyboard, network, process sleeps, etc) that cause the guest > > > > >> to reset the PIT mode 4 one-shot counter "soon enough". > > > > >> > > > > >> This combination should be rare enough that more complicated > > > > >> solutions are not worth the effort. > > > > >> > > > > >> Signed-off-by: Matthew Ogilvie <mmogilvi_qemu@xxxxxxxxxxxx> > > > > >> --- > > > > >> arch/x86/kvm/i8254.c | 6 +++++- > > > > >> 1 file changed, 5 insertions(+), 1 deletion(-) > > > > >> > > > > >> diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c > > > > >> index c1d30b2..cd4ec60 100644 > > > > >> --- a/arch/x86/kvm/i8254.c > > > > >> +++ b/arch/x86/kvm/i8254.c > > > > >> @@ -290,8 +290,12 @@ static void pit_do_work(struct kthread_work *work) > > > > >> } > > > > >> spin_unlock(&ps->inject_lock); > > > > >> if (inject) { > > > > >> - kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 1); > > > > >> + /* Clear previous interrupt, then create a rising > > > > >> + * edge to request another interupt, and leave it at > > > > >> + * level=1 until time to inject another one. > > > > >> + */ > > > > >> kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 0); > > > > >> + kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 1); > > > > >> > > > > >> /* > > > > >> * Provides NMI watchdog support via Virtual Wire mode. > > > > >> -- > > > > >> 1.7.10.2.484.gcd07cc5 > > > > > > > > > > -- > > > > > Gleb. > > > > > > -- > > > Gleb. > > -- > Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html