Avi Kivity <avi@xxxxxxxxxx> writes:

> On 06/30/2009 10:36 PM, Eric W. Biederman wrote:
>>>> The short version is I don't know what workarounds we will ultimately
>>>> decide to deploy to work with real hardware.
>>>>
>>>> I have been seriously contemplating causing a cpu hot-unplug request
>>>> to fail if we are in ioapic mode and we have irqs routed to the cpu
>>>> that is being unplugged.
>>>>
>>> Well, obviously we need to disassociate any irqs from such a cpu. Could be
>>> done from the kernel or only enforced by the kernel.
>>>
>>
>> Using the normal irq migration path we can move irqs off of a cpu reliably;
>> there just aren't any progress guarantees.
>>
>
> Program the ioapic to the new cpu. Wait a few milliseconds. If it takes more
> than that to get an interrupt from the ioapic to the local apic, the machine
> has much bigger problems.

In general you can not reprogram an ioapic safely unless the interrupt
is blocked at the source. Which is why you either need the originating
device disabled or you have to do it in interrupt context.

I forget all of the details. I just know that on real hardware I
experimented with it a lot, and wound up hanging the ioapic state
machine of both Intel and AMD ioapics.

Migrating ioapic irqs in interrupt context sucks. It just happens to
be what works reliably.

I do think that waiting an eternity in computer time (a short while in
human time) is a valid technique when you can do nothing better. If
flushing the interrupt was my only problem, that would solve it.

>>>> Even with perfectly working hardware it is not possible in the general
>>>> case to migrate an ioapic irq from one cpu to another outside of an
>>>> interrupt handler without risking dropping an interrupt.
>>>>
>>> Can't you generate a spurious interrupt immediately after the migration?
>>> An extra interrupt shouldn't hurt.
>>>
>>
>> Nope. The ioapics can't be told to send an interrupt.
>>
>
> You can program the local apic ICR to generate an interrupt with the same
> vector.

But you can not program the apic ICR to generate a level triggered
interrupt with the same vector. So you don't get the broadcast
behavior when you ack the apic.

>>>> There is no general way to know you have seen the last interrupt
>>>> floating around your system. PCI ordering rules don't help because
>>>> the ioapics can potentially take an out of band channel.
>>>>
>>> Can you describe the problem scenario? An ioapic->lapic message delivered
>>> to a dead cpu?
>>>
>>
>> Dropped irqs. Driver hangs because it is waiting for an irq. Hardware
>> hangs because it is waiting for the cpu to process the irq.
>>
>> Potentially we get a level triggered irq that is never acked by
>> the cpu that won't arm until the cpu sends an ack, and we can't
>> send an ack from another cpu.
>>
>
> I think a spurious interrupt generated through the local apic solves that
> problem. For level-triggered interrupts, mask them before offlining the cpu.

So now we have a masked unacked irq.

It doesn't help in the slightest that the cpu migration code puts irq
migration last and requests that we do it all with interrupts disabled.

You might be right that by application of extreme ingenuity and
completely in-spec ioapics there is a path that makes this all work.

Currently I don't have fully in-spec ioapics, and I don't have anyone
interested enough in cpu hotplug to be willing to rip things apart
until interrupt migration is a reasonable deal on x86. Instead all I
see are patches that mitigate the worst of the brokenness.

At the same time, with the interrupt remapping hardware I can make
everything stable much more easily, because it doesn't need the irq
disabled at the source when we reprogram it.

Eric