Re: [PATCH 5/5] Fix kdump under KVM.

Marcelo Tosatti <mtosatti@xxxxxxxxxx> · Fri, 30 Oct 2009 20:19:42 -0200

On Fri, Oct 30, 2009 at 04:28:01PM +0100, Chris Lalancette wrote:
> Marcelo Tosatti wrote:
> > On Fri, Oct 30, 2009 at 01:23:57PM +0100, Chris Lalancette wrote:
> >> Marcelo Tosatti wrote:
> >>> On Wed, Oct 28, 2009 at 11:21:42AM +0100, Chris Lalancette wrote:
> >>>> Marcelo Tosatti wrote:
> >>>>> On Tue, Oct 27, 2009 at 05:41:07PM +0100, Chris Lalancette wrote:
> >>>>>> This patch is the main point of the series.  In order for
> >>>>>> kdump to properly work inside a KVM guest, we need to make
> >>>>>> sure that all VCPUs in virtual wire APIC mode get kicked
> >>>>>> to try and pick up the timer interrupts.  To do this,
> >>>>>> we iterate over the CPUs and deliver interrupts to the
> >>>>>> proper VCPUs.
> >>>>>>
> >>>>>> I don't love the concept of doing kvm_irq_kick_vcpus() from
> >>>>>> within pit_timer_fn().  A PIT is not connected to a CPU at all,
> >>>>>> only to a PIC or APIC.  However, if a CPU enters idle, this is
> >>>>>> the only way to wake it up to check for the interrupt.
> >>>>> The reason the PIT interrupt was fixed to BSP is:
> >>>>>
> >>>>> http://www.mail-archive.com/kvm-devel@xxxxxxxxxxxxxxxxxxxxx/msg13250.html
> >>>>>
> >>>>> Perhaps enhancing ioapic_deliver's (irq == 0) check to bypass the       
> >>>>> destination overwrite in case its programmed by the guest to 
> >>>>> a single CPU would fix it?
> >>>> Ug, nasty.  I definitely don't want to re-introduce that bug.  What exactly do
> >>>> you mean by "fix it"?  Do you mean fix the original RHEL-5.1 PAE issue, or fix
> >>>> the kdump issue?
> >>> The kdump issue. Something like:
> >>>
> >>> #ifdef CONFIG_X86
> >>>         /* Always delivery PIT interrupt to vcpu 0 */
> >>>         if (irq == 0 && dest_multiple_vcpus(entry)) {
> >>>                 irqe.dest_mode = 0; /* Physical mode. */
> >>>                 /* need to read apic_id from apic regiest since
> >>>                  * it can be rewritten */
> >>>                 irqe.dest_id = ioapic->kvm->bsp_vcpu->vcpu_id;
> >>>         }
> >>> #endif
> >>>
> >>> The rest of the code should be ready to deal with it.
> >> Do you have any more information about how to reproduce that original issue?  I
> >> tried installing a RHEL-5.1 i386 guest with a PAE kernel, and I was able to
> >> install and boot it just fine with or without the "always deliver irq 0 to vcpu
> >> 0" check above.  Either there's something in particular I need to do to trigger
> >> the issue (hardware or software), or it's being solved another way.
> > 
> > For VMX, the guests TSC's are now usually synchronized (which was not
> > the case before, when this patch was written). For AMD, they start out
> > of sync.
> > 
> > Migration also causes them to be out of sync. So i'd prefer to postpone
> > the removal of this limitation until there is a guarantee the TSCs 
> > are synchronized across vcpus.
> 
> OK.  I'll try to get on an unsynchronized TSC machine and see if I can reproduce
> there.
> 
> > 
> >> In the meantime, I've gotten the "set_irq from IRQ context" that Avi suggested
> >> working, and the fixing up of this IOAPIC check is the last bit to actually get
> >> kdump working.
> > 
> > The kdump fix does not require that, I believe (although it is
> > useful/required for other things, please send the patch separately).
> > 
> > The hrtimer will fire on vcpu0, then:
> > 
> > kvm_inject_pit_interrupts -> ioapic_set_irq -> apic_deliver ->
> > kvm_vcpu_kick(vcpuN)
> > 
> > As long as the dest-id = vcpu_id overwrite change is in place.
> 
> Just to follow up here, I am taking your suggestion about testing for dest_id to
> multiple vcpus in the IOAPIC, and it seems to be working so far.  However, that
> is not enough to fix kdump.  We also need to remove the BSP assumptions in the
> i8254 code, otherwise we will never even attempt to deliver the interrupt to
> !vcpu0.

But the limitation is only in ioapic_deliver. vcpu0 will process the
hrtimer interrupt on its entry path (through __inject_pit_timer_intr),
call into the ioapic code via kvm_set_irq, and follow the normal apic
interrupt delivery path (which includes waking up the target vcpu(s)).

> That means that we need code for Avi's suggestion about kvm_set_irq for
> this to all work.

What happens now is the kvm_set_irq work is "scheduled" to execute in
vcpu0 entry context.

Regarding KVM_REQ_PENDING_TIMER, it is implicitly used in
vcpu_enter_guest here:

        if (vcpu->requests || need_resched() || signal_pending(current))

> Attached is the patch that I'm currently testing if anyone else wants to look at
> it.  I obviously need to break it up into a nice series, but that will have to
> wait until next week.  My "ioapic_dest_multiple_cpus()" function could do with a
> "popcnt", but I couldn't find a macro after a quick look around the tree.
> What's the recommended way to count bits in Linux?

There's a hweight_long helper.

Hum, suppose its safe to limit the irq0->bsp override to ioapic delivery
mode lowest priority (so you don't need to compare IOAPIC destination id
with the local apic ID's).

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html