Re: [RFC][PATCH]: kdump on KVM

Marcelo Tosatti <mtosatti@xxxxxxxxxx> · Mon, 7 Jun 2010 11:35:38 -0300

On Sun, Jun 06, 2010 at 11:01:23AM +0300, Avi Kivity wrote:
> On 06/04/2010 11:11 PM, Chris Lalancette wrote:
> >Hello,
> >      I've gone back to look at the problem with kdump while running as a KVM
> >guest.  I'll preface this discussion by noting that all of the problems I
> >describe below only happen when the HPET emulation is shutoff.  However, since
> >there still seem to be some lingering issues with the HPET emulation, running
> >without the HPET is a reasonable mode of operation, and one in which it would
> >be nice to have kexec/kdump working.
> >      The basic problem with getting kdump working is that currently KVM
> >delivers the timer interrupts only to vcpu 0, so kexecing onto another
> >processor will never see timer interrupts.  In the KVM code, there are a
> >number of places where we make incorrect assumptions regarding the timer and
> >which vcpu to deliver it to:
> >
> >1)  In the i8254.c code when the hrtimer representing the PIT expires.  In
> >this case, when we get the callback, we kick only the BSP.
> >
> >2)  In the i8254.c code when a vcpu is migrated from one processor to another.
> >In this case we only migrate the PIT timer if the vcpu to be migrated is the
> >BSP.
> >
> >3)  In the lapic code when deciding whether to accept a PIC interrupt, we only
> >accept interrupts on the BSP.
> >
> >4)  In the irq_comm.c code when calling kvm_irq_delivery_to_apic().  The
> >problem here is that we don't take into account the fact that an LAPIC might
> >be disabled when trying to deliver an interrupt in DM_LOWEST mode.  Further,
> >on a kexec, the processor that we are kexec'ing *to* gets it's APIC ID
> >re-written to the BSP APIC ID.  What it means in the end is that we are
> >currently still matching against the BSP even though vcpu 1 (where the kexec
> >is happening) would match if we let it.
> >
> >The attached patch takes care of 1), 3), and 4), and works in my testing (it
> >needs to be cleaned up a bit to not be so inefficient, but it should work).
> >However, problem 2) is pretty sticky.  The reason we are currently migrating
> >the PIT timer around with the BSP is pretty well explained in commit
> >2f5997140f22f68f6390c49941150d3fa8a95cb7.  With my new patch, though, we are
> >no longer guaranteeing that we are going to inject onto CPU 0.  I think we
> >can do something where when the hrtimer expires, we figure out which
> >vcpu will get the timer interrupt and IPI to that physical processor to cause
> >the VMEXIT.  Unfortunately it's racy because the expiration of the hrtimer is
> >de-coupled from the setting of the IRQ for the interrupt.
> 
> It's only racy because of (perhaps not immature) optimization.
> Logically, the flow of events is as follows:
> 
> 1. the httimer fires
> 2. irq0 is toggled (perhaps via a workqueue since the hrtimer is in
> interrupt context)
> 3. the PIC and IOAPIC state machines run and decide which local
> APIC(s), if any, need to be notified
> 4. the local APICs are IPIed.
> 
> Perhaps we should start by deoptimizing the code to work like the above.
> 
> In light of the below, I'm not sure the race actually exists.
> 
> >  That means that
> >the hrtimer could  expire, we could choose vcpu 2 (say), IPI to cause a
> >VMEXIT, but by the time it goes to VMENTER the guest has changed something in
> >the (IO)APIC(s) so that the set_irq logic chooses vcpu 3 to do the injection.
> >This would result in a delayed injection that 2f5997 is trying to avoid.
> 
> Not sure that's the race here.  Between the vmexit and vmenter we
> have the following trace:
> 
>   kvm_inject_pending_timer_irqs()
>   kvm_inject_pit_timer_irqs()
>   __inject_pit_timer_intr()
>   kvm_set_irq()
> 
> So, even if we end up on the wrong vcpu, we will IPI the right one later.
> 
> >Taking a step back, it seems to me that something along the lines of my
> >previous patchset (where we do set_irq directly from the hrtimer callback) is
> >the right way to go.  We would still need to IPI to the appropriate physical
> >cpu to cause a VMEXIT on the cpu we care about, but we would avoid the race I
> >describe in the previous paragraph.
> 
> I agree.  But we also want the vcpu that accepts the interrupt to
> pull the hrtimer after it (like it does now) to minimize cross-cpu
> talk.
> 
> >Unfortunately that patchset is also much
> >more risky.
> 
> I'd like a minimal, backportable patchset first, then bigger changes
> later.  We have at least two choices, irq-safe pit and ioapic, and
> using a workqueue for to match impedances.  The latter is a lot
> safer since we already do it for assigned devices.
> 
> >diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
> >index 3c4f8f6..bc40143 100644
> >--- a/arch/x86/kvm/i8254.c
> >+++ b/arch/x86/kvm/i8254.c
> >@@ -231,7 +231,13 @@ int pit_has_pending_timer(struct kvm_vcpu *vcpu)
> >  {
> >  	struct kvm_pit *pit = vcpu->kvm->arch.vpit;
> >
> >-	if (pit&&  kvm_vcpu_is_bsp(vcpu)&&  pit->pit_state.irq_ack)
> >+	/* We only want to inject a single IRQ for each PIT timer that goes
> >+	 * off.  If 2 vcpus happen to come in here at the same time, we may
> >+	 * erroneously tell one of them that it has a timer pending.  That's
> >+	 * OK, though; in kvm_inject_pit_timer_irqs we will make sure to only
> >+	 * let one of them inject.
> >+	 */
> >+	if (pit&&  pit->pit_state.irq_ack)
> >  		return atomic_read(&pit->pit_state.pit_timer.pending);
> >  	return 0;
> >  }
> >@@ -277,6 +283,51 @@ static struct kvm_timer_ops kpit_ops = {
> >  	.is_periodic = kpit_is_periodic,
> >  };
> >
> >+static enum hrtimer_restart pit_timer_fn(struct hrtimer *data)
> >+{
> >+	int restart_timer = 0;
> >+	struct kvm_vcpu *vcpu, *this;
> >+	int i;
> >+	struct kvm_timer *ktimer = container_of(data, struct kvm_timer, timer);
> >+
> >+	if (!ktimer->vcpu)
> >+		return HRTIMER_NORESTART;
> >+
> >+	this = NULL;
> >+	kvm_for_each_vcpu(i, vcpu, ktimer->kvm) {
> >+		if (kvm_lapic_enabled(vcpu)) {
> >+			this = vcpu;
> >+			break;
> >+		}
> >+	}
> >+	if (this == NULL)
> >+		this = ktimer->kvm->bsp_vcpu;
> 
> If lapics are active, 'this' is chosen as the highest numbered lapic.
> 
> >+
> >+	/*
> >+	 * There is a race window between reading and incrementing, but we do
> >+	 * not care about potentially loosing timer events in the !reinject
> >+	 * case anyway.
> >+	 */
> >+	if (ktimer->reinject || !atomic_read(&ktimer->pending)) {
> >+		atomic_inc(&ktimer->pending);
> >+		/* FIXME: this code should not know anything about vcpus */
> >+		set_bit(KVM_REQ_PENDING_TIMER,&this->requests);
> 
> The comment is correct...
> 
> Possible fix:
> - audit the code to make sure that it will work regardless of which
> vcpu it is tied to
> - install an irq ack notifier for the pit interrupt
> - pull the pit hrtimer towards the vcpu that last acked it
> 
> >+	}
> >+
> >+	if (waitqueue_active(&this->wq))
> >+		wake_up_interruptible(&this->wq);
> 
> Need an IPI if the vcpu is in guest mode and we set
> KVM_REQ_PENDING_TIMER earlier.
> 
> >+
> >+	if (ktimer->t_ops->is_periodic(ktimer)) {
> >+		hrtimer_add_expires_ns(&ktimer->timer, ktimer->period);
> >+		restart_timer = 1;
> >+	}
> >+
> >+	if (restart_timer)
> >+		return HRTIMER_RESTART;
> >+	else
> >+		return HRTIMER_NORESTART;
> 
> Unrelated: as soon as ->pending starts to accumulate, we should
> return HRTIMER_NORESTART and calculate pending from the period and
> current time.  This will reduce hrtimers if the system is
> overloaded, or if the guest masked the pit and uses some other time
> source.
> 
> >@@ -1106,13 +1112,11 @@ int kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu)
> >  	u32 lvt0 = apic_get_reg(vcpu->arch.apic, APIC_LVT0);
> >  	int r = 0;
> >
> >-	if (kvm_vcpu_is_bsp(vcpu)) {
> >-		if (!apic_hw_enabled(vcpu->arch.apic))
> >-			r = 1;
> >-		if ((lvt0&  APIC_LVT_MASKED) == 0&&
> >-		    GET_APIC_DELIVERY_MODE(lvt0) == APIC_MODE_EXTINT)
> >-			r = 1;
> >-	}
> >+	if (!apic_hw_enabled(vcpu->arch.apic))
> >+		r = 1;
> >+	if ((lvt0&  APIC_LVT_MASKED) == 0&&
> >+	    GET_APIC_DELIVERY_MODE(lvt0) == APIC_MODE_EXTINT)
> >+		r = 1;
> >  	return r;
> >  }
> 
> This is the crux, yes?  I think that if we pull the vcpu calculation
> out of pit_timer_fn into an ack notifier, we can end up with the
> original kvm_timer_fn.

I see disable_IO_APIC puts the IOAPIC in virtual wire mode with dest_id
== bsp apic id, yes? 

So the above should be enough? 

As Avi mentioned having the hrtimer tick on vcpu0 should be fine since
the entry path will inject the interrupt to the proper LAPIC.

> 
> >
> >diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> >index 9fd5b3e..c492bae 100644
> >--- a/virt/kvm/irq_comm.c
> >+++ b/virt/kvm/irq_comm.c
> >@@ -98,7 +98,7 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
> >  			if (r<  0)
> >  				r = 0;
> >  			r += kvm_apic_set_irq(vcpu, irq);
> >-		} else {
> >+		} else if (kvm_lapic_enabled(vcpu)){
> >  			if (!lowest)
> >  				lowest = vcpu;
> >  			else if (kvm_apic_compare_prio(vcpu, lowest)<  0)
> 
> Unrelated fix?
> 
> The patch is large for such a ticklish subject.  I'd be more than
> happy to remove the vcpu affinity optimization, fix the code, and
> reinstate the optimization later, since it will make the patch(es)
> much easier to review.
> 
> -- 
> error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html