Re: [RFC][PATCH]: kdump on KVM

On 06/04/2010 11:11 PM, Chris Lalancette wrote:
Hello,
      I've gone back to look at the problem with kdump while running as a KVM
guest.  I'll preface this discussion by noting that all of the problems I
describe below only happen when the HPET emulation is shut off.  However, since
there still seem to be some lingering issues with the HPET emulation, running
without the HPET is a reasonable mode of operation, and one in which it would
be nice to have kexec/kdump working.
      The basic problem with getting kdump working is that currently KVM
delivers the timer interrupts only to vcpu 0, so a kernel kexec'd onto another
processor will never see timer interrupts.  In the KVM code, there are a
number of places where we make incorrect assumptions regarding the timer and
which vcpu to deliver it to:

1)  In the i8254.c code when the hrtimer representing the PIT expires.  In
this case, when we get the callback, we kick only the BSP.

2)  In the i8254.c code when a vcpu is migrated from one processor to another.
In this case we only migrate the PIT timer if the vcpu to be migrated is the
BSP.

3)  In the lapic code when deciding whether to accept a PIC interrupt, we only
accept interrupts on the BSP.

4)  In the irq_comm.c code when calling kvm_irq_delivery_to_apic().  The
problem here is that we don't take into account the fact that an LAPIC might
be disabled when trying to deliver an interrupt in DM_LOWEST mode.  Further,
on a kexec, the processor that we are kexec'ing *to* gets its APIC ID
re-written to the BSP APIC ID.  What it means in the end is that we are
currently still matching against the BSP even though vcpu 1 (where the kexec
is happening) would match if we let it.

The attached patch takes care of 1), 3), and 4), and works in my testing (it
needs to be cleaned up a bit to not be so inefficient, but it should work).
However, problem 2) is pretty sticky.  The reason we are currently migrating
the PIT timer around with the BSP is pretty well explained in commit
2f5997140f22f68f6390c49941150d3fa8a95cb7.  With my new patch, though, we are
no longer guaranteeing that we are going to inject onto CPU 0.  I think we
can do something where when the hrtimer expires, we figure out which
vcpu will get the timer interrupt and IPI to that physical processor to cause
the VMEXIT.  Unfortunately it's racy because the expiration of the hrtimer is
de-coupled from the setting of the IRQ for the interrupt.

It's only racy because of (perhaps not premature) optimization. Logically, the flow of events is as follows:

1. the hrtimer fires
2. irq0 is toggled (perhaps via a workqueue, since the hrtimer runs in interrupt context)
3. the PIC and IOAPIC state machines run and decide which local APIC(s), if any, need to be notified
4. the local APICs are IPIed.

Perhaps we should start by deoptimizing the code to work like the above.
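A rough, untested sketch of that flow (the 'expired' work_struct member of kvm_pit and the function names are made up here for illustration; kvm_set_irq() and pit->irq_source_id already exist):

/* step 1: the hrtimer fires in hardirq context; just queue work and rearm */
static enum hrtimer_restart pit_expired(struct hrtimer *t)
{
	struct kvm_timer *ktimer = container_of(t, struct kvm_timer, timer);
	struct kvm_pit *pit = ktimer->kvm->arch.vpit;

	schedule_work(&pit->expired);		/* assumed work_struct member */

	if (ktimer->t_ops->is_periodic(ktimer)) {
		hrtimer_add_expires_ns(&ktimer->timer, ktimer->period);
		return HRTIMER_RESTART;
	}
	return HRTIMER_NORESTART;
}

/* steps 2-4: in process context, toggle irq0; kvm_set_irq() runs the PIC and
 * IOAPIC state machines, which pick the destination lapic(s) and kick them */
static void pit_inject_work(struct work_struct *work)
{
	struct kvm_pit *pit = container_of(work, struct kvm_pit, expired);

	kvm_set_irq(pit->kvm, pit->irq_source_id, 0, 1);
	kvm_set_irq(pit->kvm, pit->irq_source_id, 0, 0);
}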

In light of the below, I'm not sure the race actually exists.

That means that
the hrtimer could expire, we could choose vcpu 2 (say), IPI to cause a
VMEXIT, but by the time it goes to VMENTER the guest has changed something in
the (IO)APIC(s) so that the set_irq logic chooses vcpu 3 to do the injection.
This would result in a delayed injection that 2f5997 is trying to avoid.

Not sure that's the race here. Between the vmexit and vmenter we have the following trace:

  kvm_inject_pending_timer_irqs()
  kvm_inject_pit_timer_irqs()
  __inject_pit_timer_intr()
  kvm_set_irq()

So, even if we end up on the wrong vcpu, we will IPI the right one later.

Taking a step back, it seems to me that something along the lines of my
previous patchset (where we do set_irq directly from the hrtimer callback) is
the right way to go.  We would still need to IPI to the appropriate physical
cpu to cause a VMEXIT on the cpu we care about, but we would avoid the race I
describe in the previous paragraph.

I agree. But we also want the vcpu that accepts the interrupt to pull the hrtimer after it (like it does now) to minimize cross-cpu talk.

Unfortunately that patchset is also much
more risky.

I'd like a minimal, backportable patchset first, then bigger changes later. We have at least two choices: making the pit and ioapic irq-safe, or using a workqueue to match impedances. The latter is a lot safer, since we already do it for assigned devices.
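For reference, the irq-safe choice is the usual lock conversion so that kvm_set_irq() can be called straight from the hrtimer callback; this is a generic sketch of the pattern only (the real pit/ioapic structures and lock names differ):

struct irqchip_state {
	spinlock_t lock;		/* converted from a mutex */
	/* ... chip registers ... */
};

static void irqchip_set_irq_irqsafe(struct irqchip_state *s)
{
	unsigned long flags;

	spin_lock_irqsave(&s->lock, flags);
	/* run the state machine here; nothing in this section may sleep
	 * once it can be entered from the hrtimer (hardirq) callback */
	spin_unlock_irqrestore(&s->lock, flags);
}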

diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index 3c4f8f6..bc40143 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -231,7 +231,13 @@ int pit_has_pending_timer(struct kvm_vcpu *vcpu)
  {
  	struct kvm_pit *pit = vcpu->kvm->arch.vpit;

-	if (pit && kvm_vcpu_is_bsp(vcpu) && pit->pit_state.irq_ack)
+	/* We only want to inject a single IRQ for each PIT timer that goes
+	 * off.  If 2 vcpus happen to come in here at the same time, we may
+	 * erroneously tell one of them that it has a timer pending.  That's
+	 * OK, though; in kvm_inject_pit_timer_irqs we will make sure to only
+	 * let one of them inject.
+	 */
+	if (pit && pit->pit_state.irq_ack)
  		return atomic_read(&pit->pit_state.pit_timer.pending);
  	return 0;
  }
@@ -277,6 +283,51 @@ static struct kvm_timer_ops kpit_ops = {
  	.is_periodic = kpit_is_periodic,
  };

+static enum hrtimer_restart pit_timer_fn(struct hrtimer *data)
+{
+	int restart_timer = 0;
+	struct kvm_vcpu *vcpu, *this;
+	int i;
+	struct kvm_timer *ktimer = container_of(data, struct kvm_timer, timer);
+
+	if (!ktimer->vcpu)
+		return HRTIMER_NORESTART;
+
+	this = NULL;
+	kvm_for_each_vcpu(i, vcpu, ktimer->kvm) {
+		if (kvm_lapic_enabled(vcpu)) {
+			this = vcpu;
+			break;
+		}
+	}
+	if (this == NULL)
+		this = ktimer->kvm->bsp_vcpu;

If any lapic is enabled, 'this' is chosen as the lowest numbered vcpu with an enabled lapic; otherwise it falls back to the bsp.

+
+	/*
+	 * There is a race window between reading and incrementing, but we do
+	 * not care about potentially losing timer events in the !reinject
+	 * case anyway.
+	 */
+	if (ktimer->reinject || !atomic_read(&ktimer->pending)) {
+		atomic_inc(&ktimer->pending);
+		/* FIXME: this code should not know anything about vcpus */
+		set_bit(KVM_REQ_PENDING_TIMER, &this->requests);

The comment is correct...

Possible fix:
- audit the code to make sure that it will work regardless of which vcpu it is tied to
- install an irq ack notifier for the pit interrupt
- pull the pit hrtimer towards the vcpu that last acked it

+	}
+
+	if (waitqueue_active(&this->wq))
+		wake_up_interruptible(&this->wq);

Need an IPI if the vcpu is in guest mode and we set KVM_REQ_PENDING_TIMER earlier.
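Something like this (sketch only; kvm_vcpu_kick() is the existing helper that does both the halt wakeup and the reschedule IPI to a vcpu in guest mode):

	if (ktimer->reinject || !atomic_read(&ktimer->pending)) {
		atomic_inc(&ktimer->pending);
		set_bit(KVM_REQ_PENDING_TIMER, &this->requests);
	}
	/* wakes 'this' if halted, or IPIs it if currently in guest mode,
	 * so the request bit above is noticed promptly */
	kvm_vcpu_kick(this);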

+
+	if (ktimer->t_ops->is_periodic(ktimer)) {
+		hrtimer_add_expires_ns(&ktimer->timer, ktimer->period);
+		restart_timer = 1;
+	}
+
+	if (restart_timer)
+		return HRTIMER_RESTART;
+	else
+		return HRTIMER_NORESTART;

Unrelated: as soon as ->pending starts to accumulate, we should return HRTIMER_NORESTART and calculate pending from the period and the current time. This will reduce hrtimer firings if the system is overloaded, or if the guest has masked the pit and is using some other time source.
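Rough sketch of what I mean (untested; the helper name is made up, the kvm_timer fields are the existing ones):

/* when the guest finally services the pit, fold in all the periods that
 * elapsed while ->pending was already nonzero, instead of keeping the
 * hrtimer firing once per missed period */
static void kvm_timer_catch_up(struct kvm_timer *ktimer)
{
	ktime_t now = ktime_get();
	s64 elapsed = ktime_to_ns(ktime_sub(now,
				  hrtimer_get_expires(&ktimer->timer)));
	u64 missed;

	if (ktimer->period <= 0 || elapsed <= 0)
		return;

	missed = div64_u64((u64)elapsed, (u64)ktimer->period) + 1;
	atomic_add((int)missed, &ktimer->pending);
	hrtimer_add_expires_ns(&ktimer->timer, missed * ktimer->period);
}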

@@ -1106,13 +1112,11 @@ int kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu)
  	u32 lvt0 = apic_get_reg(vcpu->arch.apic, APIC_LVT0);
  	int r = 0;

-	if (kvm_vcpu_is_bsp(vcpu)) {
-		if (!apic_hw_enabled(vcpu->arch.apic))
-			r = 1;
-		if ((lvt0 & APIC_LVT_MASKED) == 0 &&
-		    GET_APIC_DELIVERY_MODE(lvt0) == APIC_MODE_EXTINT)
-			r = 1;
-	}
+	if (!apic_hw_enabled(vcpu->arch.apic))
+		r = 1;
+	if ((lvt0 & APIC_LVT_MASKED) == 0 &&
+	    GET_APIC_DELIVERY_MODE(lvt0) == APIC_MODE_EXTINT)
+		r = 1;
  	return r;
  }

This is the crux, yes? I think that if we pull the vcpu calculation out of pit_timer_fn into an ack notifier, we can end up with the original kvm_timer_fn.
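Very rough sketch of that direction (kvm_pit_ack_irq() and the irq_ack_notifier embedded in kvm_kpit_state already exist; the last_acked_vcpu field, the plumbing that fills it in, and the migrate helper are all assumptions):

static void kvm_pit_ack_irq(struct kvm_irq_ack_notifier *kian)
{
	struct kvm_kpit_state *ps = container_of(kian, struct kvm_kpit_state,
						 irq_ack_notifier);
	struct kvm_vcpu *vcpu;

	/* ... existing irq_ack bookkeeping ... */

	/*
	 * irq_acked() is not handed the acking vcpu today, so this needs a
	 * little plumbing; assume it ends up in ps->last_acked_vcpu.
	 */
	vcpu = ps->last_acked_vcpu;
	if (vcpu)
		/* hypothetical helper: cancel and restart the pit hrtimer on
		 * vcpu's cpu, much as __kvm_migrate_pit_timer() does today */
		kvm_pit_pull_timer_to(ps, vcpu);
}

With that in place, pit_timer_fn could stay the generic kvm_timer_fn and simply kick ktimer->vcpu, with the notifier keeping ktimer->vcpu retargeted at the right place.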


diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 9fd5b3e..c492bae 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -98,7 +98,7 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
  			if (r < 0)
  				r = 0;
  			r += kvm_apic_set_irq(vcpu, irq);
-		} else {
+		} else if (kvm_lapic_enabled(vcpu)) {
  			if (!lowest)
  				lowest = vcpu;
  			else if (kvm_apic_compare_prio(vcpu, lowest) < 0)

Unrelated fix?

The patch is large for such a ticklish subject. I'd be more than happy to remove the vcpu affinity optimization, fix the code, and reinstate the optimization later, since it will make the patch(es) much easier to review.

--
error compiling committee.c: too many arguments to function


