On 22/06/12 17:47, Christoffer Dall wrote:
> On Fri, Jun 22, 2012 at 10:22 AM, Marc Zyngier <marc.zyngier at arm.com> wrote:
>> On 22/06/12 04:00, Christoffer Dall wrote:
>>> On Mon, May 14, 2012 at 9:05 AM, Marc Zyngier <marc.zyngier at arm.com> wrote:
>>>> Add VGIC virtual CPU interface code, picking pending interrupts
>>>> from the distributor and stashing them in the VGIC control interface
>>>> list registers.
>>>>
>>>> Signed-off-by: Marc Zyngier <marc.zyngier at arm.com>
>>>> ---
>>>>  arch/arm/include/asm/kvm_vgic.h |   30 +++++
>>>>  arch/arm/kvm/vgic.c             |  230 ++++++++++++++++++++++++++++++++++++++-
>>>>  2 files changed, 259 insertions(+), 1 deletions(-)
>>>>
>>>> diff --git a/arch/arm/include/asm/kvm_vgic.h b/arch/arm/include/asm/kvm_vgic.h
>>>> index dca2d09..939f3f8 100644
>>>> --- a/arch/arm/include/asm/kvm_vgic.h
>>>> +++ b/arch/arm/include/asm/kvm_vgic.h
>>>> @@ -106,14 +106,44 @@ struct vgic_dist {
>>>>  };
>>>>
>>>>  struct vgic_cpu {
>>>> +#ifdef CONFIG_KVM_ARM_VGIC
>>>> +	spinlock_t	lock;
>>>> +
>>>> +	u8	vgic_irq_lr_map[VGIC_NR_IRQS];	/* per IRQ to LR mapping */
>>>> +	u8	vgic_lr_irq_map[64];		/* per LR to IRQ mapping */
>>>
>>> if someone changed the VGIC_NR_IRQS to higher than 256, you're in trouble
>>> here.
>>
>> Good point. Two options then: turning vgic_lr_irq_map[] into a u16 array,
>> or limiting VGIC_NR_IRQS to 256. I think this is high enough, but who
>> knows?
>>
>
> probably high enough, but then we should have something like
>
> #if VGIC_NR_IRQS > 256
> #error "blah"
> #endif
>
> but it seems like this problem will magically go away from dropping the array.

Yes, it's now gone from my tree.

>>>> diff --git a/arch/arm/kvm/vgic.c b/arch/arm/kvm/vgic.c
>>>> index 1ace859..ccd8b69 100644
>>>> --- a/arch/arm/kvm/vgic.c
>>>> +++ b/arch/arm/kvm/vgic.c
>>>> @@ -416,12 +416,35 @@ static void vgic_dispatch_sgi(struct kvm_vcpu *vcpu, u32 reg)
>>>>
>>>>  static int compute_pending_for_cpu(struct kvm_vcpu *vcpu)
>>>>  {
>>>> -	return 0;
>>>> +	struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
>>>> +	unsigned long *pending, *enabled, *pend;
>>>> +	int vcpu_id;
>>>> +
>>>> +	vcpu_id = vcpu->vcpu_id;
>>>> +	pend = vcpu->arch.vgic_cpu.pending;
>>>
>>> why are we doing this work on the vgic structure? couldn't it just be
>>> a local variable here? why is this pending value from last call
>>> preserved?
>>
>> We need that state when populating the list registers.
>>
>
> but you call it from there again, so you could just set a pointer
> being passed in?
>
> it just scares me when you store values that are somehow 'transient'
> in nature (we should totally write this in ML or LISP). I guess it's a
> way of avoiding an extra kmalloc, although it's a bitmap that should
> always be statically allocatable by the caller (128 bytes max)?

Well, having it as a per-vcpu structure makes the allocation a one-off
cost. I'll turn it into a stack-allocated bitmap if you feel it makes
things cleaner.
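Roughly what I have in mind (completely untested, and the irq_state /
irq_enabled fields and the vgic_bitmap_get_cpu_map() accessor below are
only stand-ins for whatever per-cpu bitmap helpers the series ends up
with; the private/shared split and the irq_target handling are also
ignored here, just to show the calling convention):

static int compute_pending_for_cpu(struct kvm_vcpu *vcpu,
				   unsigned long *pend)
{
	struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
	unsigned long *pending, *enabled;
	int vcpu_id = vcpu->vcpu_id;

	/* pending & enabled for this vcpu, written into the caller's bitmap */
	pending = vgic_bitmap_get_cpu_map(&dist->irq_state, vcpu_id);
	enabled = vgic_bitmap_get_cpu_map(&dist->irq_enabled, vcpu_id);
	bitmap_and(pend, pending, enabled, VGIC_NR_IRQS);

	return !bitmap_empty(pend, VGIC_NR_IRQS);
}

with __kvm_vgic_sync_to_cpu() doing something like:

	DECLARE_BITMAP(pend, VGIC_NR_IRQS);

	if (!compute_pending_for_cpu(vcpu, pend))
		goto epilog;

which keeps the bitmap on the stack (at most 128 bytes, as you say) and
drops the per-vcpu copy entirely.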
>>>>  /*
>>>>   * Update the interrupt state and determine which CPUs have pending
>>>>   * interrupts. Must be called with distributor lock held.
>>>> + *
>>>> + * It would be very tempting to just compute the pending bitmap once,
>>>> + * but that would make it quite ugly locking wise when a vcpu actually
>>>> + * moves the interrupt to its list registers (think of a single
>>>> + * interrupt pending on several vcpus). So we end up computing the
>>>> + * pending list twice (once here, and once in __kvm_vgic_sync_to_cpu).
>>>
>>> I don't understand this comment (at least not yet). Why are we
>>> calculating it here if we're calculating it in a different way, but
>>> for the same, in another function. I guess I will try and find out.
>>
>> A single interrupt can be targeted at multiple vcpus (through
>> irq_target). When we compute the global state by calling
>> vgic_update_state(), we consider that such an interrupt is pending on
>> all possible vcpus. Whoever services the interrupt first wins the race.
>>
>> Now, when a vcpu gets to be executed, we enter kvm_vgic_sync_to_cpu().
>> By virtue of holding both the distributor lock and the vgic_cpu lock, we
>> know the state is frozen. Because this interrupt can have been serviced
>> by another vcpu, we have to recompute the local vcpu state.
>>
>> If we don't want to recompute that state locally, then we'd have to
>> somehow notice that we're handling an interrupt with multiple targets,
>> acquire the other vcpu locks, and recompute their state! Net effect? Not
>> much.
>>
>> What we could also do is consider that these interrupts are too rare to
>> be cared about in an efficient manner, and give them a totally separate
>> code path.
>>
>
> Hmm, ok. I kind of understand. But how does all this work for SPIs
> using the N-1 model? Is our approach not that we simply assign the SPI
> to the vCPU that just happens to be doing the first world-switch after
> that SPI was raised - if that CPU is masking interrupts and doing
> something for a long time, the interrupt latency is really long, no?
>
> IOW is this not only a small step (if any) above just prematurely
> choosing one of the candidate VCPUs for each SPI configured to target
> more than one CPU (this should be allowed, if I read the note in
> section 1.4.3 of the GICv2 specs correctly)?

Hmmm. Actually, this may be our best option. And a Linux guest never
uses the multiple target facility anyway. I'll have a go at only
picking one CPU in the irq_target field and see how it simplifies
things.

>>>> +/*
>>>> + * Fill the list registers with pending interrupts before running the
>>>> + * guest.
>>>> + */
>>>> +static void __kvm_vgic_sync_to_cpu(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +	struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu;
>>>> +	struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
>>>> +	unsigned long *pending;
>>>> +	int i, c, vcpu_id;
>>>> +	int overflow = 0;
>>>> +
>>>> +	vcpu_id = vcpu->vcpu_id;
>>>> +
>>>> +	/*
>>>> +	 * We may not have any pending interrupt, or the interrupts
>>>> +	 * may have been serviced from another vcpu. In all cases,
>>>> +	 * move along.
>>>> +	 */
>>>> +	if (!kvm_vgic_vcpu_pending_irq(vcpu) ||
>>>> +	    !compute_pending_for_cpu(vcpu)) {
>>>
>>> I still don't see why there's a precomputed value here that we cannot
>>> trust, so we recalculate it here (can we ever trust the precomputed
>>> one?)
>>
>> See the detailed explanation above.
>>
>>> oh, but we can trust the precomputed value if there are no interrupts,
>>> so we only have to recalculate if there are pending interrupts - what
>>> is the explanation for this || here?
>>
>> Would you prefer the line below?
>>
>> if (!(kvm_vgic_vcpu_pending_irq(vcpu) && compute_pending_for_cpu(vcpu)))
>>
>
> so let's say there's not something pending on the CPU from the LRs,
> but there's something pending on the distributor side, then the
> functions would return:
>
> kvm_vgic_vcpu_pending_irq: 0
> compute_pending_for_cpu: 1
>
> if (!0 || !1)
>     goto epilog;
>
> which is the same as:
>
> if (1)
>     goto epilog;
>
> which clears the irq_pending_on_cpu and never copies the pending
> distributor interrupt to the LR.
>
> so, perhaps the ordering is incorrect, or the check should be:
>
> /* nothing to transfer to LRs */
> if (!compute_pending_for_cpu(vcpu))
>     goto epilog;
>
> /* do we want to do work if there's already pending LRs ? */
> if (kvm_vgic_vcpu_pending_irq(vcpu))
>     /* either goto epilog or continue with the fun, choice */
>
> why is this hurting my brain?

Same here, and it's only Monday morning. I'll revisit this as it is
obviously flawed. Thanks for the truth table! ;-)
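What I'll probably end up with for the next round is something along
these lines (untested, and modulo the stack-allocated pend bitmap
discussed above):

	/*
	 * Recompute what is actually pending for this vcpu. If another
	 * vcpu has already serviced everything, there is nothing to
	 * transfer to the LRs and we take the early exit (which clears
	 * our bit in irq_pending_on_cpu).
	 */
	if (!compute_pending_for_cpu(vcpu))
		goto epilog;

i.e. whatever already sits in the LRs is left alone, and
kvm_vgic_vcpu_pending_irq() stops gating the transfer altogether.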
>>>> +void kvm_vgic_sync_to_cpu(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +	struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
>>>> +	struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu;
>>>> +
>>>> +	if (!irqchip_in_kernel(vcpu->kvm))
>>>> +		return;
>>>> +
>>>> +	spin_lock(&dist->lock);
>>>> +	spin_lock(&vgic_cpu->lock);
>>>> +	__kvm_vgic_sync_to_cpu(vcpu);
>>>> +	spin_unlock(&vgic_cpu->lock);
>>>> +	spin_unlock(&dist->lock);
>>>
>>> Two spin_locks for every guest entry/exit. Can we not avoid this
>>> somehow? (so much work so far to have a lock-less entry in the common
>>> case).
>>
>> Interrupt injection is inherently racy. You really don't know what is
>> happening from userspace, or from another vcpu. I'm almost convinced we
>> could remove vgic_cpu->lock, as we only mess with that data in the
>> context of this vcpu.
>>
>> But the distributor lock is here to stay, I'm afraid.
>>
>
> why is that? isn't everything bit operations that is serializable in
> nature? (I'm thinking about the fact that the distributor as a device
> doesn't have a lock, does it?)

Well, I definitely expect the HW distributor to have hazard checking
between MMIO accesses and external signaling. The guest itself should
have its own locking to serialize concurrent CPU access though.

I suppose we could use fancy stuff such as RCU to avoid the cost of a
single spinlock, but how often will this spinlock be contended? It
would take some profiling to find out, but I have the feeling that the
contention will be very low (we always kick the vcpu once the lock has
been released).

>>>> +	return !!(atomic_read(&dist->irq_pending_on_cpu) & (1 << vcpu->vcpu_id));
>>>
>>> if this gets changed to test_bit, you should creep just below the 81
>>> characters width here ;)
>>
>> Let me check first if we can actually allow this not to be atomic.
>>
>
> as far as I can see you should be fine with the bit operations, they
> are atomic 'enough' :)

Probably. Will give it a go.

Thanks,

	M.
--
Jazz is not dead. It just smells funny...