On 22/06/12 17:47, Christoffer Dall wrote:
> On Fri, Jun 22, 2012 at 10:22 AM, Marc Zyngier <marc.zyngier at arm.com> wrote:
>> On 22/06/12 04:00, Christoffer Dall wrote:
>>> On Mon, May 14, 2012 at 9:05 AM, Marc Zyngier <marc.zyngier at arm.com> wrote:
>>>> Add VGIC virtual CPU interface code, picking pending interrupts
>>>> from the distributor and stashing them in the VGIC control interface
>>>> list registers.
>>>>
>>>> Signed-off-by: Marc Zyngier <marc.zyngier at arm.com>
>>>> ---
>>>>  arch/arm/include/asm/kvm_vgic.h |   30 +++++
>>>>  arch/arm/kvm/vgic.c             |  230 ++++++++++++++++++++++++++++++++++++++-
>>>>  2 files changed, 259 insertions(+), 1 deletions(-)
>>>>
>>>> diff --git a/arch/arm/include/asm/kvm_vgic.h b/arch/arm/include/asm/kvm_vgic.h
>>>> index dca2d09..939f3f8 100644
>>>> --- a/arch/arm/include/asm/kvm_vgic.h
>>>> +++ b/arch/arm/include/asm/kvm_vgic.h
>>>> @@ -106,14 +106,44 @@ struct vgic_dist {
>>>>  };
>>>>
>>>>  struct vgic_cpu {
>>>> +#ifdef CONFIG_KVM_ARM_VGIC
>>>> +	spinlock_t	lock;
>>>> +
>>>> +	u8	vgic_irq_lr_map[VGIC_NR_IRQS];	/* per IRQ to LR mapping */
>>>> +	u8	vgic_lr_irq_map[64];		/* per LR to IRQ mapping */
>>>
>>> if someone changed the VGIC_NR_IRQS to higher than 256, you're in trouble
>>> here.
>>
>> Good point. Two options then: turning vgic_lr_irq_map[] into a u16 array,
>> or limiting VGIC_NR_IRQS to 256. I think this is high enough, but who
>> knows?
>>
>
> probably high enough, but then we should have something like
>
> #if VGIC_NR_IRQS > 256
> #error "blah"
> #endif
>
> but it seems like this problem will magically go away from dropping the array.

Yes, it's now gone from my tree.

>>>> diff --git a/arch/arm/kvm/vgic.c b/arch/arm/kvm/vgic.c
>>>> index 1ace859..ccd8b69 100644
>>>> --- a/arch/arm/kvm/vgic.c
>>>> +++ b/arch/arm/kvm/vgic.c
>>>> @@ -416,12 +416,35 @@ static void vgic_dispatch_sgi(struct kvm_vcpu *vcpu, u32 reg)
>>>>
>>>>  static int compute_pending_for_cpu(struct kvm_vcpu *vcpu)
>>>>  {
>>>> -	return 0;
>>>> +	struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
>>>> +	unsigned long *pending, *enabled, *pend;
>>>> +	int vcpu_id;
>>>> +
>>>> +	vcpu_id = vcpu->vcpu_id;
>>>> +	pend = vcpu->arch.vgic_cpu.pending;
>>>
>>> why are we doing this work on the vgic structure? couldn't it just be
>>> a local variable here? why is this pending value from last call
>>> preserved?
>>
>> We need that state when populating the list registers.
>>
>
> but you call it from there again, so you could just set a pointer
> being passed in?
>
> it just scares me when you store values that are somehow 'transient'
> in nature (we should totally write this in ML or LISP). I guess it's a
> way of avoiding an extra kmalloc, although it's a bitmap that should
> always be statically allocatable by the caller (128 bytes max)?

Well, having it as a per-vcpu structure makes the allocation a one-off
cost. I'll turn it into a stack-allocated bitmap if you feel it makes
things cleaner.
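Roughly what I have in mind (completely untested, and the irq_state /
irq_enabled fields and the vgic_bitmap_get_cpu_map() accessor below are
only stand-ins for whatever per-cpu bitmap helpers the series ends up
with; the private/shared split and the irq_target handling are also
ignored here, just to show the calling convention):

static int compute_pending_for_cpu(struct kvm_vcpu *vcpu,
				   unsigned long *pend)
{
	struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
	unsigned long *pending, *enabled;
	int vcpu_id = vcpu->vcpu_id;

	/* pending & enabled for this vcpu, written into the caller's bitmap */
	pending = vgic_bitmap_get_cpu_map(&dist->irq_state, vcpu_id);
	enabled = vgic_bitmap_get_cpu_map(&dist->irq_enabled, vcpu_id);
	bitmap_and(pend, pending, enabled, VGIC_NR_IRQS);

	return !bitmap_empty(pend, VGIC_NR_IRQS);
}

with __kvm_vgic_sync_to_cpu() doing something like:

	DECLARE_BITMAP(pend, VGIC_NR_IRQS);

	if (!compute_pending_for_cpu(vcpu, pend))
		goto epilog;

which keeps the bitmap on the stack (at most 128 bytes, as you say) and
drops the per-vcpu copy entirely.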
>>>>  /*
>>>>   * Update the interrupt state and determine which CPUs have pending
>>>>   * interrupts. Must be called with distributor lock held.
>>>> + *
>>>> + * It would be very tempting to just compute the pending bitmap once,
>>>> + * but that would make it quite ugly locking wise when a vcpu actually
>>>> + * moves the interrupt to its list registers (think of a single
>>>> + * interrupt pending on several vcpus). So we end up computing the
>>>> + * pending list twice (once here, and once in __kvm_vgic_sync_to_cpu).
>>>
>>> I don't understand this comment (at least not yet). Why are we
>>> calculating it here if we're calculating it in a different way, but
>>> for the same, in another function. I guess I will try and find out.
>>
>> A single interrupt can be targeted at multiple vcpus (through
>> irq_target). When we compute the global state by calling
>> vgic_update_state(), we consider that such an interrupt is pending on
>> all possible vcpus. Whoever services the interrupt first wins the race.
>>
>> Now, when a vcpu gets to be executed, we enter kvm_vgic_sync_to_cpu().
>> By virtue of holding both the distributor lock and the vgic_cpu lock, we
>> know the state is frozen. Because this interrupt can have been serviced
>> by another vcpu, we have to recompute the local vcpu state.
>>
>> If we don't want to recompute that state locally, then we'd have to
>> somehow notice that we're handling an interrupt with multiple targets,
>> acquire the other vcpu locks, and recompute their state! Net effect? Not
>> much.
>>
>> What we could also do is consider that these interrupts are too rare to
>> be cared about in an efficient manner, and give them a totally separate
>> code path.
>>
>
> Hmm, ok. I kind of understand. But how does all this work for SPIs
> using the N-1 model? Is our approach not that we simply assign the SPI
> to the vCPU that just happens to be doing the first world-switch after
> that SPI was raised - if that CPU is masking interrupts and doing
> something for a long time, the interrupt latency is really long, no?
>
> IOW is this not only a small step (if any) above just prematurely
> choosing one of the candidate VCPUs for each SPI configured to target
> more than one CPU (this should be allowed, if I read the note in
> section 1.4.3 of the GICv2 specs correctly)?

Hmmm. Actually, this may be our best option. And a Linux guest never
uses the multiple target facility anyway. I'll have a go at only
picking one CPU in the irq_target field and see how it simplifies
things.

>>>> +/*
>>>> + * Fill the list registers with pending interrupts before running the
>>>> + * guest.
>>>> + */
>>>> +static void __kvm_vgic_sync_to_cpu(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +	struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu;
>>>> +	struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
>>>> +	unsigned long *pending;
>>>> +	int i, c, vcpu_id;
>>>> +	int overflow = 0;
>>>> +
>>>> +	vcpu_id = vcpu->vcpu_id;
>>>> +
>>>> +	/*
>>>> +	 * We may not have any pending interrupt, or the interrupts
>>>> +	 * may have been serviced from another vcpu. In all cases,
>>>> +	 * move along.
>>>> +	 */
>>>> +	if (!kvm_vgic_vcpu_pending_irq(vcpu) ||
>>>> +	    !compute_pending_for_cpu(vcpu)) {
>>>
>>> I still don't see why there's a precomputed value here that we cannot
>>> trust, so we recalculate it here (can we ever trust the precomputed
>>> one?)
>>
>> See the detailed explanation above.
>>
>>> oh, but we can trust the precomputed value if there are no interrupts,
>>> so we only have to recalculate if there are pending interrupts - what
>>> is the explanation for this || here?
>>
>> Would you prefer the line below?
>>
>> if (!(kvm_vgic_vcpu_pending_irq(vcpu) && compute_pending_for_cpu(vcpu)))
>>
>
> so let's say there's not something pending on the CPU from the LRs,
> but there's something pending on the distributor side, then the
> functions would return:
>
> kvm_vgic_vcpu_pending_irq: 0
> compute_pending_for_cpu: 1
>
> if (!0 || !1)
>     goto epilog;
>
> which is the same as:
>
> if (1)
>     goto epilog;
>
> which clears the irq_pending_on_cpu and never copies the pending
> distributor interrupt to the LR.
>
> so, perhaps the ordering is incorrect, or the check should be:
>
> /* nothing to transfer to LRs */
> if (!compute_pending_for_cpu(vcpu))
>     goto epilog;
>
> /* do we want to do work if there's already pending LRs ? */
> if (kvm_vgic_vcpu_pending_irq(vcpu))
>     /* either goto epilog or continue with the fun, choice */
>
> why is this hurting my brain?

Same here, and it's only Monday morning. I'll revisit this as it is
obviously flawed. Thanks for the truth table! ;-)
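What I'll probably end up with for the next round is something along
these lines (untested, and modulo the stack-allocated pend bitmap
discussed above):

	/*
	 * Recompute what is actually pending for this vcpu. If another
	 * vcpu has already serviced everything, there is nothing to
	 * transfer to the LRs and we take the early exit (which clears
	 * our bit in irq_pending_on_cpu).
	 */
	if (!compute_pending_for_cpu(vcpu))
		goto epilog;

i.e. whatever already sits in the LRs is left alone, and
kvm_vgic_vcpu_pending_irq() stops gating the transfer altogether.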
>>>> +void kvm_vgic_sync_to_cpu(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +	struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
>>>> +	struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu;
>>>> +
>>>> +	if (!irqchip_in_kernel(vcpu->kvm))
>>>> +		return;
>>>> +
>>>> +	spin_lock(&dist->lock);
>>>> +	spin_lock(&vgic_cpu->lock);
>>>> +	__kvm_vgic_sync_to_cpu(vcpu);
>>>> +	spin_unlock(&vgic_cpu->lock);
>>>> +	spin_unlock(&dist->lock);
>>>
>>> Two spin_locks for every guest entry/exit. Can we not avoid this
>>> somehow? (so much work so far to have a lock-less entry in the common
>>> case).
>>
>> Interrupt injection is inherently racy. You really don't know what is
>> happening from userspace, or from another vcpu. I'm almost convinced we
>> could remove vgic_cpu->lock, as we only mess with that data in the
>> context of this vcpu.
>>
>> But the distributor lock is here to stay, I'm afraid.
>>
>
> why is that? isn't everything bit operations that is serializable in
> nature? (I'm thinking about the fact that the distributor as a device
> doesn't have a lock, does it?)

Well, I definitely expect the HW distributor to have hazard checking
between MMIO accesses and external signaling. The guest itself should
have its own locking to serialize concurrent CPU access though.

I suppose we could use fancy stuff such as RCU to avoid the cost of a
single spinlock, but how often will this spinlock be contended? It
would take some profiling to find out, but I have the feeling that the
contention will be very low (we always kick the vcpu once the lock has
been released).

>>>> +	return !!(atomic_read(&dist->irq_pending_on_cpu) & (1 << vcpu->vcpu_id));
>>>
>>> if this gets changed to test_bit, you should creep just below the 81
>>> characters width here ;)
>>
>> Let me check first if we can actually allow this not to be atomic.
>>
>
> as far as I can see you should be fine with the bit operations, they
> are atomic 'enough' :)

Probably. Will give it a go.

Thanks,

	M.
--
Jazz is not dead. It just smells funny...