On 11/25/2014 02:12 PM, Eric Auger wrote:
> On 11/25/2014 11:19 AM, Christoffer Dall wrote:
>> [Re-adding cc list]
>>
>> On Mon, Nov 24, 2014 at 05:42:30PM +0100, Eric Auger wrote:
>>> On 11/24/2014 04:47 PM, Christoffer Dall wrote:
>>>> On Mon, Nov 24, 2014 at 12:02 PM, Eric Auger <eric.auger@xxxxxxxxxx> wrote:
>>>>> On 11/24/2014 11:00 AM, Christoffer Dall wrote:
>>>>>> On Sun, Nov 23, 2014 at 06:56:59PM +0100, Eric Auger wrote:
>>>>>>> This patch enables irqfd on arm.
>>>>>>>
>>>>>>> Both irqfd and resamplefd are supported. Injection is implemented
>>>>>>> in vgic.c without routing.
>>>>>>>
>>>>>>> This patch enables CONFIG_HAVE_KVM_EVENTFD and CONFIG_HAVE_KVM_IRQFD.
>>>>>>>
>>>>>>> KVM_CAP_IRQFD is now advertised. The KVM_CAP_IRQFD_RESAMPLE capability
>>>>>>> is automatically advertised as soon as CONFIG_HAVE_KVM_IRQFD is set.
>>>>>>>
>>>>>>> Signed-off-by: Eric Auger <eric.auger@xxxxxxxxxx>
>>>>>>>
>>>>>>> ---
>>
>> [...]
>>
>>>>>>> +int kvm_set_irq(struct kvm *kvm, int irq_source_id,
>>>>>>> +		u32 irq, int level, bool line_status)
>>>>>>> +{
>>>>>>> +	unsigned int spi = irq + VGIC_NR_PRIVATE_IRQS;
>>>>>>> +
>>>>>>> +	kvm_debug("irqfd sets vIRQ %d to %d\n", irq, level);
>>>>>>> +
>>>>>>> +	if (likely(vgic_initialized(kvm))) {
>>>>>>> +		if (spi > kvm->arch.vgic.nr_irqs)
>>>>>>> +			return -EINVAL;
>>>>>>> +		return kvm_vgic_inject_irq(kvm, 0, spi, level);
>>>>>>> +	} else if (level && kvm_irq_has_notifier(kvm, 0, irq)) {
>>>>>>> +		/*
>>>>>>> +		 * any IRQ injected before vgic readiness is
>>>>>>> +		 * ignored and the notifier, if any, is called
>>>>>>> +		 * immediately as if the virtual IRQ were completed
>>>>>>> +		 * by the guest
>>>>>>> +		 */
>>>>>>> +		kvm_notify_acked_irq(kvm, 0, irq);
>>>>>>> +		return -EAGAIN;
>>>>>>
>>>>>> This looks fishy, I don't fully understand which case you're catering
>>>>>> to here (the comment is hard to understand), but my gut feeling is
>>>>>> that you should back out (probably with an error) if the vgic is not
>>>>>> initialized when calling this function. Setting up the routing in the
>>>>>> first place should probably error out if the vgic is not available.
>>>>>
>>>>> The issue is related to integration with QEMU and VFIO.
>>>>> Currently VFIO signaling (we tell VFIO to signal the eventfd on a
>>>>> particular physical IRQ) and irqfd setup (we tell KVM/IRQFD to handle
>>>>> the previous eventfd when triggered and inject a GSI) are done by QEMU
>>>>> *before* the first vcpu run, so before the VGIC is ready.
>>>>>
>>>>> Both are done in a so-called QEMU machine init done notifier. It could
>>>>> be done in a QEMU reset notifier but I guess the problem would be the
>>>>> same, and I think the stage at which QEMU initializes it is correct.
>>>>>
>>>>> As soon as both VFIO signaling and irqfd are set up, virtual IRQs are
>>>>> likely to be injected, and this is what happens in some circumstances.
>>>>> This happens on the 2nd QEMU run after killing the 1st QEMU session.
>>>>> For some reason I guess the HW device - in my case the xgmac - was
>>>>> released in such a way that the IRQ wire was not reset. So as soon as
>>>>> the VFIO driver calls request_irq, the IRQ hits.
>>>>>
>>>>> I tried to change this xgmac driver behavior but I must acknowledge
>>>>> that for the time being I failed. I will continue investigating that.
>>>>>
>>>>> Indeed I might be fixing the issue at the wrong place. But anyway this
>>>>> relies on the fact that the assigned device driver brought the IRQ
>>>>> line down properly when releasing. At the time the signaling is set
>>>>> up, we have not yet entered any driver reset code.
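For readers following the thread: the "VFIO signaling + irqfd setup"
sequence described above boils down to wiring two eventfds into KVM with
the KVM_IRQFD ioctl. Below is a minimal userspace sketch of that sequence;
the setup_irqfd helper is invented for illustration, while the structure,
flag and ioctls are the stock KVM eventfd API:

#include <stdint.h>
#include <string.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Sketch only: error handling omitted; vm_fd and gsi come from the VMM. */
static int setup_irqfd(int vm_fd, uint32_t gsi)
{
	/* VFIO is told to signal trigger_fd when the physical IRQ fires. */
	int trigger_fd = eventfd(0, EFD_CLOEXEC);
	/* KVM signals resample_fd on guest EOI so VFIO can unmask again. */
	int resample_fd = eventfd(0, EFD_CLOEXEC);
	struct kvm_irqfd irqfd;

	memset(&irqfd, 0, sizeof(irqfd));
	irqfd.fd = trigger_fd;
	irqfd.gsi = gsi;
	irqfd.flags = KVM_IRQFD_FLAG_RESAMPLE;
	irqfd.resamplefd = resample_fd;

	/*
	 * This is issued from QEMU's machine-init-done notifier, i.e.
	 * potentially before the vgic is ready.
	 */
	return ioctl(vm_fd, KVM_IRQFD, &irqfd);
}

If this succeeds before the first vcpu run, a physical IRQ that is already
asserted can reach kvm_set_irq() while vgic_initialized() is still false,
which is exactly the window the -EAGAIN branch in the patch tries to handle.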
>>>>
>>>> I see, it's because of the quirky way we initialize the vgic at the
>>>> time of running the first CPU, so user space can't really hook into
>>>> the sweet spot after initializing the vgic but just before actually
>>>> running guest code.
>>>
>>> Yes, currently the generic irqfd code has no way to test whether the
>>> virtual interrupt controller is "ready" (KVM_IRQFD -> kvm_irqfd_assign)
>>> because I guess other archs don't have the problem.
>>>>
>>>> Could it be that we simply need to let user space init the vgic after
>>>> creating all its vcpus and only then allow setting up the irqfd?
>>>
>>> I start VFIO signaling and set up KVM_IRQFD in a QEMU machine init done
>>> notifier. The vgic init must thus happen before that point.
>>
>> have all the vcpus that QEMU wants to create been created by then?

Hi Christoffer,

> Hi Christoffer,
>
> My understanding of the QEMU start sequence is as follows:
> at machine init, the CPU realize function is called for each CPU
> (target-arm/cpu.c arm_cpu_realizefn)
>   qemu_init_vcpu (cpus.c)
>     qemu_kvm_start_vcpu (cpus.c)
>       launch VCPU thread (qemu_kvm_cpu_thread_fn)
>         kvm_init_vcpu
> kvm_init_vcpu does the while (1) loop issuing the KVM_RUN ioctl if
> *cpu_can_run*
>
> cpu_can_run returns true when vm_start is called
> (resume_all_vcpus -> cpu_resume)
>
> QEMU machine init done or reset callbacks happen after machine init
> and before vm_start.
>
> The actual vgic "readiness" is set on the first vcpu run, i.e. on
> vm_start.
>
> vgic instantiation in the virt machine file is done after CPU
> instantiation. We could force the vgic init at that time, in the gic
> realize fn (kvm_arm_gic_realize)? All vcpus are instantiated: nr_vcpus
> and nr_irqs are known. Also base addresses are known.
>
>>
>>>
>>> Currently the VGIC KVM device has groups/attributes that allow setting
>>> registers (vgic_attr_regs_access) and *partially* initializing the
>>> state (vgic_init_maps). Enhancing that, we could indeed force the vgic
>>> init earlier.
>>>
>>
>> We would probably add a new attribute to the vgic device api if we were
>> to go down that road.
>
> yes
>
>>
>>>>
>>>> Alternatively we can refactor the whole thing so that we don't mess
>>>> with the pending state etc. directly in the vgic_update_irq function,
>>>> but just set the level state (or remember that there was an edge,
>>>> hummm, maybe not) and later propagate this to the vcpus in
>>>> compute_pending_for_cpu().
>>>
>>> yes, we could indeed record the level info very early and not do
>>> anything with it until the vgic is ready. This would oblige us to init
>>> the level bitmap in yet another, earlier stage.
>>
>> there's an unsolved problem that you may not have created all your data
>> structures or know how many IRQs you have at this point, right?
>
> yes that's correct, at that time we do not know the nr_irqs.
>
>>
>>>>
>>>> What you're doing here is to continuously ack and re-request the irq
>>>> from the vfio driver until you are ready to receive it, is that right?
>>>
>>> yes, I unmask the IRQ so that it can hit again (it was deactivated by
>>> the host and masked by the VFIO platform driver). If I don't call the
>>> notifier it will remain masked and never hit again.
>>>
>>
>> Don't you simply lose the interrupt for edge-triggered interrupts then?
>
> yes we do. Even for level-sensitive, we lose them!
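Coming back to the "new attribute to the vgic device api" idea a few
paragraphs up: from userspace, such a forced early init could look roughly
like the sketch below. struct kvm_device_attr and KVM_SET_DEVICE_ATTR are
the existing KVM device-API plumbing; the group/attribute names and values
are hypothetical placeholders for whatever the new control would be called:

#include <sys/ioctl.h>
#include <linux/kvm.h>

#define VGIC_GRP_CTRL	4	/* hypothetical control group */
#define VGIC_CTRL_INIT	0	/* hypothetical "init now" attribute */

static int vgic_force_init(int vgic_dev_fd)
{
	struct kvm_device_attr attr = {
		.group = VGIC_GRP_CTRL,
		.attr  = VGIC_CTRL_INIT,
		/* no payload: the attribute itself is the command */
	};

	/*
	 * QEMU would call this once all vcpus exist and the distributor/
	 * cpu interface base addresses are set, but before any KVM_IRQFD
	 * setup, so kvm_set_irq() never sees an uninitialized vgic.
	 */
	return ioctl(vgic_dev_fd, KVM_SET_DEVICE_ATTR, &attr);
}

On the kernel side, the handler for such an attribute would simply run the
same init that currently happens lazily on the first vcpu run.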
>
> But I think the core of the problem is why those IRQs do hit! And I now
> realize I did not have a good representation of what happens on the
> ctrl-C of QEMU. The xgmac driver remove function is not called at all,
> I guess (explaining why hacking it to clear and disable the IRQ did not
> change anything :-( /me stupid). So the HW is left free-running. It
> might happen that an XGMAC IRQ was high at that time. When the VFIO
> driver is released, it does not change the state of the XGMAC HW, but
> this calls free_irq on the physical IRQ. When QEMU reloads the VFIO
> driver the second time and calls request_irq, it re-enables the IRQ at
> the VGIC and the XGMAC IRQ hits.
>
> The problem we observe relates to the lack of a proper reset of the
> device - an old discussion we had with Alex in Feb!
>
> Those IRQs relate to the HW state of the last QEMU session. It may be
> acceptable to ignore them until the guest duly resets the device.
> Handling the IRQ in the new QEMU session may be wrong as well. Some
> devices might not understand why they get a functional IRQ while they
> are not even initialized!

Note that this is also what happens in my implementation, since as soon as
the guest driver does request_irq it might get an unexpected IRQ. It works
on the xgmac, but are we sure it works on another device?

> The other alternative is to implement some device dependent reset on
> the host kernel side, some kind of kernel hook in the vfio driver, but
> that is another rough path ...

Another solution would be to advise the usage of a host-side user driver
just to clean the state of the HW device in such a case. Laziness?

Eric

>
> Best Regards
>
> Eric
>
>>
>> -Christoffer
>>
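For completeness, the "device dependent reset on the host kernel side"
alternative could be sketched as a per-compatible reset hook in the vfio
platform driver. Everything below - the structure, the table, the register
offsets - is invented for illustration; the point is only the shape such a
hook could take:

#include <linux/io.h>

#define XGMAC_DMA_INTR_ENA	0x101c	/* made-up register offsets */
#define XGMAC_DMA_STATUS	0x1014

/* Hypothetical per-device reset hook, matched by DT compatible string. */
struct vfio_platform_reset_node {
	const char *compat;			/* e.g. "calxeda,hb-xgmac" */
	int (*reset)(void __iomem *base);	/* quiesce + clear IRQs */
};

static int xgmac_reset(void __iomem *base)
{
	/* mask all interrupt sources and ack anything still pending */
	writel(0, base + XGMAC_DMA_INTR_ENA);
	writel(~0U, base + XGMAC_DMA_STATUS);
	return 0;
}

static const struct vfio_platform_reset_node reset_table[] = {
	{ .compat = "calxeda,hb-xgmac", .reset = xgmac_reset },
};

The vfio platform driver would walk reset_table on device open and release,
so a stale IRQ left over from a previous QEMU session could never fire into
a freshly started one.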