On 11/25/2014 02:12 PM, Eric Auger wrote:
> On 11/25/2014 11:19 AM, Christoffer Dall wrote:
>> [Re-adding cc list]
>>
>> On Mon, Nov 24, 2014 at 05:42:30PM +0100, Eric Auger wrote:
>>> On 11/24/2014 04:47 PM, Christoffer Dall wrote:
>>>> On Mon, Nov 24, 2014 at 12:02 PM, Eric Auger <eric.auger@xxxxxxxxxx> wrote:
>>>>> On 11/24/2014 11:00 AM, Christoffer Dall wrote:
>>>>>> On Sun, Nov 23, 2014 at 06:56:59PM +0100, Eric Auger wrote:
>>>>>>> This patch enables irqfd on arm.
>>>>>>>
>>>>>>> Both irqfd and resamplefd are supported. Injection is implemented
>>>>>>> in vgic.c without routing.
>>>>>>>
>>>>>>> This patch enables CONFIG_HAVE_KVM_EVENTFD and CONFIG_HAVE_KVM_IRQFD.
>>>>>>>
>>>>>>> KVM_CAP_IRQFD is now advertised. The KVM_CAP_IRQFD_RESAMPLE capability
>>>>>>> is automatically advertised as soon as CONFIG_HAVE_KVM_IRQFD is set.
>>>>>>>
>>>>>>> Signed-off-by: Eric Auger <eric.auger@xxxxxxxxxx>
>>>>>>>
>>>>>>> ---
>>
>> [...]
>>
>>>>>>> +int kvm_set_irq(struct kvm *kvm, int irq_source_id,
>>>>>>> +		u32 irq, int level, bool line_status)
>>>>>>> +{
>>>>>>> +	unsigned int spi = irq + VGIC_NR_PRIVATE_IRQS;
>>>>>>> +
>>>>>>> +	kvm_debug("irqfd sets vIRQ %d to %d\n", irq, level);
>>>>>>> +
>>>>>>> +	if (likely(vgic_initialized(kvm))) {
>>>>>>> +		if (spi > kvm->arch.vgic.nr_irqs)
>>>>>>> +			return -EINVAL;
>>>>>>> +		return kvm_vgic_inject_irq(kvm, 0, spi, level);
>>>>>>> +	} else if (level && kvm_irq_has_notifier(kvm, 0, irq)) {
>>>>>>> +		/*
>>>>>>> +		 * any IRQ injected before vgic readiness is
>>>>>>> +		 * ignored and the notifier, if any, is called
>>>>>>> +		 * immediately as if the virtual IRQ were completed
>>>>>>> +		 * by the guest
>>>>>>> +		 */
>>>>>>> +		kvm_notify_acked_irq(kvm, 0, irq);
>>>>>>> +		return -EAGAIN;
>>>>>>
>>>>>> This looks fishy, I don't fully understand which case you're catering
>>>>>> to here (the comment is hard to understand), but my gut feeling is
>>>>>> that you should back out (probably with an error) if the vgic is not
>>>>>> initialized when calling this function. Setting up the routing in the
>>>>>> first place should probably error out if the vgic is not available.
>>>>>
>>>>> The issue is related to integration with QEMU and VFIO.
>>>>> Currently VFIO signaling (we tell VFIO to signal the eventfd on a
>>>>> particular physical IRQ) and irqfd setup (we tell KVM/IRQFD to handle
>>>>> the previous eventfd when triggered and inject a GSI) are done by QEMU
>>>>> *before* the first vcpu run, so before the VGIC is ready.
>>>>>
>>>>> Both are done in a so-called QEMU machine init done notifier. It could
>>>>> be done in a QEMU reset notifier but I guess the problem would be the
>>>>> same, and I think the stage at which QEMU initializes it is correct.
>>>>>
>>>>> As soon as both VFIO signaling and irqfd are set up, virtual IRQs are
>>>>> likely to be injected, and this is what happens in some circumstances.
>>>>> This happens on the 2nd QEMU run after killing the 1st QEMU session.
>>>>> For some reason I guess the HW device - in my case the xgmac - was
>>>>> released in such a way that the IRQ wire was not reset. So as soon as
>>>>> the VFIO driver calls request_irq, the IRQ hits.
>>>>>
>>>>> I tried to change this xgmac driver behavior but I must acknowledge
>>>>> that for the time being I failed. I will continue investigating that.
>>>>>
>>>>> Indeed I might be fixing the issue at the wrong place. But anyway this
>>>>> relies on the fact that the assigned device driver brought the IRQ
>>>>> line down properly when releasing. At the time the signaling is set
>>>>> up, we have not yet entered any driver reset code.
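For readers following the thread: the "VFIO signaling + irqfd setup"
sequence described above boils down to wiring two eventfds into KVM with
the KVM_IRQFD ioctl. Below is a minimal userspace sketch of that sequence;
the setup_irqfd helper is invented for illustration, while the structure,
flag and ioctls are the stock KVM eventfd API:

#include <stdint.h>
#include <string.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Sketch only: error handling omitted; vm_fd and gsi come from the VMM. */
static int setup_irqfd(int vm_fd, uint32_t gsi)
{
	/* VFIO is told to signal trigger_fd when the physical IRQ fires. */
	int trigger_fd = eventfd(0, EFD_CLOEXEC);
	/* KVM signals resample_fd on guest EOI so VFIO can unmask again. */
	int resample_fd = eventfd(0, EFD_CLOEXEC);
	struct kvm_irqfd irqfd;

	memset(&irqfd, 0, sizeof(irqfd));
	irqfd.fd = trigger_fd;
	irqfd.gsi = gsi;
	irqfd.flags = KVM_IRQFD_FLAG_RESAMPLE;
	irqfd.resamplefd = resample_fd;

	/*
	 * This is issued from QEMU's machine-init-done notifier, i.e.
	 * potentially before the vgic is ready.
	 */
	return ioctl(vm_fd, KVM_IRQFD, &irqfd);
}

If this succeeds before the first vcpu run, a physical IRQ that is already
asserted can reach kvm_set_irq() while vgic_initialized() is still false,
which is exactly the window the -EAGAIN branch in the patch tries to handle.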
>>>>
>>>> I see, it's because of the quirky way we initialize the vgic at the
>>>> time of running the first CPU, so user space can't really hook into
>>>> the sweet spot after initializing the vgic but just before actually
>>>> running guest code.
>>>
>>> Yes, currently the generic irqfd code has no way to test whether the
>>> virtual interrupt controller is "ready" (KVM_IRQFD -> kvm_irqfd_assign)
>>> because I guess other archs don't have the problem.
>>>>
>>>> Could it be that we simply need to let user space init the vgic after
>>>> creating all its vcpus and only then allow setting up the irqfd?
>>>
>>> I start VFIO signaling and set up KVM_IRQFD in a QEMU machine init done
>>> notifier. The vgic init must thus happen before that point.
>>
>> have all the vcpus that QEMU wants to create been created by then?

Hi Christoffer,

> Hi Christoffer,
>
> My understanding of the QEMU start sequence is as follows:
> at machine init, the CPU realize function is called for each CPU
> (target-arm/cpu.c arm_cpu_realizefn)
>   qemu_init_vcpu (cpus.c)
>     qemu_kvm_start_vcpu (cpus.c)
>       launch VCPU thread (qemu_kvm_cpu_thread_fn)
>         kvm_init_vcpu
> kvm_init_vcpu does the while (1) loop issuing the KVM_RUN ioctl if
> *cpu_can_run*
>
> cpu_can_run returns true when vm_start is called
> (resume_all_vcpus -> cpu_resume)
>
> QEMU machine init done or reset callbacks happen after machine init
> and before vm_start.
>
> The actual vgic "readiness" is set on the first vcpu run, i.e. on
> vm_start.
>
> vgic instantiation in the virt machine file is done after CPU
> instantiation. We could force the vgic init at that time, in the gic
> realize fn (kvm_arm_gic_realize)? All vcpus are instantiated: nr_vcpus
> and nr_irqs are known. Also base addresses are known.
>
>>
>>>
>>> Currently the VGIC KVM device has groups/attributes that allow setting
>>> registers (vgic_attr_regs_access) and *partially* initializing the
>>> state (vgic_init_maps). Enhancing that, we could indeed force the vgic
>>> init earlier.
>>>
>>
>> We would probably add a new attribute to the vgic device api if we were
>> to go down that road.
>
> yes
>
>>
>>>>
>>>> Alternatively we can refactor the whole thing so that we don't mess
>>>> with the pending state etc. directly in the vgic_update_irq function,
>>>> but just set the level state (or remember that there was an edge,
>>>> hummm, maybe not) and later propagate this to the vcpus in
>>>> compute_pending_for_cpu().
>>>
>>> yes, we could indeed record the level info very early and not do
>>> anything with it until the vgic is ready. This would oblige us to init
>>> the level bitmap in yet another, earlier stage.
>>
>> there's an unsolved problem that you may not have created all your data
>> structures or know how many IRQs you have at this point, right?
>
> yes that's correct, at that time we do not know the nr_irqs.
>
>>
>>>>
>>>> What you're doing here is to continuously ack and re-request the irq
>>>> from the vfio driver until you are ready to receive it, is that right?
>>>
>>> yes, I unmask the IRQ so that it can hit again (it was deactivated by
>>> the host and masked by the VFIO platform driver). If I don't call the
>>> notifier it will remain masked and never hit again.
>>>
>>
>> Don't you simply lose the interrupt for edge-triggered interrupts then?
>
> yes we do. Even for level-sensitive, we lose them!
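Coming back to the "new attribute to the vgic device api" idea a few
paragraphs up: from userspace, such a forced early init could look roughly
like the sketch below. struct kvm_device_attr and KVM_SET_DEVICE_ATTR are
the existing KVM device-API plumbing; the group/attribute names and values
are hypothetical placeholders for whatever the new control would be called:

#include <sys/ioctl.h>
#include <linux/kvm.h>

#define VGIC_GRP_CTRL	4	/* hypothetical control group */
#define VGIC_CTRL_INIT	0	/* hypothetical "init now" attribute */

static int vgic_force_init(int vgic_dev_fd)
{
	struct kvm_device_attr attr = {
		.group = VGIC_GRP_CTRL,
		.attr  = VGIC_CTRL_INIT,
		/* no payload: the attribute itself is the command */
	};

	/*
	 * QEMU would call this once all vcpus exist and the distributor/
	 * cpu interface base addresses are set, but before any KVM_IRQFD
	 * setup, so kvm_set_irq() never sees an uninitialized vgic.
	 */
	return ioctl(vgic_dev_fd, KVM_SET_DEVICE_ATTR, &attr);
}

On the kernel side, the handler for such an attribute would simply run the
same init that currently happens lazily on the first vcpu run.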
>
> But I think the core of the problem is why those IRQs do hit! And I now
> realize I did not have a good representation of what happens on the
> ctrl-C of QEMU. The xgmac driver remove function is not called at all,
> I guess (explaining why hacking it to clear and disable the IRQ did not
> change anything :-( /me stupid). So the HW is left free-running. It
> might happen that an XGMAC IRQ was high at that time. When the VFIO
> driver is released, it does not change the state of the XGMAC HW, but
> this calls free_irq on the physical IRQ. When QEMU reloads the VFIO
> driver the second time and calls request_irq, it re-enables the IRQ at
> the VGIC and the XGMAC IRQ hits.
>
> The problem we observe relates to the lack of a proper reset of the
> device - an old discussion we had with Alex in Feb!
>
> Those IRQs relate to the HW state of the last QEMU session. It may be
> acceptable to ignore them until the guest duly resets the device.
> Handling the IRQ in the new QEMU session may be wrong as well. Some
> devices might not understand why they get a functional IRQ while they
> are not even initialized!

Note that this is also what happens in my implementation, since as soon as
the guest driver does request_irq it might get an unexpected IRQ. It works
on the xgmac, but are we sure it works on another device?

> The other alternative is to implement some device dependent reset on
> the host kernel side, some kind of kernel hook in the vfio driver, but
> that is another rough path ...

Another solution would be to advise the usage of a host-side user driver
just to clean the state of the HW device in such a case. Laziness?

Eric

>
> Best Regards
>
> Eric
>
>>
>> -Christoffer
>>
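For completeness, the "device dependent reset on the host kernel side"
alternative could be sketched as a per-compatible reset hook in the vfio
platform driver. Everything below - the structure, the table, the register
offsets - is invented for illustration; the point is only the shape such a
hook could take:

#include <linux/io.h>

#define XGMAC_DMA_INTR_ENA	0x101c	/* made-up register offsets */
#define XGMAC_DMA_STATUS	0x1014

/* Hypothetical per-device reset hook, matched by DT compatible string. */
struct vfio_platform_reset_node {
	const char *compat;			/* e.g. "calxeda,hb-xgmac" */
	int (*reset)(void __iomem *base);	/* quiesce + clear IRQs */
};

static int xgmac_reset(void __iomem *base)
{
	/* mask all interrupt sources and ack anything still pending */
	writel(0, base + XGMAC_DMA_INTR_ENA);
	writel(~0U, base + XGMAC_DMA_STATUS);
	return 0;
}

static const struct vfio_platform_reset_node reset_table[] = {
	{ .compat = "calxeda,hb-xgmac", .reset = xgmac_reset },
};

The vfio platform driver would walk reset_table on device open and release,
so a stale IRQ left over from a previous QEMU session could never fire into
a freshly started one.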