Gleb Natapov wrote: > On Thu, May 14, 2009 at 10:34:11PM +0800, Dong, Eddie wrote: >> Gleb Natapov wrote: >>> On Thu, May 14, 2009 at 09:43:33PM +0800, Dong, Eddie wrote: >>>> Avi Kivity wrote: >>>>> Dong, Eddie wrote: >>>>>> OK. >>>>>> Also back to Gleb's question, the reason I want to do that is to >>>>>> simplify event generation mechanism in current KVM. >>>>>> >>>>>> Today KVM use additional layer of exception/nmi/interrupt such as >>>>>> vcpu.arch.exception.pending, vcpu->arch.interrupt.pending & >>>>>> vcpu->arch.nmi_injected. All those additional layer is due to >>>>>> compete of VM_ENTRY_INTR_INFO_FIELD >>>>>> write to inject the event. Both SVM & VMX has only one resource >>>>>> to inject the virtual event but KVM generates 3 catagory of >>>>>> events in parallel which further requires additional >>>>>> logic to dictate among them. >>>>> >>>>> I thought of using a queue to hold all pending events (in a common >>>>> format), sort it by priority, and inject the head. >>>> >>>> The SDM Table 5-4 requires to merge 2 events together, i.e. >>>> convert to #DF/ Triple fault or inject serially when 2 events >>>> happens no matter NMI, IRQ or exception. >>>> >>>> As if considering above events merging activity, that is a single >>>> element queue. >>> I don't know how you got to this conclusion from you previous >>> statement. See explanation to table 5-2 for instate where it is >>> stated that interrupt should be held pending if there is exception >>> with higher priority. Should be held pending where? In the queue, >>> like we do. Note that low prio exceptions are just dropped since >>> they will be regenerated. >> >> I have different understanding here. >> My understanding is that "held" means NO INTA in HW, i.e. LAPIC >> still hold this IRQ. >> > And what if INTA already happened and CPU is ready to fetch IDT for > interrupt vector and at this very moment CPU faults? If INTA happens, that means it is delivered. If its delivery triggers another exception, that is what Table5-4 handles. My understanding is that it is 2 stage process. Table 5-2 talk about events happening before delivery, so that HW needs to prioritize them. Once a decision is make, the highest one is delivered but then it could trigger another exception when fetching IDT etc. Current execption.pending/interrupt.pending/nmi_injected doesn't match either of above, interrupt/nmi is only for failed event injection, and a strange fixed priority check when it is really injected: exception > failed NMI > failed IRQ > new NMI > new IRQ. Table 5-2 looks missed in current KVM IMO except a wrong (but minor) exception > NMI > IRQ sequence. > >>> >>>> We could have either: 1) A pure SW "queue" that will be flush to >>>> HW register later (VM_ENTRY_INTR_INFO_FIELD), 2) Direct use HW >>>> register. >>>> >>> We have three event sources 1) exceptions 2) IRQ 3) NMI. We should >>> have queue of three elements sorted by priority. On each entry we >>> should >> >> Table 5-4 alreadys says NMI/IRQ is BENIGN. > Table 5-2 applies here not table 5-4 I think. > >> >>> inject an event with highest priority. And remove it from queue on >>> exit. >> >> The problem is that we have to decide to inject only one of above 3, >> and discard the rest. Whether priority them or merge (to one event >> as Table 5-4) is another story. > Only a small number of event are merged into #DF. Most handled > serially (SDM does not define what serially means unfortunately), so > I don't understand where "discard the rest" is come from. We can vmx_complete_interrupts clear all of them at next EXIT. Even from HW point of view, if there are pending NMI/IRQ/exception, CPU pick highest one, NMI, ignore/discard IRQ (but LAPIC still holds IRQ, thus it can be re-injected), completely discard exception. I don't say discarding has any problem, but unnecessary to keep all of 3. the only difference is when to discard the rest 2, at queue_exception/irq/nmi time or later on (even at next EXIT time), which is same to me. > discard exception since it will be regenerated anyway, but IRQ and > NMI is another story. SDM says that IRQ should be held pending (once > again not much explanation here), nothing about NMI. > >>> >>>> >>>> A potential benefit is that it can avoid duplicated code and >>>> potential bugs in current code as following patch shows if I >>>> understand correctly: >>>> >>>> --- a/arch/x86/kvm/vmx.c >>>> +++ b/arch/x86/kvm/vmx.c >>>> @@ -2599,7 +2599,7 @@ static int handle_exception(struct kvm_vcpu >>>> *vcpu, struct kvm_run *kvm_run) cr2 = >>>> vmcs_readl(EXIT_QUALIFICATION); >>>> KVMTRACE_3D(PAGE_FAULT, vcpu, >>>> error_code, (u32)cr2, (u32)((u64)cr2 >> 32), handler); - >>>> if (vcpu->arch.interrupt.pending || vcpu->arch.exception.pending ) >>>> + if (vcpu->arch.interrupt.pending || >>>> vcpu->arch.exception.pending || >>>> vcpu->arch.nmi_injected) kvm_mmu_unprotect_page_virt(vcpu, >>>> cr2); return kvm_mmu_page_fault(vcpu, cr2, error_code); } >>> This fix is already in Avi's tree (not yet pushed). >>> >>>> Either way are OK and up to you. BTW Xen uses HW register directly >>>> to representing an pending event. >>>> >>> In this particular case I don't mind to use HW register either, but >>> I don't see any advantage. >>> >>>>> >>>>>> One example is that exception has higher priority >>>>>> than NMI/IRQ injection in current code which is not true in >>>>>> reality. >>>>>> >>>>> >>>>> I don't think it matters in practice, since the guest will see it >>>>> as a timing issue. NMIs and IRQs are asynchronous (even those >>>>> generated by the guest through the local APIC). >>>> >>>> Yes. But also cause IRQ injection be delayed which may have side >>>> effect. For example if guest exception handler is very longer or >>>> if guest VCPU fall into recursive #GP. Within current logic, a >>>> guest IRQ event from KDB (IPI) running on VCPU0, as an example, >>>> can't force the dead loop VCPU1 into KDB since it is recursively >>>> #GP. >>> If one #GP causes another #GP this is a #DF. If CPU has a chance to >> >> Means another #GP in next instruction i.e. Beginning of #GP handler >> in guest. >> No #DF here. >> > In this case we will enter guest with "NMI windows open" request and > should exit immediately before first instruction of #GP handler. At > this moment KVM will be able to inject NMI. If the HW NMI windows is supported, it is fine, how about SW NMI case? The flow will then look like: Guest #GP instruction -> VM Exit -> Inject virtual #GP -> VMRESUME -> try to execute 1st ins of guest #GP handler -> VM Exit again (#GP) -> inject virtual #GP -> ..... > >>> executes >>> something in between KVM will have a chance to inject NMI. >> >> Could have no chance in some cases though not very common. >> >>> >>>> >>>>> >>>>>> Another issue is that an failed event from previous injection say >>>>>> IRQ or NMI may be discarded if an virtual exception happens in >>>>>> the EXIT handling now. With the patch of generic double fault >>>>>> handling, this case should be handled as normally. >>>>>> >>>>> >>>>> Discarding an exception is usually okay as it will be regenerated. >>>>> I don't think we discard interrupts or NMIs. >>>> In reality (Running OS in guest), it doesn't happen so far. But >>>> architecturally, it could. For example KVM injects an IRQ, but VM >>>> Resume get #PF and back to KVM with IDT_VECTORING valid. Then KVM >>>> will put back the failed IRQ to interrupt queue. But if #PF >>>> handling generates another exception, then the interrupt queue >>>> won't be able to be injected, since KVM inject exception first. >>>> And the interrupt queue is discarded at next VM Exit. >>>> >>> I acknowledge the presence of the bug although I was not able to >>> write a test case to cause it yet, but it is easy to fix this >>> without changing code too much. Unified event queue and clearing of >>> only injected event on exit should do the trick. >> >> Yes, minor. >> >> Eddie -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html