Vivek Goyal <vgoyal@xxxxxxxxxx> writes: > On Wed, Apr 08, 2020 at 08:01:36PM +0200, Thomas Gleixner wrote: >> Forget the current pf async semantics (or the lack of). You really want >> to start from scratch and igore the whole thing. >> >> The charm of #VE is that the hardware can inject it and it's not nesting >> until the guest cleared the second word in the VE information area. If >> that word is not 0 then you get a regular vmexit where you suspend the >> vcpu until the nested problem is solved. > > So IIUC, only one process on a vcpu could affort to relinquish cpu to > another task. If next task also triggers EPT violation, that will result > in VM exit (as previous #VE is not complete yet) and vcpu will be > halted. No. The guest #VE handler reads VE information, stores it into a PV page which is used to communicate with the host, invokes the hypercall and clears word1 so a consecutive #VE can be handled. If the hypercall is telling the host to suspend the guest (e.g. because the exception hit an interrupt or preemption disabled region where scheduling is not possible) then the host suspends the vCPU until the EPT issue is fixed. Before that hypercall returns the state of the recursion prevention word is irrelevant as the vCPU is not running, so it's fine to clear it afterwards. If the hypercall is telling the host that it can schedule and do something else, then the word is cleared after the hypercall returns. This makes sure that the host has gathered the information before another VE can be injected. TBH, I really was positively surprised that this HW mechanism actually makes sense. It prevents the following situation: 1) Guest triggers a EPT fail 2) HW injects #VE sets VE info 3) Guest handles #VE and before being able to gather the VE info data it triggers another EPT fail 4) HW injects #VE sets VE info -> FAIL So if we clear the reentrancy protection after saving the info in the PV page and after the hypercall returns, it's guaranteed that the above results in: 1) Guest triggers a EPT fail 2) HW injects #VE sets VE info 3) Guest handles #VE and before being able to gather the VE info data and clearing the reentrancy protection it triggers another EPT fail 4) VMEXIT which needs to be handled by suspending the vCPU, solving the issue and resuming it, which allows it to handle the original fail #1 >> So you really don't worry about the guest CPU state at all. The guest >> side #VE handler has to decide what it wants from the host depending on >> it's internal state: >> >> - Suspend me and resume once the EPT fail is solved >> >> - Let me park the failing task and tell me once you resolved the >> problem. >> >> That's pretty straight forward and avoids the whole nonsense which the >> current mess contains. It completely avoids the allocation stuff as well >> as you need to use a PV page where the guest copies the VE information >> to. >> >> The notification that a problem has been resolved needs to go through a >> separate vector which still has the IF=1 requirement obviously. > > How is this vector decided between guest and host. That's either a fixed vector which then becomes ABI or the guest tells the host via some PV/hypercall interface that it can handle #VE and that the vector for notification is number X. > Failure to fault in page will be communicated through same > vector? The PV page which communicates the current and eventually pending EPT fails to the host is also the communication mechanism for the outcome. Lets assume that the PV page contains an array of guest/host communication data structures: struct ve_info { struct ve_hw_info hw_info; unsigned long state; struct rcu_wait wait; ); The PV page looks like this: struct ve_page { struct ve_info info[N_ENTRIES]; unsigned int guest_current; unsigned int host_current; unsigned long active[BITS_TO_LONGS(N_ENTRIES)]; }; The #VE handler does: struct ve_page *vp = this_cpu_ptr(&ve_page); struct ve_info *info; bool can_continue; idx = find_first_zero_bit(vp->active, N_ENTRIES); BUG_ON(idx >= N_ENTRIES); set_bit(idx, vp->active); info = vp->info + idx; copy_ve_info(&info->hw_info, ve_hwinfo); vp->guest_current = idx; if (test_and_set_thread_flag(TIF_IN_VE) || bitmap_full(vp->active, N_ENTRIES)) can_continue = false; else can_continue = user_mode(regs) || preemptible(); if (can_continue) { info->state = CAN_CONTINUE; hypercall_ve(); ve_hwinfo.reentrancy_protect = 0; rcuwait_wait_event(&info->wait, info->state != CAN_CONTINUE, TASK_IDLE); } else { info->state = SUSPEND_ME; hypercall_ve(); ve_hwinfo.reentrancy_protect = 0; } state = info->state; info->state = NONE; clear_bit(idx, vp->active); switch (state) { case RESOLVED: break; case FAIL: if (user_mode(regs)) send_signal(.....); else if (!fixup_exception()) panic("I'm toast"); break; default: panic("Confused"); } clear_thread_flag(TIF_IN_VE); Now the host completion does: struct ve_page *vp = vcpu->ve_page; struct ve_info *info = vp->info + idx; state = info->state; info->state = resolved_state; switch (state) { case SUSPEND_ME: resume_vcpu(vcpu); break; case CAN_CONTINUE: queue_completion(vcpu, idx); break; default: kill_guest(); } and the host completion injection which handles the queued completion when guest IF=0 does: struct ve_page *vp = vcpu->ve_page; vp->host_current = idx; inject_ve_complete(vcpu); The guest completion does: struct ve_page *vp = this_cpu_ptr(&ve_page); struct ve_info *info; info = vp->info + vp->host_current; rcuwait_wake_up(&info->wait); There are a bunch of life time issues to think about (not rocket science), but you get the idea. Thanks, tglx