On Fri, May 26, 2017 at 09:31:00AM +0200, Christoffer Dall wrote: > Hi Drew, > > On Tue, May 16, 2017 at 04:20:27AM +0200, Andrew Jones wrote: > > Signed-off-by: Andrew Jones <drjones@xxxxxxxxxx> > > --- > > Documentation/virtual/kvm/vcpu-requests.rst | 299 ++++++++++++++++++++++++++++ > > 1 file changed, 299 insertions(+) > > create mode 100644 Documentation/virtual/kvm/vcpu-requests.rst > > > > diff --git a/Documentation/virtual/kvm/vcpu-requests.rst b/Documentation/virtual/kvm/vcpu-requests.rst > > new file mode 100644 > > index 000000000000..7a2b4c05c267 > > --- /dev/null > > +++ b/Documentation/virtual/kvm/vcpu-requests.rst > > @@ -0,0 +1,299 @@ > > +================= > > +KVM VCPU Requests > > +================= > > + > > +Overview > > +======== > > + > > +KVM supports an internal API enabling threads to request a VCPU thread to > > +perform some activity. For example, a thread may request a VCPU to flush > > +its TLB with a VCPU request. The API consists of the following functions:: > > + > > + /* Check if any requests are pending for VCPU @vcpu. */ > > + bool kvm_request_pending(struct kvm_vcpu *vcpu); > > + > > + /* Check if VCPU @vcpu has request @req pending. */ > > + bool kvm_test_request(int req, struct kvm_vcpu *vcpu); > > + > > + /* Clear request @req for VCPU @vcpu. */ > > + void kvm_clear_request(int req, struct kvm_vcpu *vcpu); > > + > > + /* > > + * Check if VCPU @vcpu has request @req pending. When the request is > > + * pending it will be cleared and a memory barrier, which pairs with > > + * another in kvm_make_request(), will be issued. > > + */ > > + bool kvm_check_request(int req, struct kvm_vcpu *vcpu); > > + > > + /* > > + * Make request @req of VCPU @vcpu. Issues a memory barrier, which pairs > > + * with another in kvm_check_request(), prior to setting the request. > > + */ > > + void kvm_make_request(int req, struct kvm_vcpu *vcpu); > > + > > + /* Make request @req of all VCPUs of the VM with struct kvm @kvm. */ > > + bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req); > > + > > +Typically a requester wants the VCPU to perform the activity as soon > > +as possible after making the request. This means most requests > > +(kvm_make_request() calls) are followed by a call to kvm_vcpu_kick(), > > +and kvm_make_all_cpus_request() has the kicking of all VCPUs built > > +into it. > > + > > +VCPU Kicks > > +---------- > > + > > +The goal of a VCPU kick is to bring a VCPU thread out of guest mode in > > +order to perform some KVM maintenance. To do so, an IPI is sent, forcing > > +a guest mode exit. However, a VCPU thread may not be in guest mode at the > > +time of the kick. Therefore, depending on the mode and state of the VCPU > > +thread, there are two other actions a kick may take. All three actions > > +are listed below: > > + > > +1) Send an IPI. This forces a guest mode exit. > > +2) Waking a sleeping VCPU. Sleeping VCPUs are VCPU threads outside guest > > + mode that wait on waitqueues. Waking them removes the threads from > > + the waitqueues, allowing the threads to run again. This behavior > > + may be suppressed, see KVM_REQUEST_NO_WAKEUP below. > > +3) Nothing. When the VCPU is not in guest mode and the VCPU thread is not > > + sleeping, then there is nothing to do. > > + > > +VCPU Mode > > +--------- > > + > > +VCPUs have a mode state, ``vcpu->mode``, that is used to track whether the > > +guest is running in guest mode or not, as well as some specific > > +outside guest mode states. The architecture may use ``vcpu->mode`` to > > +ensure VCPU requests are seen by VCPUs (see "Ensuring Requests Are Seen"), > > +as well as to avoid sending unnecessary IPIs (see "IPI Reduction"), and > > +even to ensure IPI acknowledgements are waited upon (see "Waiting for > > +Acknowledgements"). The following modes are defined: > > + > > +OUTSIDE_GUEST_MODE > > + > > + The VCPU thread is outside guest mode. > > + > > +IN_GUEST_MODE > > + > > + The VCPU thread is in guest mode. > > + > > +EXITING_GUEST_MODE > > + > > + The VCPU thread is transitioning from IN_GUEST_MODE to > > + OUTSIDE_GUEST_MODE. > > + > > +READING_SHADOW_PAGE_TABLES > > + > > + The VCPU thread is outside guest mode, but it wants the sender of > > + certain VCPU requests, namely KVM_REQ_TLB_FLUSH, to wait until the VCPU > > + thread is done reading the page tables. > > + > > +VCPU Request Internals > > +====================== > > + > > +VCPU requests are simply bit indices of the ``vcpu->requests`` bitmap. > > +This means general bitops, like those documented in [atomic-ops]_ could > > +also be used, e.g. :: > > + > > + clear_bit(KVM_REQ_UNHALT & KVM_REQUEST_MASK, &vcpu->requests); > > + > > +However, VCPU request users should refrain from doing so, as it would > > +break the abstraction. The first 8 bits are reserved for architecture > > +independent requests, all additional bits are available for architecture > > +dependent requests. > > + > > +Architecture Independent Requests > > +--------------------------------- > > + > > +KVM_REQ_TLB_FLUSH > > + > > + KVM's common MMU notifier may need to flush all of a guest's TLB > > + entries, calling kvm_flush_remote_tlbs() to do so. Architectures that > > + choose to use the common kvm_flush_remote_tlbs() implementation will > > + need to handle this VCPU request. > > + > > +KVM_REQ_MMU_RELOAD > > + > > + When shadow page tables are used and memory slots are removed it's > > + necessary to inform each VCPU to completely refresh the tables. This > > + request is used for that. > > + > > +KVM_REQ_PENDING_TIMER > > + > > + This request may be made from a timer handler run on the host on behalf > > + of a VCPU. It informs the VCPU thread to inject a timer interrupt. > > + > > +KVM_REQ_UNHALT > > + > > + This request may be made from the KVM common function kvm_vcpu_block(), > > + which is used to emulate an instruction that causes a CPU to halt until > > + one of an architectural specific set of events and/or interrupts is > > + received (determined by checking kvm_arch_vcpu_runnable()). When that > > + event or interrupt arrives kvm_vcpu_block() makes the request. This is > > + in contrast to when kvm_vcpu_block() returns due to any other reason, > > + such as a pending signal, which does not indicate the VCPU's halt > > + emulation should stop, and therefore does not make the request. > > + > > +KVM_REQUEST_MASK > > +---------------- > > + > > +VCPU requests should be masked by KVM_REQUEST_MASK before using them with > > +bitops. This is because only the lower 8 bits are used to represent the > > +request's number. The upper bits are used as flags. Currently only two > > +flags are defined. > > + > > +VCPU Request Flags > > +------------------ > > + > > +KVM_REQUEST_NO_WAKEUP > > + > > + This flag is applied to a request that does not need immediate > > + attention. When a request does not need immediate attention, and the > > Isn't this an over-simplification? I thought KVM_REQUEST_NO_WAKEUP was > only used in cases where you simply want to make sure the VCPU is not > ignoring a request while exexuting in the guest, but if it's sleeping > and therefore not in the guest, it can just handle the request whenever > it wakes up anyway. So perhaps something like: > > This flag is applied to a request that only needs immediate attention > from VCPUs running in the guest. That is, sleeping VCPUs do not need > to be removed from their waitqueues but can handle this request when > they wake up for any other reason. I like this paragraph. I'll use it in v5. > > > + VCPU's thread is outside guest mode sleeping, then the thread is not > > + awaken by a kick. > > + > > +KVM_REQUEST_WAIT > > + > > + When requests with this flag are made with kvm_make_all_cpus_request(), > > + then the caller will wait for each VCPU to acknowledge the IPI before > > + proceeding. > > How does the interaction work with KVM_REQUEST_NO_WAKEUP? Does it mean > it only waits on those VCPUs to which is needs to send an IPI, but the > rest are ignored ? I'll discuss interaction with KVM_REQUEST_NO_WAKEUP here and also refer to the "Waiting for Acknowledgements" section below. > > > + > > +VCPU Requests with Associated State > > +=================================== > > + > > +Requesters that want the receiving VCPU to handle new state need to ensure > > +the newly written state is observable to the receiving VCPU thread's CPU > > +by the time it observes the request. This means a write memory barrier > > +must be inserted after writing the new state and before setting the VCPU > > +request bit. Additionally, on the receiving VCPU thread's side, a > > +corresponding read barrier must be inserted after reading the request bit > > +and before proceeding to read the new state associated with it. See > > +scenario 3, Message and Flag, of [lwn-mb]_ and the kernel documentation > > +[memory-barriers]_. > > + > > +The pair of functions, kvm_check_request() and kvm_make_request(), provide > > +the memory barriers, allowing this requirement to be handled internally by > > +the API. > > + > > +Ensuring Requests Are Seen > > +========================== > > + > > +When making requests to VCPUs, we want to avoid the receiving VCPU > > +executing in guest mode for an arbitrary long time without handling the > > +request. We can be sure this won't happen as long as we ensure the VCPU > > +thread checks kvm_request_pending() before entering guest mode and that a > > +kick will send an IPI when necessary. Extra care must be taken to cover > > ^ to force an exit from the guest > > > +the period after the VCPU thread's last kvm_request_pending() check and > > +before it has entered guest mode, as kick IPIs will only trigger VCPU run > > +loops for VCPU threads that are in guest mode or at least have already > > IPIs trigger VCPU run loops? I think you mean that IPIs triggers a > return from guest mode which eventually results in running another > iteration of the run loop? yeah, that's what I mean. I'll clarify it. > > > +disabled interrupts in order to prepare to enter guest mode. This means > > +that an optimized implementation (see "IPI Reduction") must be certain > > +when it's safe to not send the IPI. One solution, which all architectures > > +except s390 apply, is to: > > + > > +- set ``vcpu->mode`` to IN_GUEST_MODE between disabling the interrupts and > > + the last kvm_request_pending() check; > > +- enable interrupts atomically when entering the guest. > > + > > +This solution also requires memory barriers to be placed carefully in both > > +the requesting thread and the receiving VCPU. With the memory barriers we > > +can exclude the possibility of a VCPU thread observing > > +!kvm_request_pending() on its last check and then not receiving an IPI for > > +the next request made of it, even if the request is made immediately after > > +the check. This is done by way of the Dekker memory barrier pattern > > +(scenario 10 of [lwn-mb]_). As the Dekker pattern requires two variables, > > +this solution pairs ``vcpu->mode`` with ``vcpu->requests``. Substituting > > +them into the pattern gives:: > > + > > + CPU1 CPU2 > > + ================= ================= > > + local_irq_disable(); > > + WRITE_ONCE(vcpu->mode, IN_GUEST_MODE); kvm_make_request(REQ, vcpu); > > + smp_mb(); smp_mb(); > > + if (kvm_request_pending(vcpu)) { if (READ_ONCE(vcpu->mode) == > > + IN_GUEST_MODE) { > > + ...abort guest entry... ...send IPI... > > + } } > > + > > +As stated above, the IPI is only useful for VCPU threads in guest mode or > > +that have already disabled interrupts. This is why this specific case of > > +the Dekker pattern has been extended to disable interrupts before setting > > +``vcpu->mode`` to IN_GUEST_MODE. WRITE_ONCE() and READ_ONCE() are used to > > +pedantically implement the memory barrier pattern, guaranteeing the > > +compiler doesn't interfere with ``vcpu->mode``'s carefully planned > > +accesses. > > + > > +IPI Reduction > > +------------- > > + > > +As only one IPI is needed to get a VCPU to check for any/all requests, > > +then they may be coalesced. This is easily done by having the first IPI > > +sending kick also change the VCPU mode to something !IN_GUEST_MODE. The > > +transitional state, EXITING_GUEST_MODE, is used for this purpose. > > + > > +Waiting for Acknowledgements > > +---------------------------- > > + > > +Some requests, those with the KVM_REQUEST_WAIT flag set, require IPIs to > > +be sent, and the acknowledgements to be waited upon, even when the target > > +VCPU threads are in modes other than IN_GUEST_MODE. For example, one case > > +is when a target VCPU thread is in READING_SHADOW_PAGE_TABLES mode, which > > +is set after disabling interrupts. For these cases, the "should send an > > +IPI" condition becomes READ_ONCE(``vcpu->mode``) != OUTSIDE_GUEST_MODE. > > Hmm, did you mean, "To support these cases, the condition for sending an > IPI checks not to be equal to IN_GUEST_MODE, but different from > OUTSIDE_GUEST_MODE.". Yup, I'll clarify it. > > (The confusion is whether we check different things depending on which > request we're dealing with, or just explaining why the single > implementation is done the way it is.) Checking different things depends on the request type (requests flagged with KVM_REQUEST_WAIT). I'll clarify this. > > > + > > +Request-less VCPU Kicks > > +----------------------- > > + > > +As the determination of whether or not to send an IPI depends on the > > +two-variable Dekker memory barrier pattern, then it's clear that > > +request-less VCPU kicks are almost never correct. Without the assurance > > +that a non-IPI generating kick will still result in an action by the > > +receiving VCPU, as the final kvm_request_pending() check does for > > +request-accompanying kicks, then the kick may not do anything useful at > > +all. If, for instance, a request-less kick was made to a VCPU that was > > +just about to set its mode to IN_GUEST_MODE, meaning no IPI is sent, then > > +the VCPU thread may continue its entry without actually having done > > +whatever it was the kick was meant to initiate. > > + > > +One exception is x86's posted interrupt mechanism. In this case, however, > > +even the request-less VCPU kick is coupled with the same > > +local_irq_disable() + smp_mb() pattern described above; the ON bit > > +(Outstanding Notification) in the posted interrupt descriptor takes the > > +role of ``vcpu->requests``. When sending a posted interrupt, PIR.ON is > > +set before reading ``vcpu->mode``; dually, in the VCPU thread, > > +vmx_sync_pir_to_irr() reads PIR after setting ``vcpu->mode`` to > > +IN_GUEST_MODE. > > + > > +Additional Considerations > > +========================= > > + > > +Sleeping VCPUs > > +-------------- > > + > > +VCPU threads may need to consider requests before and/or after calling > > +functions that may put them to sleep, e.g. kvm_vcpu_block(). Whether they > > +do or not, and, if they do, which requests need consideration, is > > +architecture dependent. kvm_vcpu_block() calls kvm_arch_vcpu_runnable() > > +to check if it should awaken. One reason to do so is to provide > > +architectures a function where requests may be checked if necessary. > > + > > +Clearing Requests > > +----------------- > > + > > +Generally it only makes sense for the receiving VCPU thread to clear a > > +request. However, in some circumstances, such as when the requesting > > +thread is executing synchronously with the receiving VCPU thread, it's > > what does it mean that the requesting thread is executing synchronously > with the receiving VCPU thread? How can that ever be enforced on SMP > system, if the two threads are not the same thread? > > If we simply mean that some sequence of operations between the two > threads is synchronized (for example using locks), then that's a > different (less general) case from the two thread overall executing > synchronously, I think? I meant with locks - not a general synchronization, but a temporarily enforced one. I'll try to clarify that language. > > > +possible to know that the request may be cleared immediately, rather than > > +waiting for the receiving VCPU thread to handle the request in VCPU RUN. > > +The only current examples of this are kvm_vcpu_block() calls, where a > > +side-effect of a call may be to set KVM_REQ_UNHALT. When the requesting > > +thread is itself the receiving VCPU, > > when the requsting thread and the VCPU thread are the same thread ? I use your language here as it's clearer. > > > then it's possible to know that the > > +request does not need to be handled in VCPU RUN, and therefore may be > > +cleared immediately. > > + > > +References > > +========== > > + > > +.. [atomic-ops] Documentation/core-api/atomic_ops.rst > > +.. [memory-barriers] Documentation/memory-barriers.txt > > +.. [lwn-mb] https://lwn.net/Articles/573436/ > > -- > > 2.9.3 > > > > This write up is excellent. Thanks! > > My comments are mostly for clarification, so: > > Acked-by: Christoffer Dall <cdall@xxxxxxxxxx> > > > Thanks, > -Christoffer drew