Re: [PATCH v4 03/11] KVM: Add documentation for VCPU requests

Andrew Jones <drjones@xxxxxxxxxx> · Fri, 26 May 2017 11:43:39 +0200

On Fri, May 26, 2017 at 09:31:00AM +0200, Christoffer Dall wrote:
> Hi Drew,
> 
> On Tue, May 16, 2017 at 04:20:27AM +0200, Andrew Jones wrote:
> > Signed-off-by: Andrew Jones <drjones@xxxxxxxxxx>
> > ---
> >  Documentation/virtual/kvm/vcpu-requests.rst | 299 ++++++++++++++++++++++++++++
> >  1 file changed, 299 insertions(+)
> >  create mode 100644 Documentation/virtual/kvm/vcpu-requests.rst
> > 
> > diff --git a/Documentation/virtual/kvm/vcpu-requests.rst b/Documentation/virtual/kvm/vcpu-requests.rst
> > new file mode 100644
> > index 000000000000..7a2b4c05c267
> > --- /dev/null
> > +++ b/Documentation/virtual/kvm/vcpu-requests.rst
> > @@ -0,0 +1,299 @@
> > +=================
> > +KVM VCPU Requests
> > +=================
> > +
> > +Overview
> > +========
> > +
> > +KVM supports an internal API enabling threads to request a VCPU thread to
> > +perform some activity.  For example, a thread may request a VCPU to flush
> > +its TLB with a VCPU request.  The API consists of the following functions::
> > +
> > +  /* Check if any requests are pending for VCPU @vcpu. */
> > +  bool kvm_request_pending(struct kvm_vcpu *vcpu);
> > +
> > +  /* Check if VCPU @vcpu has request @req pending. */
> > +  bool kvm_test_request(int req, struct kvm_vcpu *vcpu);
> > +
> > +  /* Clear request @req for VCPU @vcpu. */
> > +  void kvm_clear_request(int req, struct kvm_vcpu *vcpu);
> > +
> > +  /*
> > +   * Check if VCPU @vcpu has request @req pending. When the request is
> > +   * pending it will be cleared and a memory barrier, which pairs with
> > +   * another in kvm_make_request(), will be issued.
> > +   */
> > +  bool kvm_check_request(int req, struct kvm_vcpu *vcpu);
> > +
> > +  /*
> > +   * Make request @req of VCPU @vcpu. Issues a memory barrier, which pairs
> > +   * with another in kvm_check_request(), prior to setting the request.
> > +   */
> > +  void kvm_make_request(int req, struct kvm_vcpu *vcpu);
> > +
> > +  /* Make request @req of all VCPUs of the VM with struct kvm @kvm. */
> > +  bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req);
> > +
> > +Typically a requester wants the VCPU to perform the activity as soon
> > +as possible after making the request.  This means most requests
> > +(kvm_make_request() calls) are followed by a call to kvm_vcpu_kick(),
> > +and kvm_make_all_cpus_request() has the kicking of all VCPUs built
> > +into it.
> > +
> > +VCPU Kicks
> > +----------
> > +
> > +The goal of a VCPU kick is to bring a VCPU thread out of guest mode in
> > +order to perform some KVM maintenance.  To do so, an IPI is sent, forcing
> > +a guest mode exit.  However, a VCPU thread may not be in guest mode at the
> > +time of the kick.  Therefore, depending on the mode and state of the VCPU
> > +thread, there are two other actions a kick may take.  All three actions
> > +are listed below:
> > +
> > +1) Send an IPI.  This forces a guest mode exit.
> > +2) Wake a sleeping VCPU.  Sleeping VCPUs are VCPU threads outside guest
> > +   mode that wait on waitqueues.  Waking them removes the threads from
> > +   the waitqueues, allowing the threads to run again.  This behavior
> > +   may be suppressed, see KVM_REQUEST_NO_WAKEUP below.
> > +3) Nothing.  When the VCPU is not in guest mode and the VCPU thread is not
> > +   sleeping, then there is nothing to do.
> > +
> > +VCPU Mode
> > +---------
> > +
> > +VCPUs have a mode state, ``vcpu->mode``, that is used to track whether the
> > +VCPU is running in guest mode or not, as well as some specific
> > +outside guest mode states.  The architecture may use ``vcpu->mode`` to
> > +ensure VCPU requests are seen by VCPUs (see "Ensuring Requests Are Seen"),
> > +as well as to avoid sending unnecessary IPIs (see "IPI Reduction"), and
> > +even to ensure IPI acknowledgements are waited upon (see "Waiting for
> > +Acknowledgements").  The following modes are defined:
> > +
> > +OUTSIDE_GUEST_MODE
> > +
> > +  The VCPU thread is outside guest mode.
> > +
> > +IN_GUEST_MODE
> > +
> > +  The VCPU thread is in guest mode.
> > +
> > +EXITING_GUEST_MODE
> > +
> > +  The VCPU thread is transitioning from IN_GUEST_MODE to
> > +  OUTSIDE_GUEST_MODE.
> > +
> > +READING_SHADOW_PAGE_TABLES
> > +
> > +  The VCPU thread is outside guest mode, but it wants the sender of
> > +  certain VCPU requests, namely KVM_REQ_TLB_FLUSH, to wait until the VCPU
> > +  thread is done reading the page tables.
> > +
> > +VCPU Request Internals
> > +======================
> > +
> > +VCPU requests are simply bit indices of the ``vcpu->requests`` bitmap.
> > +This means general bitops, like those documented in [atomic-ops]_ could
> > +also be used, e.g. ::
> > +
> > +  clear_bit(KVM_REQ_UNHALT & KVM_REQUEST_MASK, &vcpu->requests);
> > +
> > +However, VCPU request users should refrain from doing so, as it would
> > +break the abstraction.  The first 8 bits are reserved for architecture
> > +independent requests, all additional bits are available for architecture
> > +dependent requests.
> > +
> > +Architecture Independent Requests
> > +---------------------------------
> > +
> > +KVM_REQ_TLB_FLUSH
> > +
> > +  KVM's common MMU notifier may need to flush all of a guest's TLB
> > +  entries, calling kvm_flush_remote_tlbs() to do so.  Architectures that
> > +  choose to use the common kvm_flush_remote_tlbs() implementation will
> > +  need to handle this VCPU request.
> > +
> > +KVM_REQ_MMU_RELOAD
> > +
> > +  When shadow page tables are used and memory slots are removed it's
> > +  necessary to inform each VCPU to completely refresh the tables.  This
> > +  request is used for that.
> > +
> > +KVM_REQ_PENDING_TIMER
> > +
> > +  This request may be made from a timer handler run on the host on behalf
> > +  of a VCPU.  It informs the VCPU thread to inject a timer interrupt.
> > +
> > +KVM_REQ_UNHALT
> > +
> > +  This request may be made from the KVM common function kvm_vcpu_block(),
> > +  which is used to emulate an instruction that causes a CPU to halt until
> > +  one of an architecture-specific set of events and/or interrupts is
> > +  received (determined by checking kvm_arch_vcpu_runnable()).  When that
> > +  event or interrupt arrives kvm_vcpu_block() makes the request.  This is
> > +  in contrast to when kvm_vcpu_block() returns due to any other reason,
> > +  such as a pending signal, which does not indicate the VCPU's halt
> > +  emulation should stop, and therefore does not make the request.
> > +
> > +KVM_REQUEST_MASK
> > +----------------
> > +
> > +VCPU requests should be masked by KVM_REQUEST_MASK before using them with
> > +bitops.  This is because only the lower 8 bits are used to represent the
> > +request's number.  The upper bits are used as flags.  Currently only two
> > +flags are defined.
> > +
> > +VCPU Request Flags
> > +------------------
> > +
> > +KVM_REQUEST_NO_WAKEUP
> > +
> > +  This flag is applied to a request that does not need immediate
> > +  attention.  When a request does not need immediate attention, and the
> 
> Isn't this an over-simplification?  I thought KVM_REQUEST_NO_WAKEUP was
> only used in cases where you simply want to make sure the VCPU is not
> ignoring a request while executing in the guest, but if it's sleeping
> and therefore not in the guest, it can just handle the request whenever
> it wakes up anyway.  So perhaps something like:
> 
>    This flag is applied to a request that only needs immediate attention
>    from VCPUs running in the guest.  That is, sleeping VCPUs do not need
>    to be removed from their waitqueues but can handle this request when
>    they wake up for any other reason.

I like this paragraph. I'll use it in v5.
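
To double check that we mean the same thing, here is how I picture the
kick with that wording, as a small user-space model rather than kernel
code.  Everything prefixed with model_/MODEL_ is made up for this mail,
the flag's bit position (bit 8) is only assumed from "the lower 8 bits
are the request's number", and the sleeping argument stands in for "the
VCPU thread is waiting on a waitqueue".

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed layout: request number in the low 8 bits, flags above them. */
#define MODEL_REQUEST_NO_WAKEUP  (1u << 8)

enum model_mode { MODEL_OUTSIDE_GUEST_MODE, MODEL_IN_GUEST_MODE };
enum model_kick { MODEL_KICK_NOTHING, MODEL_KICK_WAKE, MODEL_KICK_IPI };

/* Pick one of the three kick actions for request @req made of a VCPU. */
static enum model_kick model_kick_action(enum model_mode mode, bool sleeping,
					 unsigned int req)
{
	/* A VCPU thread cannot be in guest mode and on a waitqueue at once. */
	assert(!(mode == MODEL_IN_GUEST_MODE && sleeping));

	if (mode == MODEL_IN_GUEST_MODE)
		return MODEL_KICK_IPI;		/* force a guest mode exit */
	if (sleeping && !(req & MODEL_REQUEST_NO_WAKEUP))
		return MODEL_KICK_WAKE;		/* take it off its waitqueue */
	return MODEL_KICK_NOTHING;		/* handled on next wakeup/entry */
}
```

So a VCPU in guest mode always gets the IPI, and the flag only changes
what happens to a sleeping one.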

> 
> > +  VCPU's thread is outside guest mode sleeping, then the thread is not
> > +  awakened by a kick.
> > +
> > +KVM_REQUEST_WAIT
> > +
> > +  When requests with this flag are made with kvm_make_all_cpus_request(),
> > +  then the caller will wait for each VCPU to acknowledge the IPI before
> > +  proceeding.
> 
> How does the interaction work with KVM_REQUEST_NO_WAKEUP?  Does it mean
> it only waits on those VCPUs to which it needs to send an IPI, but the
> rest are ignored?

I'll discuss interaction with KVM_REQUEST_NO_WAKEUP here and also refer
to the "Waiting for Acknowledgements" section below.

> 
> > +
> > +VCPU Requests with Associated State
> > +===================================
> > +
> > +Requesters that want the receiving VCPU to handle new state need to ensure
> > +the newly written state is observable to the receiving VCPU thread's CPU
> > +by the time it observes the request.  This means a write memory barrier
> > +must be inserted after writing the new state and before setting the VCPU
> > +request bit.  Additionally, on the receiving VCPU thread's side, a
> > +corresponding read barrier must be inserted after reading the request bit
> > +and before proceeding to read the new state associated with it.  See
> > +scenario 3, Message and Flag, of [lwn-mb]_ and the kernel documentation
> > +[memory-barriers]_.
> > +
> > +The pair of functions, kvm_check_request() and kvm_make_request(), provide
> > +the memory barriers, allowing this requirement to be handled internally by
> > +the API.
> > +
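
As an aside, the message-and-flag pairing described above can be modelled
in user space with C11 atomics.  This is only a sketch under my own
assumptions: the struct, the new_state field and the model_ helpers are
invented for illustration, and the release/acquire fences merely stand in
for the kernel barriers the real functions issue.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

struct model_vcpu {
	unsigned long new_state;	/* the "message" (hypothetical field) */
	atomic_ulong requests;		/* the "flag": one bit per request */
};

/* Requester: publish the state written so far, then set the request bit. */
static void model_make_request(struct model_vcpu *vcpu, unsigned int bit)
{
	assert(bit < sizeof(unsigned long) * CHAR_BIT);
	atomic_thread_fence(memory_order_release);	/* write barrier */
	atomic_fetch_or_explicit(&vcpu->requests, 1ul << bit,
				 memory_order_relaxed);
}

/* Receiver: if the bit is set, clear it and order later reads after it. */
static bool model_check_request(struct model_vcpu *vcpu, unsigned int bit)
{
	unsigned long mask = 1ul << bit;

	if (!(atomic_load_explicit(&vcpu->requests, memory_order_relaxed) & mask))
		return false;
	atomic_fetch_and_explicit(&vcpu->requests, ~mask, memory_order_relaxed);
	atomic_thread_fence(memory_order_acquire);	/* read barrier */
	return true;
}
```

A requester would write new_state and then call model_make_request(); a
receiver that sees model_check_request() return true may then read
new_state and is guaranteed to see the requester's write.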
> > +Ensuring Requests Are Seen
> > +==========================
> > +
> > +When making requests to VCPUs, we want to avoid the receiving VCPU
> > +executing in guest mode for an arbitrarily long time without handling the
> > +request.  We can be sure this won't happen as long as we ensure the VCPU
> > +thread checks kvm_request_pending() before entering guest mode and that a
> > +kick will send an IPI when necessary.  Extra care must be taken to cover
> 
>                         ^ to force an exit from the guest
> 
> > +the period after the VCPU thread's last kvm_request_pending() check and
> > +before it has entered guest mode, as kick IPIs will only trigger VCPU run
> > +loops for VCPU threads that are in guest mode or at least have already
> 
> IPIs trigger VCPU run loops?  I think you mean that IPIs trigger a
> return from guest mode which eventually results in running another
> iteration of the run loop?

yeah, that's what I mean. I'll clarify it.
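
In case it helps while I reword this, the two sides can be written out as
a user-space model.  It is a sketch, not the kernel code: the model_
names are invented, atomic_thread_fence(memory_order_seq_cst) stands in
for smp_mb(), and interrupt disabling is not modelled at all.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

enum { MODEL_OUTSIDE_GUEST_MODE, MODEL_IN_GUEST_MODE };

struct model_vcpu {
	atomic_int mode;
	atomic_ulong requests;
};

/*
 * VCPU side: announce the guest entry, then do the last pending check.
 * Returns true when the entry may proceed.
 */
static bool model_vcpu_prepare_entry(struct model_vcpu *vcpu)
{
	atomic_store_explicit(&vcpu->mode, MODEL_IN_GUEST_MODE,
			      memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb() */
	if (atomic_load_explicit(&vcpu->requests, memory_order_relaxed)) {
		/* abort guest entry; the run loop handles the request */
		atomic_store(&vcpu->mode, MODEL_OUTSIDE_GUEST_MODE);
		return false;
	}
	return true;
}

/*
 * Requester side: set the request bit, then look at the mode.
 * Returns true when an IPI would have to be sent.
 */
static bool model_make_request_needs_ipi(struct model_vcpu *vcpu,
					 unsigned int bit)
{
	assert(bit < sizeof(unsigned long) * CHAR_BIT);
	atomic_fetch_or_explicit(&vcpu->requests, 1ul << bit,
				 memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb() */
	return atomic_load_explicit(&vcpu->mode, memory_order_relaxed) ==
	       MODEL_IN_GUEST_MODE;
}
```

Whichever side runs first, at least one of the two functions notices the
other: either the VCPU aborts its entry, or the requester decides to send
the IPI that forces the exit and, eventually, another iteration of the
run loop.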

> 
> > +disabled interrupts in order to prepare to enter guest mode.  This means
> > +that an optimized implementation (see "IPI Reduction") must be certain
> > +when it's safe to not send the IPI.  One solution, which all architectures
> > +except s390 apply, is to:
> > +
> > +- set ``vcpu->mode`` to IN_GUEST_MODE between disabling the interrupts and
> > +  the last kvm_request_pending() check;
> > +- enable interrupts atomically when entering the guest.
> > +
> > +This solution also requires memory barriers to be placed carefully in both
> > +the requesting thread and the receiving VCPU.  With the memory barriers we
> > +can exclude the possibility of a VCPU thread observing
> > +!kvm_request_pending() on its last check and then not receiving an IPI for
> > +the next request made of it, even if the request is made immediately after
> > +the check.  This is done by way of the Dekker memory barrier pattern
> > +(scenario 10 of [lwn-mb]_).  As the Dekker pattern requires two variables,
> > +this solution pairs ``vcpu->mode`` with ``vcpu->requests``.  Substituting
> > +them into the pattern gives::
> > +
> > +  CPU1                                    CPU2
> > +  =================                       =================
> > +  local_irq_disable();
> > +  WRITE_ONCE(vcpu->mode, IN_GUEST_MODE);  kvm_make_request(REQ, vcpu);
> > +  smp_mb();                               smp_mb();
> > +  if (kvm_request_pending(vcpu)) {        if (READ_ONCE(vcpu->mode) ==
> > +                                              IN_GUEST_MODE) {
> > +      ...abort guest entry...                 ...send IPI...
> > +  }                                       }
> > +
> > +As stated above, the IPI is only useful for VCPU threads in guest mode or
> > +that have already disabled interrupts.  This is why this specific case of
> > +the Dekker pattern has been extended to disable interrupts before setting
> > +``vcpu->mode`` to IN_GUEST_MODE.  WRITE_ONCE() and READ_ONCE() are used to
> > +pedantically implement the memory barrier pattern, guaranteeing the
> > +compiler doesn't interfere with ``vcpu->mode``'s carefully planned
> > +accesses.
> > +
> > +IPI Reduction
> > +-------------
> > +
> > +As only one IPI is needed to get a VCPU to check for any/all requests,
> > +IPIs may be coalesced.  This is easily done by having the first
> > +IPI-sending kick also change the VCPU mode to something !IN_GUEST_MODE.
> > +The transitional state, EXITING_GUEST_MODE, is used for this purpose.
> > +
> > +Waiting for Acknowledgements
> > +----------------------------
> > +
> > +Some requests, those with the KVM_REQUEST_WAIT flag set, require IPIs to
> > +be sent, and the acknowledgements to be waited upon, even when the target
> > +VCPU threads are in modes other than IN_GUEST_MODE.  For example, one case
> > +is when a target VCPU thread is in READING_SHADOW_PAGE_TABLES mode, which
> > +is set after disabling interrupts.  For these cases, the "should send an
> > +IPI" condition becomes READ_ONCE(``vcpu->mode``) != OUTSIDE_GUEST_MODE.
> 
> Hmm, did you mean, "To support these cases, the condition for sending an
> IPI is not that the mode equals IN_GUEST_MODE, but that it differs from
> OUTSIDE_GUEST_MODE."?

Yup, I'll clarify it.

> 
> (The confusion is whether we check different things depending on which
> request we're dealing with, or just explaining why the single
> implementation is done the way it is.)

Checking different things depends on the request type (requests
flagged with KVM_REQUEST_WAIT). I'll clarify this.
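
Roughly, and only as a user-space model with invented model_ names, the
per-request check I have in mind looks like the sketch below.  The
compare-and-exchange from IN_GUEST_MODE to EXITING_GUEST_MODE is how I
picture the "first IPI sending kick also changes the mode" part of "IPI
Reduction", and the bit position of the WAIT flag (bit 9) is an
assumption made just for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_REQUEST_WAIT  (1u << 9)	/* assumed flag position */

enum {
	MODEL_OUTSIDE_GUEST_MODE,
	MODEL_IN_GUEST_MODE,
	MODEL_EXITING_GUEST_MODE,
	MODEL_READING_SHADOW_PAGE_TABLES,
};

struct model_vcpu {
	atomic_int mode;
};

/* Decide whether a kick for request @req has to send an IPI. */
static bool model_request_needs_ipi(struct model_vcpu *vcpu, unsigned int req)
{
	int seen = MODEL_IN_GUEST_MODE;

	assert(vcpu != NULL);
	/*
	 * If the VCPU is IN_GUEST_MODE, move it to EXITING_GUEST_MODE so
	 * later kicks need no IPI; otherwise @seen receives the current mode.
	 */
	atomic_compare_exchange_strong(&vcpu->mode, &seen,
				       MODEL_EXITING_GUEST_MODE);
	if (req & MODEL_REQUEST_WAIT)
		return seen != MODEL_OUTSIDE_GUEST_MODE;
	return seen == MODEL_IN_GUEST_MODE;
}
```

So requests without the flag only send an IPI to a VCPU seen in
IN_GUEST_MODE, while flagged requests also send one to VCPUs in
EXITING_GUEST_MODE or READING_SHADOW_PAGE_TABLES.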

> 
> > +
> > +Request-less VCPU Kicks
> > +-----------------------
> > +
> > +As the determination of whether or not to send an IPI depends on the
> > +two-variable Dekker memory barrier pattern, it's clear that
> > +request-less VCPU kicks are almost never correct.  Without the assurance
> > +that a non-IPI generating kick will still result in an action by the
> > +receiving VCPU, as the final kvm_request_pending() check does for
> > +request-accompanying kicks, the kick may not do anything useful at
> > +all.  If, for instance, a request-less kick was made to a VCPU that was
> > +just about to set its mode to IN_GUEST_MODE, meaning no IPI is sent, then
> > +the VCPU thread may continue its entry without actually having done
> > +whatever it was the kick was meant to initiate.
> > +
> > +One exception is x86's posted interrupt mechanism.  In this case, however,
> > +even the request-less VCPU kick is coupled with the same
> > +local_irq_disable() + smp_mb() pattern described above; the ON bit
> > +(Outstanding Notification) in the posted interrupt descriptor takes the
> > +role of ``vcpu->requests``.  When sending a posted interrupt, PIR.ON is
> > +set before reading ``vcpu->mode``; dually, in the VCPU thread,
> > +vmx_sync_pir_to_irr() reads PIR after setting ``vcpu->mode`` to
> > +IN_GUEST_MODE.
> > +
> > +Additional Considerations
> > +=========================
> > +
> > +Sleeping VCPUs
> > +--------------
> > +
> > +VCPU threads may need to consider requests before and/or after calling
> > +functions that may put them to sleep, e.g. kvm_vcpu_block().  Whether they
> > +do or not, and, if they do, which requests need consideration, is
> > +architecture dependent.  kvm_vcpu_block() calls kvm_arch_vcpu_runnable()
> > +to check if it should awaken.  One reason to do so is to provide
> > +architectures a function where requests may be checked if necessary.
> > +
> > +Clearing Requests
> > +-----------------
> > +
> > +Generally it only makes sense for the receiving VCPU thread to clear a
> > +request.  However, in some circumstances, such as when the requesting
> > +thread is executing synchronously with the receiving VCPU thread, it's
> 
> what does it mean that the requesting thread is executing synchronously
> with the receiving VCPU thread?  How can that ever be enforced on SMP
> system, if the two threads are not the same thread?
> 
> If we simply mean that some sequence of operations between the two
> threads is synchronized (for example using locks), then that's a
> different (less general) case from the two threads overall executing
> synchronously, I think?

I meant with locks - not a general synchronization, but a temporarily
enforced one. I'll try to clarify that language.

> 
> > +possible to know that the request may be cleared immediately, rather than
> > +waiting for the receiving VCPU thread to handle the request in VCPU RUN.
> > +The only current examples of this are kvm_vcpu_block() calls, where a
> > +side-effect of a call may be to set KVM_REQ_UNHALT.  When the requesting
> > +thread is itself the receiving VCPU, 
> 
> when the requesting thread and the VCPU thread are the same thread?

I'll use your language here, as it's clearer.

> 
> > then it's possible to know that the
> > +request does not need to be handled in VCPU RUN, and therefore may be
> > +cleared immediately.
> > +
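
For the same-thread case, a user-space sketch of what I mean follows.  The
model_ helpers and the request number are invented for this mail, and the
sketch assumes an architecture that has no use for KVM_REQ_UNHALT in VCPU
RUN; it is not meant as the code of any particular architecture.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_REQ_UNHALT 3u	/* made-up request number */

struct model_vcpu {
	atomic_ulong requests;
};

static bool model_test_request(struct model_vcpu *vcpu, unsigned int req)
{
	return (atomic_load(&vcpu->requests) & (1ul << req)) != 0;
}

/* Stand-in for kvm_vcpu_block(): sets UNHALT only if a wake event arrived. */
static void model_vcpu_block(struct model_vcpu *vcpu, bool wake_event)
{
	if (wake_event)
		atomic_fetch_or(&vcpu->requests, 1ul << MODEL_REQ_UNHALT);
}

/* Halt emulation in which requester and receiver are the same thread. */
static void model_emulate_halt(struct model_vcpu *vcpu, bool wake_event)
{
	model_vcpu_block(vcpu, wake_event);
	/*
	 * Assumption: only this thread's model_vcpu_block() call can have
	 * set UNHALT, so it can be dropped here instead of in VCPU RUN.
	 */
	atomic_fetch_and(&vcpu->requests, ~(1ul << MODEL_REQ_UNHALT));
	assert(!model_test_request(vcpu, MODEL_REQ_UNHALT));
}
```

Other pending request bits are left alone; only the one request this
thread may have made of itself is cleared.
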
> > +References
> > +==========
> > +
> > +.. [atomic-ops] Documentation/core-api/atomic_ops.rst
> > +.. [memory-barriers] Documentation/memory-barriers.txt
> > +.. [lwn-mb] https://lwn.net/Articles/573436/
> > -- 
> > 2.9.3
> > 
> 
> This write up is excellent.

Thanks!

> 
> My comments are mostly for clarification, so:
> 
> Acked-by: Christoffer Dall <cdall@xxxxxxxxxx>
> 
> 
> Thanks,
> -Christoffer

drew