Re: [PATCH v2 2/9] KVM: Add documentation for VCPU requests

Hi Drew,

On Fri, Mar 31, 2017 at 06:06:51PM +0200, Andrew Jones wrote:
> Signed-off-by: Andrew Jones <drjones@xxxxxxxxxx>
> ---
>  Documentation/virtual/kvm/vcpu-requests.rst | 114 ++++++++++++++++++++++++++++
>  1 file changed, 114 insertions(+)
>  create mode 100644 Documentation/virtual/kvm/vcpu-requests.rst
> 
> diff --git a/Documentation/virtual/kvm/vcpu-requests.rst b/Documentation/virtual/kvm/vcpu-requests.rst
> new file mode 100644
> index 000000000000..ea4a966d5c8a
> --- /dev/null
> +++ b/Documentation/virtual/kvm/vcpu-requests.rst
> @@ -0,0 +1,114 @@
> +=================
> +KVM VCPU Requests
> +=================
> +
> +Overview
> +========
> +
> +KVM supports an internal API enabling threads to request a VCPU thread to
> +perform some activity.  For example, a thread may request a VCPU to flush
> +its TLB with a VCPU request.  The API consists of only four calls::
> +
> +  /* Check if VCPU @vcpu has request @req pending. Clears the request. */
> +  bool kvm_check_request(int req, struct kvm_vcpu *vcpu);
> +
> +  /* Check if any requests are pending for VCPU @vcpu. */
> +  bool kvm_request_pending(struct kvm_vcpu *vcpu);
> +
> +  /* Make request @req of VCPU @vcpu. */
> +  void kvm_make_request(int req, struct kvm_vcpu *vcpu);
> +
> +  /* Make request @req of all VCPUs of the VM with struct kvm @kvm. */
> +  bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req);
> +
> +Typically a requester wants the VCPU to perform the activity as soon
> +as possible after making the request.  This means most requests,
> +kvm_make_request() calls, are followed by a call to kvm_vcpu_kick(),
> +and kvm_make_all_cpus_request() has the kicking of all VCPUs built
> +into it.
> +
> +VCPU Kicks
> +----------
> +
> +A VCPU kick does one of three things:
> +
> + 1) wakes a sleeping VCPU (which sleeps outside guest mode).

You could clarify this to say that a sleeping VCPU is a VCPU thread
which is not runnable and has been placed on a waitqueue, and that
waking it makes the thread runnable again.

> + 2) sends an IPI to a VCPU currently in guest mode, in order to bring it
> +    out.
> + 3) nothing, when the VCPU is already outside guest mode and not sleeping.
> +
> +VCPU Request Internals
> +======================
> +
> +VCPU requests are simply bit indices of the vcpu->requests bitmap.  This
> +means general bitops[1], e.g. clear_bit(KVM_REQ_UNHALT, &vcpu->requests),
> +may also be used.  The first 8 bits are reserved for architecture
> +independent requests, all additional bits are available for architecture
> +dependent requests.

Should we explain the ones that are generically defined and how they're
supposed to be used?  For example, we don't use them on ARM, and I don't
think I understand why another thread would ever make a PENDING_TIMER
request on a vcpu?

> +
> +VCPU Requests with Associated State
> +===================================
> +
> +Requesters that want the requested VCPU to handle new state need to ensure
> +the state is observable to the requested VCPU thread's CPU at the time the

nit: need to ensure that the newly written state is observable ... by
the time it observes the request.

> +CPU observes the request.  This means a write memory barrier should be
                                                                 ^^^
							         must

> +insert between the preparation of the state and the write of the VCPU
    ^^^
   inserted

I would rephrase this as: '... after writing the new state to memory and
before setting the VCPU request bit.'


> +request bitmap.  Additionally, on the requested VCPU thread's side, a
> +corresponding read barrier should be issued after reading the request bit
                                ^^^       ^^^
			       must      inserted (for consistency)



> +and before proceeding to use the state associated with it.  See the kernel
                            ^^^    ^
		           read    new


> +memory barrier documentation [2].

I think it would be great if this document explains if this is currently
taken care of by the API you explain above or if there are cases where
people have to explicitly insert these barriers, and in that case, which
barriers they should use (if we know at this point already).

> +
> +VCPU Requests and Guest Mode
> +============================
> +

I feel like an intro about the overall goal here is missing.  How about
something like this:

  When making requests to VCPUs, we want to avoid the receiving VCPU
  executing inside the guest for an arbitrarily long time without
  handling the request.  The way we prevent this from happening is by
  keeping track of when a VCPU is running and sending an IPI to the
  physical CPU running the VCPU when that is the case.  However, each
  architecture implementation of KVM must take great care to ensure
  that requests are not missed when a VCPU stops running at the same
  time a request is received.

Also, I'm not sure what the semantics are with kvm_vcpu_block().  Is it
ok to send a request to a VCPU and then the VCPU blocks and goes to
sleep forever even though there are pending requests?
kvm_vcpu_check_block() doesn't seem to check vcpu->requests which would
indicate that this is the case, but maybe architectures that actually do
use requests implement something else themselves?

> +As long as the guest is either in guest mode, in which case it gets an IPI

guest is in guest mode?

Perhaps this could be more clearly written as:

As long as the VCPU is running, it is marked as having vcpu->mode =
IN_GUEST_MODE.  A requesting thread observing IN_GUEST_MODE will send an
IPI to the CPU running the VCPU thread.  On the other hand, when a
requesting thread observes vcpu->mode == OUTSIDE_GUEST_MODE, it will not
send any IPIs, but will simply set the request bit, and the VCPU thread
will be able to check the requests before running the VCPU again.
However, the transition...

> +and will definitely see the request, or is outside guest mode, but has yet
> +to do its final request check, and therefore when it does, it will see the
> +request, then things will work.  However, the transition from outside to
> +inside guest mode, after the last request check has been made, opens a
> +window where a request could be made, but the VCPU would not see until it
> +exits guest mode some time later.  See the table below.

This text, and the table below, only deals with the details of entering
the guest.  Should we talk about kvm_vcpu_exiting_guest_mode() and
anything related to exiting the guest?

> +
> ++------------------+-----------------+----------------+--------------+
> +| vcpu->mode       | done last check | kick sends IPI | request seen |
> ++==================+=================+================+==============+
> +| IN_GUEST_MODE    |      N/A        |      YES       |     YES      |
> ++------------------+-----------------+----------------+--------------+
> +| !IN_GUEST_MODE   |      NO         |      NO        |     YES      |
> ++------------------+-----------------+----------------+--------------+
> +| !IN_GUEST_MODE   |      YES        |      NO        |     NO       |
> ++------------------+-----------------+----------------+--------------+
> +
> +To ensure the third scenario shown in the table above cannot happen, we
> +need to ensure the VCPU's mode change is observable by all CPUs prior to
> +its final request check and that a requester's request is observable by
> +the requested VCPU prior to the kick.  To do that we need general memory
> +barriers between each pair of operations involving mode and requests, i.e.
> +
> +  CPU_i                                  CPU_j
> +-------------------------------------------------------------------------
> +  vcpu->mode = IN_GUEST_MODE;            kvm_make_request(REQ, vcpu);
> +  smp_mb();                              smp_mb();
> +  if (kvm_request_pending(vcpu))         if (vcpu->mode == IN_GUEST_MODE)
> +      handle_requests();                     send_IPI(vcpu->cpu);
> +
> +Whether explicit barriers are needed, or reliance on implicit barriers is
> +sufficient, is architecture dependent.  Alternatively, an architecture may
> +choose to just always send the IPI, as not sending it, when it's not
> +necessary, is just an optimization.

Is this universally true?  This is certainly true on ARM, because we
disable interrupts before doing all this, so the IPI remains pending and
causes an immediate exit, but if any of the above is done with
interrupts enabled, just sending an IPI does nothing to ensure the
request is observed.  Perhaps this is not a case we should care about.

> +
> +Additionally, the error prone third scenario described above also exhibits
> +why a request-less VCPU kick is almost never correct.  Without the
> +assurance that a non-IPI generating kick will still result in an action by
> +the requested VCPU, as the final kvm_request_pending() check does, then
> +the kick may not initiate anything useful at all.  If, for instance, a
> +request-less kick was made to a VCPU that was just about to set its mode
> +to IN_GUEST_MODE, meaning no IPI is sent, then the VCPU may continue its
> +entry without actually having done whatever it was the kick was meant to
> +initiate.

Indeed.


> +
> +References
> +==========
> +
> +[1] Documentation/core-api/atomic_ops.rst
> +[2] Documentation/memory-barriers.txt
> -- 
> 2.9.3
> 

This is a great writeup!  I enjoyed reading it and it made me think more
carefully about a number of things, so I definitely think we should
merge this.

Thanks,
-Christoffer


