On Tue, Apr 04, 2017 at 05:24:03PM +0200, Christoffer Dall wrote: > Hi Drew, > > On Fri, Mar 31, 2017 at 06:06:51PM +0200, Andrew Jones wrote: > > Signed-off-by: Andrew Jones <drjones@xxxxxxxxxx> > > --- > > Documentation/virtual/kvm/vcpu-requests.rst | 114 ++++++++++++++++++++++++++++ > > 1 file changed, 114 insertions(+) > > create mode 100644 Documentation/virtual/kvm/vcpu-requests.rst > > > > diff --git a/Documentation/virtual/kvm/vcpu-requests.rst b/Documentation/virtual/kvm/vcpu-requests.rst > > new file mode 100644 > > index 000000000000..ea4a966d5c8a > > --- /dev/null > > +++ b/Documentation/virtual/kvm/vcpu-requests.rst > > @@ -0,0 +1,114 @@ > > +================= > > +KVM VCPU Requests > > +================= > > + > > +Overview > > +======== > > + > > +KVM supports an internal API enabling threads to request a VCPU thread to > > +perform some activity. For example, a thread may request a VCPU to flush > > +its TLB with a VCPU request. The API consists of only four calls:: > > + > > + /* Check if VCPU @vcpu has request @req pending. Clears the request. */ > > + bool kvm_check_request(int req, struct kvm_vcpu *vcpu); > > + > > + /* Check if any requests are pending for VCPU @vcpu. */ > > + bool kvm_request_pending(struct kvm_vcpu *vcpu); > > + > > + /* Make request @req of VCPU @vcpu. */ > > + void kvm_make_request(int req, struct kvm_vcpu *vcpu); > > + > > + /* Make request @req of all VCPUs of the VM with struct kvm @kvm. */ > > + bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req); > > + > > +Typically a requester wants the VCPU to perform the activity as soon > > +as possible after making the request. This means most requests, > > +kvm_make_request() calls, are followed by a call to kvm_vcpu_kick(), > > +and kvm_make_all_cpus_request() has the kicking of all VCPUs built > > +into it. > > + > > +VCPU Kicks > > +---------- > > + > > +A VCPU kick does one of three things: > > + > > + 1) wakes a sleeping VCPU (which sleeps outside guest mode). > > You could clarify this to say that a sleeping VCPU is a VCPU thread > which is not runnable and placed on waitqueue, and waking it makes > the thread runnable again. > > > + 2) sends an IPI to a VCPU currently in guest mode, in order to bring it > > + out. > > + 3) nothing, when the VCPU is already outside guest mode and not sleeping. > > + > > +VCPU Request Internals > > +====================== > > + > > +VCPU requests are simply bit indices of the vcpu->requests bitmap. This > > +means general bitops[1], e.g. clear_bit(KVM_REQ_UNHALT, &vcpu->requests), > > +may also be used. The first 8 bits are reserved for architecture > > +independent requests, all additional bits are available for architecture > > +dependent requests. > > Should we explain the ones that are generically defined and how they're > supposed to be used? For example, we don't use them on ARM, and I don't > think I understand why another thread would ever make a PENDING_TIMER > request on a vcpu? Yes, I agree the general requests should be described. I'll have to figure out how :-) Describing KVM_REQ_UNHALT will likely lead to a subsection on kvm_vcpu_block(), as you bring up below. > > > + > > +VCPU Requests with Associated State > > +=================================== > > + > > +Requesters that want the requested VCPU to handle new state need to ensure > > +the state is observable to the requested VCPU thread's CPU at the time the > > nit: need to ensure that the newly written state is observable ... by > the time it observed the request. > > > +CPU observes the request. This means a write memory barrier should be > ^^^ > must > > > +insert between the preparation of the state and the write of the VCPU > ^^^ > inserted > > I would rephrase this as: '... after writing the new state to memory and > before setting the VCPU request bit.' > > > > +request bitmap. Additionally, on the requested VCPU thread's side, a > > +corresponding read barrier should be issued after reading the request bit > ^^^ ^^^ > must inserted (for consistency) > > > > > +and before proceeding to use the state associated with it. See the kernel > ^^^ ^ > read new > > > > +memory barrier documentation [2]. > > I think it would be great if this document explains if this is currently > taken care of by the API you explain above or if there are cases where > people have to explicitly insert these barriers, and in that case, which > barriers they should use (if we know at this point already). Will do. The current API does take care of it. I'll state that. I'd have to grep around to see if there are any non-API users that also need barriers, but as they could change, I probably wouldn't want to call them out is the doc. So I guess I'll still just wave my hand at that type of use. > > > + > > +VCPU Requests and Guest Mode > > +============================ > > + > > I feel like an intro about the overall goal here is missing. How about > something like this: > > When making requests to VCPUs, we want to avoid the receiving VCPU > executing inside the guest for an arbitrary long time without handling > the request. The way we prevent this from happening is by keeping > track of when a VCPU is running and sending an IPI to the physical CPU > running the VCPU when that is the case. However, each architecture > implementation of KVM must take great care to ensure that requests are > not missed when a VCPU stops running at the same time when a request > is received. > > Also, I'm not sure what the semantics are with kvm_vcpu_block(). Is it > ok to send a request to a VCPU and then the VCPU blocks and goes to > sleep forever even though there are pending requests? > kvm_vcpu_check_block() doesn't seem to check vcpu->requests which would > indicate that this is the case, but maybe architectures that actually do > use requests implement something else themselves? I'll add a kvm_vcpu_block() subsection as part of the KVM_REQ_UNHALT documentation. > > > +As long as the guest is either in guest mode, in which case it gets an IPI > > guest is in guest mode? oops, s/guest/vcpu/ > > Perhaps this could be more clearly written as: > > As long as the VCPU is running, it is marked as having vcpu->mode = > IN_GUEST MODE. A requesting thread observing IN_GUEST_MODE will send an > IPI to the CPU running the VCPU thread. On the other hand, when a > requesting thread observes vcpu->mode == OUTSIDE_GUEST_MODE, it will not send > any IPIs, but will simply set the request bit, a the VCPU thread will be > able to check the requests before running the VCPU again. However, the > transition... > > > +and will definitely see the request, or is outside guest mode, but has yet > > +to do its final request check, and therefore when it does, it will see the > > +request, then things will work. However, the transition from outside to > > +inside guest mode, after the last request check has been made, opens a > > +window where a request could be made, but the VCPU would not see until it > > +exits guest mode some time later. See the table below. > > This text, and the table below, only deals with the details of entering > the guest. Should we talk about kvm_vcpu_exiting_guest_mode() and > anything related to exiting the guest? I think all !IN_GUEST_MODE should behave the same, so I was avoiding the use of EXITING_GUEST_MODE and OUTSIDE_GUEST_MODE, which wouldn't be hard to address, but then I'd also have to address READING_SHADOW_PAGE_TABLES, which may complicate the document more than necessary. I'm not sure we need to address a VCPU exiting guest mode, other than making sure it's clear that a VCPU that exits must check requests before it enters again. > > > + > > ++------------------+-----------------+----------------+--------------+ > > +| vcpu->mode | done last check | kick sends IPI | request seen | > > ++==================+=================+================+==============+ > > +| IN_GUEST_MODE | N/A | YES | YES | > > ++------------------+-----------------+----------------+--------------+ > > +| !IN_GUEST_MODE | NO | NO | YES | > > ++------------------+-----------------+----------------+--------------+ > > +| !IN_GUEST_MODE | YES | NO | NO | > > ++------------------+-----------------+----------------+--------------+ > > + > > +To ensure the third scenario shown in the table above cannot happen, we > > +need to ensure the VCPU's mode change is observable by all CPUs prior to > > +its final request check and that a requester's request is observable by > > +the requested VCPU prior to the kick. To do that we need general memory > > +barriers between each pair of operations involving mode and requests, i.e. > > + > > + CPU_i CPU_j > > +------------------------------------------------------------------------- > > + vcpu->mode = IN_GUEST_MODE; kvm_make_request(REQ, vcpu); > > + smp_mb(); smp_mb(); > > + if (kvm_request_pending(vcpu)) if (vcpu->mode == IN_GUEST_MODE) > > + handle_requests(); send_IPI(vcpu->cpu); > > + > > +Whether explicit barriers are needed, or reliance on implicit barriers is > > +sufficient, is architecture dependent. Alternatively, an architecture may > > +choose to just always send the IPI, as not sending it, when it's not > > +necessary, is just an optimization. > > Is this universally true? This is certainly true on ARM, because we > disable interrupts before doing all this, so the IPI remains pending and > causes an immediate exit, but if any of the above is done with > interrupts enabled, just sending an IPI does nothing to ensure the > request is observed. Perhaps this is not a case we should care about. I'll try to make this less generic, as some architectures may not work this way. Indeed, s390 doesn't seem to have kvm_vcpu_kick(), so I guess things don't work this way for them. > > > + > > +Additionally, the error prone third scenario described above also exhibits > > +why a request-less VCPU kick is almost never correct. Without the > > +assurance that a non-IPI generating kick will still result in an action by > > +the requested VCPU, as the final kvm_request_pending() check does, then > > +the kick may not initiate anything useful at all. If, for instance, a > > +request-less kick was made to a VCPU that was just about to set its mode > > +to IN_GUEST_MODE, meaning no IPI is sent, then the VCPU may continue its > > +entry without actually having done whatever it was the kick was meant to > > +initiate. > > Indeed. > > > > + > > +References > > +========== > > + > > +[1] Documentation/core-api/atomic_ops.rst > > +[2] Documentation/memory-barriers.txt > > -- > > 2.9.3 > > > > This is a great writeup! I enjoyed reading it and it made me think more > carefully about a number of things, so I definitely think we should > merge this. > Thanks Christoffer! I'll take all your suggestions above and try to answer your questions for v2. drew