Re: 3 preempted variables in kvm

Sean Christopherson <seanjc@xxxxxxxxxx> · Fri, 22 Jan 2021 11:30:04 -0800

On Fri, Jan 22, 2021, Alex Shi wrote:
> Hi All,
> 
> I am newbie on KVM side, so probably I am wrong on the following.
> Please correct me if it is.
> 
> There are 3 preempted variables in kvm:
>      1, kvm_vcpu.preempted  in include/linux/kvm_host.h
>      2, kvm_steal_time.preempted
>      3, kvm_vcpu_arch.st.preempted in arch/x86
> Seems all of them are set or cleared at the same time. Like,

Not quite.  kvm_vcpu.preempted is set only in kvm_sched_out(), i.e. when the
vCPU was running and preempted by the host scheduler.  This is used by KVM when
KVM detects that a guest task appears to be waiting on a lock, in which case KVM
will bump the priority of preempted guest kernel threads in the hope that
scheduling in the preempted vCPU will release the lock.
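
As a rough illustration of that heuristic (not the actual KVM code), the
selection could be sketched as below.  struct vcpu_sketch, its fields and
pick_yield_target() are hypothetical names made up for this example, and the
real logic considers more than these two flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, pared-down vCPU state; not the real struct kvm_vcpu. */
struct vcpu_sketch {
	bool preempted;	/* set on sched-out, cleared on sched-in */
	bool in_kernel;	/* vCPU was last running guest kernel code */
};

/*
 * Return the index of the first vCPU worth boosting on behalf of a spinning
 * vCPU, or -1 if there is no candidate.  Only vCPUs that were preempted by
 * the host while in guest kernel mode are considered, since those are the
 * ones that might be holding the contended lock.
 */
static int pick_yield_target(const struct vcpu_sketch *vcpus, size_t n,
			     size_t spinner)
{
	for (size_t i = 0; i < n; i++) {
		if (i == spinner)
			continue;
		if (!vcpus[i].preempted || !vcpus[i].in_kernel)
			continue;
		return (int)i;
	}
	return -1;
}

/* Small self-check: vCPU 0 spins, vCPU 2 is the only viable target. */
static void demo_yield(void)
{
	struct vcpu_sketch v[3] = {
		{ .preempted = false, .in_kernel = true },
		{ .preempted = true,  .in_kernel = false },
		{ .preempted = true,  .in_kernel = true },
	};

	assert(pick_yield_target(v, 3, 0) == 2);
	v[2].preempted = false;
	assert(pick_yield_target(v, 3, 0) == -1);
}
```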

kvm_steal_time.preempted is a paravirt struct that is shared with the guest.  It
is set on any call to kvm_arch_vcpu_put(), which covers kvm_sched_out() and adds
the case where the vCPU exits to userspace, e.g. for IO.  KVM itself hasn't been
preempted, but from the guest's perspective the CPU has been "preempted" in the
sense that the CPU is not executing guest code.
Similar to KVM's vCPU scheduling heuristics, the guest kernel uses this info to
inform its scheduling, e.g. to avoid waiting on a lock owner to drop the lock
since the lock owner is not actively running.
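
To make the guest side concrete, here is a minimal userspace sketch of that
check.  KVM_VCPU_PREEMPTED is the real flag name and bit; steal_time_sketch
and the two helpers are stand-ins invented for the example, not the guest
kernel's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define KVM_VCPU_PREEMPTED	(1 << 0)

/* Hypothetical stand-in for the per-vCPU kvm_steal_time area. */
struct steal_time_sketch {
	volatile uint8_t preempted;	/* written by the host, read by the guest */
};

/* Guest-side query: has the host marked this vCPU as not running? */
static bool vcpu_is_preempted_sketch(const struct steal_time_sketch *st)
{
	return (st->preempted & KVM_VCPU_PREEMPTED) != 0;
}

/* Spinning on a lock is only worthwhile while the owner's vCPU is running. */
static bool worth_spinning_on(const struct steal_time_sketch *owner)
{
	return !vcpu_is_preempted_sketch(owner);
}

/* Small self-check of both states. */
static void demo_guest(void)
{
	struct steal_time_sketch owner = { .preempted = 0 };

	assert(worth_spinning_on(&owner));
	owner.preempted = KVM_VCPU_PREEMPTED;
	assert(!worth_spinning_on(&owner));
}
```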

kvm_vcpu_arch.st.preempted is effectively a host-side cache of
kvm_steal_time.preempted that's used to optimize kvm_arch_vcpu_put() by avoiding
the moderately costly mapping of guest memory.  It could be dropped, but it's a
single byte per vCPU, so it's worth keeping even if the performance benefits are
modest.
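
The caching idea, sketched in plain C under simplifying assumptions: the
struct, field and function names below are hypothetical, and the "maps"
counter merely stands in for the cost of mapping the guest's steal-time area.
The point is only that a repeated put, e.g. one that happens without the vCPU
having re-entered the guest in between, can bail out early.

```c
#include <assert.h>
#include <stdint.h>

#define KVM_VCPU_PREEMPTED	(1 << 0)

/* Hypothetical model of the host-side state for one vCPU. */
struct vcpu_st_sketch {
	uint8_t cached_preempted;	/* stand-in for kvm_vcpu_arch.st.preempted */
	uint8_t guest_preempted;	/* stand-in for kvm_steal_time.preempted */
	unsigned int maps;		/* counts simulated guest-memory mappings */
};

/* Put path: skip the costly mapping if the flag is already known to be set. */
static void set_preempted_sketch(struct vcpu_st_sketch *v)
{
	if (v->cached_preempted)
		return;

	v->maps++;	/* pretend to map the guest page */
	v->guest_preempted = KVM_VCPU_PREEMPTED;
	v->cached_preempted = KVM_VCPU_PREEMPTED;
}

/* Guest-entry path: the flag is cleared in both places. */
static void clear_preempted_sketch(struct vcpu_st_sketch *v)
{
	v->maps++;	/* pretend to map the guest page */
	v->guest_preempted = 0;
	v->cached_preempted = 0;
}

/* Small self-check: the second back-to-back put does not map again. */
static void demo_cache(void)
{
	struct vcpu_st_sketch v = { 0 };

	set_preempted_sketch(&v);
	set_preempted_sketch(&v);
	assert(v.maps == 1);

	clear_preempted_sketch(&v);
	set_preempted_sketch(&v);
	assert(v.maps == 3);
}
```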

> vcpu_put:
>         kvm_sched_out()-> set 3 preempted
>                 kvm_arch_vcpu_put():
>                         kvm_steal_time_set_preempted
> 
> vcpu_load:
>         kvm_sched_in() : clear above 3 preempted
>                 kvm_arch_vcpu_load() -> kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
>                 request dealed in vcpu_enter_guest() -> record_steal_time
> 
> Except the 2nd one reuse with KVM_FEATURE_PV_TLB_FLUSH bit which could be used
> separately, Could we combine them into one, like just bool kvm_vcpu.preempted? and 
> move out the KVM_FEATURE_PV_TLB_FLUSH. Believe all arch need this for a vcpu overcommit.

Moving KVM_VCPU_FLUSH_TLB out of kvm_steal_time.preempted isn't viable. The
guest kernel is only allowed to rely on the host to flush the vCPU's TLB if it
knows the vCPU is preempted (from its perspective), as that's the only way it
can guarantee that KVM will observe the TLB flush request before entering the
vCPU.  KVM_VCPU_FLUSH_TLB and KVM_VCPU_PREEMPTED need to be in the same word so
KVM can read and clear them atomically, otherwise there would be a window where
KVM would miss the flush request.
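
A userspace sketch of that handshake, using C11 atomics in place of the
kernel's cmpxchg/xchg primitives.  The two flag names and bit values are the
real ones; st_preempted and the three helper functions are made up for the
example and only approximate what the guest and host do.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define KVM_VCPU_PREEMPTED	(1 << 0)
#define KVM_VCPU_FLUSH_TLB	(1 << 1)

/* Stand-in for the shared kvm_steal_time.preempted byte of one vCPU. */
static _Atomic uint8_t st_preempted;

/* Host, on vCPU put: mark the vCPU as preempted. */
static void host_mark_preempted(void)
{
	atomic_store(&st_preempted, KVM_VCPU_PREEMPTED);
}

/*
 * Guest: request a deferred flush, but only if the target vCPU is still
 * preempted.  The compare-and-exchange fails if the host cleared the byte
 * in the meantime, in which case the caller must fall back to an IPI.
 */
static bool guest_request_flush(void)
{
	uint8_t old = atomic_load(&st_preempted);

	while (old & KVM_VCPU_PREEMPTED) {
		if (atomic_compare_exchange_weak(&st_preempted, &old,
						 old | KVM_VCPU_FLUSH_TLB))
			return true;
	}
	return false;
}

/*
 * Host, before entering the vCPU: read and clear both flags in a single
 * atomic operation, so a flush request can never land between the read and
 * the clear.  Returns true if the TLB must be flushed.
 */
static bool host_enter_needs_flush(void)
{
	uint8_t old = atomic_exchange(&st_preempted, 0);

	return (old & KVM_VCPU_FLUSH_TLB) != 0;
}

/* Small self-check of the two interesting orderings. */
static void demo_flush(void)
{
	host_mark_preempted();
	assert(guest_request_flush());		/* vCPU preempted: deferred */
	assert(host_enter_needs_flush());	/* host sees the request */

	assert(!guest_request_flush());		/* vCPU running: needs an IPI */
	assert(!host_enter_needs_flush());
}
```

If the two flags lived in separate words, the host would have to clear
PREEMPTED and check FLUSH_TLB as two steps, and a request set by the guest
between those steps would go unnoticed.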