Hi,

On Wed, Jul 10, 2024 at 04:44:10PM +0900, Suleiman Souhlal wrote:
>When the host resumes from a suspend, the guest thinks any task
>that was running during the suspend ran for a long time, even though
>the effective run time was much shorter, which can end up having
>negative effects with scheduling. This can be particularly noticeable
>if the guest task was RT, as it can end up getting throttled for a
>long time.
>
>To mitigate this issue, we include the time that the host was
>suspended in steal time, which lets the guest can subtract the
>duration from the tasks' runtime.
>
>Signed-off-by: Suleiman Souhlal <suleiman@xxxxxxxxxx>
>---
> arch/x86/kvm/x86.c       | 23 ++++++++++++++++++++++-
> include/linux/kvm_host.h |  4 ++++
> 2 files changed, 26 insertions(+), 1 deletion(-)
>
>diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>index 0763a0f72a067f..94bbdeef843863 100644
>--- a/arch/x86/kvm/x86.c
>+++ b/arch/x86/kvm/x86.c
>@@ -3669,7 +3669,7 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
> 	struct kvm_steal_time __user *st;
> 	struct kvm_memslots *slots;
> 	gpa_t gpa = vcpu->arch.st.msr_val & KVM_STEAL_VALID_BITS;
>-	u64 steal;
>+	u64 steal, suspend_duration;
> 	u32 version;
> 
> 	if (kvm_xen_msr_enabled(vcpu->kvm)) {
>@@ -3696,6 +3696,12 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
> 		return;
> 	}
> 
>+	suspend_duration = 0;
>+	if (READ_ONCE(vcpu->suspended)) {
>+		suspend_duration = vcpu->kvm->last_suspend_duration;
>+		vcpu->suspended = 0;

Can you explain why READ_ONCE() is necessary here, but WRITE_ONCE()
isn't used for clearing vcpu->suspended?

>+	}
>+
> 	st = (struct kvm_steal_time __user *)ghc->hva;
> 	/*
> 	 * Doing a TLB flush here, on the guest's behalf, can avoid
>@@ -3749,6 +3755,7 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
> 	unsafe_get_user(steal, &st->steal, out);
> 	steal += current->sched_info.run_delay -
> 		vcpu->arch.st.last_steal;
>+	steal += suspend_duration;
> 	vcpu->arch.st.last_steal = current->sched_info.run_delay;
> 	unsafe_put_user(steal, &st->steal, out);
> 
>@@ -6920,6 +6927,7 @@ static int kvm_arch_suspend_notifier(struct kvm *kvm)
> 
> 	mutex_lock(&kvm->lock);
> 	kvm_for_each_vcpu(i, vcpu, kvm) {
>+		WRITE_ONCE(vcpu->suspended, 1);
> 		if (!vcpu->arch.pv_time.active)
> 			continue;
> 
>@@ -6932,15 +6940,28 @@ static int kvm_arch_suspend_notifier(struct kvm *kvm)
> 	}
> 	mutex_unlock(&kvm->lock);
> 
>+	kvm->suspended_time = ktime_get_boottime_ns();
>+
> 	return ret ? NOTIFY_BAD : NOTIFY_DONE;
> }
> 
>+static int
>+kvm_arch_resume_notifier(struct kvm *kvm)
>+{
>+	kvm->last_suspend_duration = ktime_get_boottime_ns() -
>+		kvm->suspended_time;

Is it possible that a vCPU doesn't get any chance to run (i.e., update
steal time) between two suspends? In that case, only the second suspend
would be recorded.

Maybe we need infrastructure in the PM subsystem to record accumulated
suspend time. When updating steal time, KVM could add the additional
suspend time since the last update into steal_time (similar to how KVM
handles current->sched_info.run_delay). That way, the scenario mentioned
above wouldn't be a problem, and KVM wouldn't need to calculate the
suspend duration for each guest. (A rough sketch follows at the end of
this mail.)

This approach could also benefit RISC-V and ARM, since they have the
same steal_time logic as x86.

Additionally, it seems that if a guest migrates to another system after
a suspend and before updating steal time, the suspend time is lost
during migration. I'm not sure whether this is a practical issue.
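
For illustration, an untested sketch of the accumulated-suspend-time
idea above. It relies on the fact that CLOCK_BOOTTIME keeps advancing
across suspend while CLOCK_MONOTONIC does not, so their difference is
the total time the host has spent suspended; a counter exported by the
PM core would work just as well. The last_suspend_ns field is made up
for illustration and would need to be added next to st.last_steal:

static u64 vcpu_suspend_delta_ns(struct kvm_vcpu *vcpu)
{
        /*
         * Total host suspend time since boot: CLOCK_BOOTTIME includes
         * time spent suspended, CLOCK_MONOTONIC does not.
         */
        u64 total = ktime_get_boottime_ns() - ktime_get_ns();
        /* Hypothetical per-vCPU field, analogous to st.last_steal. */
        u64 delta = total - vcpu->arch.st.last_suspend_ns;

        vcpu->arch.st.last_suspend_ns = total;
        return delta;
}

record_steal_time() could then accumulate it next to run_delay:

        steal += current->sched_info.run_delay -
                vcpu->arch.st.last_steal;
        steal += vcpu_suspend_delta_ns(vcpu);

so no suspend is missed, no matter how many suspend/resume cycles
happen between two steal time updates.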