On Wed, Aug 14, 2024, Suleiman Souhlal wrote:
> On Wed, Aug 14, 2024 at 2:06 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > > Additionally, it seems that if a guest migrates to another system
> > > > after a suspend and before updating steal time, the suspended time
> > > > is lost during migration. I'm not sure if this is a practical issue.
> > >
> > > The systems where the host suspends don't usually do VM migrations.
> > > Or at least the ones where we're encountering the problem this patch
> > > is trying to address don't (laptops).
> > >
> > > But even if they did, it doesn't seem that likely that the migration
> > > would happen over a host suspend.
> >
> > I think we want to account for this straightaway, or at least have
> > defined and documented behavior, else we risk rehashing the issues with
> > marking a vCPU as preempted when it's loaded but not running, which
> > causes problems for live migration as it results in KVM marking the
> > steal-time page as dirty after vCPUs have been paused [*].
> >
> > [*] https://lkml.kernel.org/r/20240503181734.1467938-4-dmatlack%40google.com
>
> Can you explain how the steal-time page could get marked as dirty after
> VCPUs have been paused? From what I can tell, record_steal_time() gets
> called from vcpu_enter_guest(), which shouldn't happen when the VCPU has
> been paused, but I have to admit I don't really know anything about how
> live migration works.

It's not record_steal_time(), it's kvm_steal_time_set_preempted(). The flag
KVM uses to tell the guest that the vCPU has been scheduled out,
KVM_VCPU_PREEMPTED, resides in the kvm_steal_time structure, i.e. in the
steal-time page.

Userspace "pauses" vCPUs when it enters blackout to complete live migration.
After pausing vCPUs, the VMM invokes various KVM_GET_* ioctls to retrieve
vCPU state so that it can be transferred to the destination.

Without the above series, KVM marks vCPUs as preempted when the associated
task is scheduled out and the vCPU is "loaded", even if the vCPU is not
actively running. This results in KVM writing to kvm_steal_time.preempted
and dirtying the page, after userspace thinks it should be impossible for
KVM to dirty guest memory (because vCPUs are no longer being run).

> The series you linked is addressing an issue when the steal-time page
> gets written to outside of record_steal_time(), but we aren't doing this
> for this proposed patch.

I know. What I am saying is that I don't want to punt on the issue Chao
raised, because _if_ we want to properly account suspend time when a vCPU
is migrated (or saved/restored for any reason) without doing KVM_RUN after
suspend, then that either requires updating the steal-time information
outside of KVM_RUN, or it requires new uAPI to explicitly migrate the
unaccounted suspend time.

Given that new uAPI is generally avoided when possible, that makes updating
steal-time outside of KVM_RUN the default choice (which isn't necessarily
the best choice), which in turn means KVM now has to worry about the above
scenario of writing to guest memory after vCPUs have been paused by
userspace.

> With the proposed approach, the steal time page would get copied to the
> new host and everything would keep working correctly, with the exception
> of a possible host suspend happening between when the migration started
> and when it finishes, not being reflected post-migration. That seems like
> a reasonable compromise.

Maybe, but I'm not keen on sweeping this under the rug. Ignoring issues
because they'll "never" happen has bitten KVM more than once.
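Going back to why the preempted write dirties guest memory: KVM_VCPU_PREEMPTED
is a byte in the steal-time record that the guest shares with the host. Below
is a minimal sketch of the layout and the offending store, written from memory
after arch/x86/include/uapi/asm/kvm_para.h; double-check the header before
relying on the exact layout.

	#include <stdint.h>

	#define KVM_VCPU_PREEMPTED (1 << 0)

	/* 64-byte record that lives in a guest page the guest registers
	 * via MSR_KVM_STEAL_TIME.
	 */
	struct kvm_steal_time {
		uint64_t steal;      /* accumulated steal time, in nanoseconds */
		uint32_t version;    /* seqcount-like; odd => update in progress */
		uint32_t flags;
		uint8_t  preempted;  /* KVM_VCPU_PREEMPTED lands here */
		uint8_t  u8_pad[3];
		uint32_t pad[11];
	};

	/* The problematic part of kvm_steal_time_set_preempted(), reduced
	 * to its essence: a plain store into the guest-owned page, which
	 * dirties that page even though no vCPU is executing.
	 */
	static void mark_preempted(struct kvm_steal_time *st)
	{
		st->preempted = KVM_VCPU_PREEMPTED;
	}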
At the absolute bare minimum, the flaw needs to be documented, with a
suggested workaround provided (do KVM_RUN on all vCPUs before migrating
after a suspend; see the sketch below for why KVM_RUN is the trigger), e.g.
so that userspace can work around the issue in the unlikely scenario where
userspace does suspend+resume, saves/restores a VM, *and* cares about
steal-time.

Even better would be if we can figure out a way to effectively require
KVM_RUN after suspend+resume, but I can't think of a way to do that without
breaking userspace or adding new uAPI, and adding new uAPI for this feels
like overkill.
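For context on why KVM_RUN is the natural (and currently the only) trigger:
the steal-time refresh is request-driven, and the request is serviced only
on the guest-entry path. Heavily abbreviated from arch/x86/kvm/x86.c, so
treat this as pseudocode rather than the literal source:

	/* Scheduling a vCPU (back) in only *queues* a steal-time refresh... */
	void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
	{
		/* ... */
		kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
	}

	/* ...and the request is serviced only when the vCPU is actually run,
	 * so a VM that is saved off after resume without ever doing KVM_RUN
	 * never folds the suspended time into kvm_steal_time.steal.
	 */
	static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
	{
		/* ... */
		if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu))
			record_steal_time(vcpu);
		/* ... */
	}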