On Wed, 2022-01-12 at 11:10 +0800, Peter Xu wrote:
> On Sat, Jan 08, 2022 at 12:26:32AM +0000, Sean Christopherson wrote:
> > +Peter
> >
> > On Wed, Jan 05, 2022, David Woodhouse wrote:
> > > On Thu, 2021-12-23 at 21:24 +0000, Sean Christopherson wrote:
> > > > Commit e880c6ea55b9 ("KVM: x86: hyper-v: Prevent using not-yet-updated TSC page
> > > > by secondary CPUs") is squarely to blame as it was added after dirty ring, though
> > > > in Vitaly's defense, David put it best: "That's a fairly awful bear trap".
> > >
> > > Even with the WARN to keep us honest, this is still awful.
> > >
> > > We have kvm_vcpu_write_guest()... but the vcpu we pass it is ignored
> > > and only vcpu->kvm is used. But you have to be running on a physical
> > > CPU which currently owns *a* vCPU of that KVM, or you might crash.
> > >
> > > There is also kvm_write_guest() which doesn't take a vCPU as an
> > > argument, and looks for all the world like it was designed for you not
> > > to need one... but which still needs there to be a vCPU or it might
> > > crash.
>
> Right, I agree too those are slightly confusing.
>
> > > I think I want to kill the latter, make the former use the vCPU it's
> > > given, add a spinlock to the dirty ring which will be uncontended
> > > anyway in the common case so it shouldn't hurt (?),
> >
> > IIRC, Peter did a fair amount of performance testing and analysis that led to
> > the current behavior.
>
> Yes that comes out from the discussion when working on dirty ring. I can't
> remember the details too, but IIRC what we figured is 99% cases of guest
> dirtying pages are with vcpu context.
>
> The only outlier at that time (at least for x86) was kvm-gt who uses direct kvm
> context but then it turns out that's an illegal use of the kvm API and kvm-gt
> fixed itself instead.
>
> Then we come to the current kvm_get_running_vcpu() approach because all the
> dirty track contexts are within an existing vcpu context. Dirty ring could be
> simplified too because we don't need the default vcpu0 trick, nor do we need an
> extra ring just to record no-vcpu contexts (in very early version of dirty ring
> we do have N+1 rings, where N is the vcpu number, and the extra is for this).
>
> kvm_get_running_vcpu() also has a slight benefit that it guarantees no spinlock
> needed.
>
> > > and then let people use kvm->vcpu[0] when they really need to, with a
> > > documented caveat that when there are *no* vCPUs in a KVM, dirty tracking
> > > might not work. Which is fine, as migration of a KVM that hasn't been fully
> > > set up yet is silly.
> >
> > "when they really need to" can be a slippery slope, using vcpu[0] is also quite
> > gross. Though I 100% agree that always using kvm_get_running_vcpu() is awful.
>
> I agreed none of them looks like an extreme clean approach.
>
> So far I'd hope we can fix the problem with what Sean suggested on postponing
> the page update until when we have the vcpu context, so we keep the assumption
> still true on "having a vcpu context" at least for x86 and that'll be the
> simplest, afaiu.
>
> Or do we have explicit other requirement that needs to dirty guest pages
> without vcpu context at all?

Delivering interrupts may want to do so. That's the one we hit for S390,
and I only avoided it for Xen event channel delivery on x86 by declaring
that the Xen shared info page is exempt from dirty tracking and should
*always* be considered dirty.
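
To make the trap concrete, the dirty-ring path we keep tripping over looks
roughly like this. This is a simplified sketch from memory, so the helper
names (kvm_dirty_ring_get, mark_page_dirty_in_slot, etc.) and details may not
match the current tree exactly:

	/* Sketch only, not the literal code. */
	struct kvm_dirty_ring *kvm_dirty_ring_get(struct kvm *kvm)
	{
		struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

		/* If nothing is in vCPU context, vcpu is NULL and this oopses. */
		WARN_ON_ONCE(vcpu->kvm != kvm);

		return &vcpu->dirty_ring;
	}

	void mark_page_dirty_in_slot(struct kvm *kvm,
				     const struct kvm_memory_slot *memslot,
				     gfn_t gfn)
	{
		if (memslot && kvm_slot_dirty_track_enabled(memslot)) {
			unsigned long rel_gfn = gfn - memslot->base_gfn;
			u32 slot = (memslot->as_id << 16) | memslot->id;

			if (kvm->dirty_ring_size)
				/* Per-vCPU ring, found via the *running* vCPU. */
				kvm_dirty_ring_push(kvm_dirty_ring_get(kvm),
						    slot, rel_gfn);
			else
				/* Bitmap path doesn't need a vCPU at all. */
				set_bit_le(rel_gfn, memslot->dirty_bitmap);
		}
	}

Both kvm_write_guest() and kvm_vcpu_write_guest() funnel into this, which is
why the vCPU argument to the latter makes no difference, and why calling
either without a running vCPU can blow up once the dirty ring is enabled.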