On Thu, 27 Oct 2022 18:44:51 +0100,
Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> 
> On Thu, Oct 27, 2022, Marc Zyngier wrote:
> > On Tue, 25 Oct 2022 18:47:12 +0100, Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > Call us incompetent, but I have zero confidence that KVM will never
> > > unintentionally add a path that invokes mark_page_dirty_in_slot()
> > > without a running vCPU.
> > 
> > Well, maybe it is time that KVM acknowledges there is a purpose to
> > dirtying memory outside of a vcpu context, and that if a write happens
> > in a vcpu context, this vcpu must be explicitly passed down rather
> > than obtained from kvm_get_running_vcpu(). Yes, this requires some
> > heavy surgery.
> 
> Heh, preaching to the choir on this one.
> 
> On Mon, Dec 02, 2019 at 12:10:36PM -0800, Sean Christopherson wrote:
> > IMO, adding kvm_get_running_vcpu() is a hack that is just asking for
> > future abuse, and the vcpu/vm/as_id interactions in
> > mark_page_dirty_in_ring() look extremely fragile.
> 
> I'm all in favor of not using kvm_get_running_vcpu() in this path.
> 
> That said, it's somewhat of an orthogonal issue, as I would still
> want a sanity check in mark_page_dirty_in_slot() that a vCPU is
> provided when there is no dirty bitmap.

If we have a separate context and/or API, then all these checks
become a lot less controversial, and we can start reasoning about
these things. At the moment, this is just a mess.

> > > By completely dropping the rule that KVM must have an active vCPU on
> > > architectures that support ring+bitmap, those types of bugs will go
> > > silently unnoticed, and will manifest as guest data corruption after
> > > live migration.
> > 
> > The elephant in the room is still userspace writing to its view of the
> > guest memory for device emulation. Do they get it right? I doubt it.
> 
> I don't see what that has to do with KVM though. There are many
> things userspace needs to get right; that doesn't mean that KVM
> shouldn't strive to provide safeguards for the functionality that
> KVM provides.

I guess we have different expectations of what KVM should provide. My
take is that userspace doesn't need a nanny, and that a decent level
of documentation should make it obvious what feature captures which
state.

But we've argued for a while now, and I don't see that we're getting
any closer to a resolution. So let's at least make some forward
progress with the opt-out mechanism you mentioned, and arm64 will buy
into it when snapshotting the ITS.
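(To make that buy-in concrete, here is a minimal sketch of the arm64
side, assuming the kvm_arch_allow_write_without_running_vcpu() hook
from the diff quoted below; the flag name and the ITS save-path
plumbing are purely illustrative, not existing code.)

/* Sketch only: the flag and its placement in kvm->arch are assumed. */
bool kvm_arch_allow_write_without_running_vcpu(struct kvm *kvm)
{
	/* Only the ITS table save path dirties guest memory without a vCPU. */
	return kvm->arch.vgic.save_its_tables_in_progress;
}

static int vgic_its_snapshot_tables(struct kvm *kvm, struct vgic_its *its)
{
	int ret;

	/*
	 * The ITS save ioctl already runs with all vCPUs locked out, so
	 * flipping the flag cannot race against a vCPU-context write.
	 */
	kvm->arch.vgic.save_its_tables_in_progress = true;
	ret = vgic_its_save_tables_v0(its);
	kvm->arch.vgic.save_its_tables_in_progress = false;

	return ret;
}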
> > > And ideally such bugs would be detected without relying on userspace
> > > to enable dirty logging, e.g. the Hyper-V bug lurked for quite some
> > > time and was only found when mark_page_dirty_in_slot() started
> > > WARNing.
> > > 
> > > I'm ok if arm64 wants to let userspace shoot itself in the foot with
> > > the ITS, but I'm not ok dropping the protections in the common
> > > mark_page_dirty_in_slot().
> > > 
> > > One somewhat gross idea would be to let architectures override the
> > > "there must be a running vCPU" rule, e.g. arm64 could toggle a flag
> > > in kvm->arch in its kvm_write_guest_lock() to note that an expected
> > > write without a vCPU is in-progress:
> > > 
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 8c5c69ba47a7..d1da8914f749 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -3297,7 +3297,10 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
> > >  	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
> > >  
> > >  #ifdef CONFIG_HAVE_KVM_DIRTY_RING
> > > -	if (WARN_ON_ONCE(!vcpu) || WARN_ON_ONCE(vcpu->kvm != kvm))
> > > +	if (!kvm_arch_allow_write_without_running_vcpu(kvm) && WARN_ON_ONCE(!vcpu))
> > > +		return;
> > > +
> > > +	if (WARN_ON_ONCE(vcpu && vcpu->kvm != kvm))
> > >  		return;
> > >  #endif
> > > 
> > > @@ -3305,10 +3308,10 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
> > >  		unsigned long rel_gfn = gfn - memslot->base_gfn;
> > >  		u32 slot = (memslot->as_id << 16) | memslot->id;
> > > 
> > > -		if (kvm->dirty_ring_size)
> > > +		if (kvm->dirty_ring_size && vcpu)
> > >  			kvm_dirty_ring_push(&vcpu->dirty_ring,
> > >  					    slot, rel_gfn);
> > > -		else
> > > +		else if (memslot->dirty_bitmap)
> > >  			set_bit_le(rel_gfn, memslot->dirty_bitmap);
> > >  	}
> > > }
> > 
> > I think this is equally wrong. Writes occur from both CPUs and devices
> > *concurrently*, and I don't see why KVM should keep ignoring this
> > pretty obvious fact.
> > 
> > Yes, your patch papers over the problem, and it can probably work if
> > the kvm->arch flag only gets set in the ITS saving code, which is
> > already exclusive of vcpus running.
> > 
> > But in the long run, with dirty bits being collected from the IOMMU
> > page tables or directly from devices, we will need a way to reconcile
> > the dirty tracking. The above doesn't quite cut it, unfortunately.
> 
> Oooh, are you referring to IOMMU page tables and devices _in the
> guest_? E.g. if KVM itself were to emulate a vIOMMU, then KVM would
> be responsible for updating dirty bits in the vIOMMU page tables.

No. I'm talking about the *physical* IOMMU, which is (with the
correct architecture revision and feature set) capable of providing
its own set of dirty bits, on a per-device, per-PTE basis. Once we
enable that, we'll need to be able to sink these bits into the bitmap
and provide a unified view of the dirty state to userspace.

> Not that it really matters, but do we actually expect KVM to ever
> emulate a vIOMMU? On x86 at least, in-kernel acceleration of vIOMMU
> emulation seems more like VFIO territory.

I don't expect KVM/arm64 to fully emulate an IOMMU, but at least to
eventually provide the required filtering to enable a stage-1 SMMU to
be passed to a guest. This is the sort of thing pKVM needs to
implement for the host anyway, and going the extra mile to support
arbitrary guests outside of the pKVM context isn't much more work.

> Regardless, I don't think the above idea makes it any more difficult
> to support in-KVM emulation of non-CPU stuff, which IIUC is the ITS
> case. I 100% agree that the above is a hack, but that's largely due
> to the use of kvm_get_running_vcpu().

That, I agree with.

> A slightly different alternative would be to have a completely
> separate API for writing guest memory without an associated vCPU,
> i.e. start building up proper device emulation support. Then the
> vCPU-based APIs could yell if a vCPU isn't provided (or there is no
> running vCPU in the current mess). And the device-based API could be
> provided if and only if the architecture actually supports emulating
> writes from devices, i.e. x86 would not opt in and so wouldn't even
> have access to the API.

Which is what I was putting under the "major surgery" label in my
previous email.

Anyhow, for the purpose of unblocking Gavin's series, I suggest
adopting your per-arch opt-out suggestion as a stop-gap measure, and
we will then be able to bike-shed for weeks on what the shape of the
device-originated memory dirtying API should be.
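(As a strawman for that bike-shed, one possible shape: device writes
carry an explicit context instead of relying on kvm_get_running_vcpu(),
and their dirty state sinks straight into the bitmap, never into a
vCPU ring. Every name below is invented for illustration; none of this
exists in KVM today.)

struct kvm_dev_write_ctx {
	struct kvm *kvm;
	/* later: device identity, a per-device dirty ring, ... */
};

/* Device-originated dirtying bypasses the vCPU rings entirely. */
static void kvm_dev_mark_page_dirty(struct kvm_dev_write_ctx *ctx, gfn_t gfn)
{
	struct kvm_memory_slot *memslot = gfn_to_memslot(ctx->kvm, gfn);

	if (memslot && memslot->dirty_bitmap)
		set_bit_le(gfn - memslot->base_gfn, memslot->dirty_bitmap);
}

/*
 * Only visible on architectures that opt in, e.g. behind a
 * HAVE_KVM_DEVICE_WRITES Kconfig symbol, so that x86 never even sees
 * the API.
 */
int kvm_dev_write_guest(struct kvm_dev_write_ctx *ctx, gpa_t gpa,
			const void *data, unsigned long len);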
	M.

-- 
Without deviation from the norm, progress is not possible.