On Thu, 2021-11-18 at 13:04 +0100, Paolo Bonzini wrote:
> On 11/17/21 22:09, David Woodhouse wrote:
> > >   {
> > > -	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
> > > +	struct kvm_vcpu *running_vcpu = kvm_get_running_vcpu();
> > >
> > > +	WARN_ON_ONCE(vcpu && vcpu != running_vcpu);
> > >  	WARN_ON_ONCE(vcpu->kvm != kvm);
> >
> > Ah, that one needs to be changed to check running_vcpu instead. Or this
> > needs to go first:
> >
> > I think I prefer making the vCPU a required argument. If anyone's going to
> > pull a vCPU pointer out of their posterior, let the caller do it.
>
> I understand that feeling, but still using the running vCPU is by far
> the common case, and it's not worth adding a new function parameter to
> all call sites.
>
> What about using a separate function, possibly __-prefixed, for the case
> where you have a very specific vCPU?

We could certainly do a kvm_vcpu_mark_page_dirty_in_slot() and reduce
the collateral damage; I think this patch wouldn't need to touch
anything outside virt/kvm/ if we did that. (There's a rough sketch of
what I mean at the end of this mail.)

But bikeshedding aside, it occurs to me that I have a more substantial
concern about this approach: are we even *allowed* to touch the dirty
ring of another vCPU? It seems to be based on the assumption that the
ring is entirely a per-cpu structure on the kernel side. Wouldn't we
need a spinlock in kvm_dirty_ring_push() at the very least?

At this point I'm somewhat tempted to remove the mark_page_dirty()
call from gfn_to_pfn_cache_invalidate_start() and move it to the tail
of kvm_gfn_to_pfn_cache_refresh() instead, where it unpins and
memunmaps the page. (We'd still tell the *kernel* by calling
kvm_set_pfn_dirty() immediately in the invalidation; we'd just defer
the KVM mark_page_dirty() so that it happens in the context of an
actual vCPU.)

If I do that, I can drop this 'propagate vcpu' patch completely.

I'm already happy enough that I'll fix the Xen shared_info page by
*never* bothering to dirty_log for it, and I can wash my hands of that
problem... if I stop typing now.

But that still leaves the mark_page_dirty_in_slot() call in the TDP
MMU's handle_changed_spte_dirty_log(), which AFAICT can trigger the
same crash seen by butt3rflyh4ck: can't that one also be reached from
a thread where kvm_get_running_vcpu() is NULL? For that one I'm not
sure.
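
For the sake of argument, the split I have in mind looks something
like the below. It's only a rough sketch, typed from memory against
mark_page_dirty_in_slot() in virt/kvm/kvm_main.c and not even
compile-tested, so treat the helper's name and the exact dirty-ring
checks as placeholders rather than a concrete proposal:

/*
 * Rough sketch, not compile-tested.
 *
 * Explicit-vCPU variant: the caller says which vCPU's dirty ring (if
 * any) the page should be accounted to. vcpu may be NULL when only
 * the dirty bitmap is in use.
 */
void kvm_vcpu_mark_page_dirty_in_slot(struct kvm_vcpu *vcpu, struct kvm *kvm,
				      struct kvm_memory_slot *memslot,
				      gfn_t gfn)
{
	if (WARN_ON_ONCE(vcpu && vcpu->kvm != kvm))
		return;

	if (memslot && (memslot->flags & KVM_MEM_LOG_DIRTY_PAGES)) {
		unsigned long rel_gfn = gfn - memslot->base_gfn;
		u32 slot = (memslot->as_id << 16) | memslot->id;

		if (kvm->dirty_ring_size && !WARN_ON_ONCE(!vcpu))
			kvm_dirty_ring_push(&vcpu->dirty_ring, slot, rel_gfn);
		else if (memslot->dirty_bitmap)
			set_bit_le(rel_gfn, memslot->dirty_bitmap);
	}
}

/* Existing callers keep this and implicitly use the running vCPU. */
void mark_page_dirty_in_slot(struct kvm *kvm,
			     struct kvm_memory_slot *memslot, gfn_t gfn)
{
	kvm_vcpu_mark_page_dirty_in_slot(kvm_get_running_vcpu(), kvm,
					 memslot, gfn);
}

That way only the pfncache code (and anything else which genuinely
knows its target vCPU) would call the new function, and nothing
outside virt/kvm/ would have to change. It still doesn't answer the
question of whether pushing to another vCPU's ring is safe, of course;
the sketch just makes the choice of vCPU explicit.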