On Tue, Oct 24, 2023, Xu Yilun wrote: > On Wed, Oct 18, 2023 at 01:46:23PM -0700, Sean Christopherson wrote: > > @@ -126,7 +124,19 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu) > > list_first_entry(&vcpu->async_pf.done, > > typeof(*work), link); > > list_del(&work->link); > > + > > + spin_unlock(&vcpu->async_pf.lock); > > + > > + /* > > + * The async #PF is "done", but KVM must wait for the work item > > + * itself, i.e. async_pf_execute(), to run to completion. If > > + * KVM is a module, KVM must ensure *no* code owned by the KVM > > + * (the module) can be run after the last call to module_put(), > > + * i.e. after the last reference to the last vCPU's file is put. > > + */ > > + flush_work(&work->work); > > I see the flush_work() is inside the check: > > while (!list_empty(&vcpu->async_pf.done)) > > Is it possible all async_pf are already completed but the work item, > i.e. async_pf_execute, is not completed before this check? That the > work is scheduled out after kvm_arch_async_page_present_queued() and > all APF_READY requests have been handled. In this case the work > synchronization will be skipped... Good gravy. Yes, I assumed KVM wouldn't be so crazy to delete the work before it completed, but I obviously didn't see this comment in async_pf_execute(): /* * apf may be freed by kvm_check_async_pf_completion() after * this point */ The most straightforward fix I see is to also flush the work in kvm_check_async_pf_completion(), and then delete the comment. The downside is that there's a small chance a vCPU could be delayed waiting for the work to complete, but that's a very, very small chance, and likely a very small delay. kvm_arch_async_page_present_queued() unconditionaly makes a new request, i.e. will effectively delay entering the guest, so the remaining work is really just: trace_kvm_async_pf_completed(addr, cr2_or_gpa); __kvm_vcpu_wake_up(vcpu); mmput(mm); Since mmput() can't drop the last reference to the page tables if the vCPU is still alive. I think I'll spin off the async #PF fix to a separate series. There's are other tangetially related cleanups that can be done, e.g. there's no reason to pin the page tables while work is queued, async_pf_execute() can do mmget_not_zero() and then bail if the process is dying. Then there's no need to do mmput() when canceling work.