On 12/11/21 02:34, David Matlack wrote:
> The stacks help, thanks for including them. It seems like a race
> during do_exit teardown. One thing I notice is that
> do_exit->mmput->kvm_mmu_zap_all can interleave with
> kvm_vcpu_release->kvm_tdp_mmu_put_root (full call chains omitted),
> since the former path allows yielding. But I don't yet see how that could
> lead to any issues, let alone cause us to encounter a PFN in the EPT
> with a zero refcount.

Can it? The call chains are

    zap_gfn_range+2229
    kvm_tdp_mmu_put_root+465
    kvm_mmu_free_roots+629
    kvm_mmu_unload+28
    kvm_arch_destroy_vm+510
    kvm_put_kvm+1017
    kvm_vcpu_release+78
    __fput+516
    task_work_run+206
    do_exit+2615
    do_group_exit+236

and

    zap_gfn_range+2229
    __kvm_tdp_mmu_zap_gfn_range+162
    kvm_tdp_mmu_zap_all+34
    kvm_mmu_zap_all+518
    kvm_mmu_notifier_release+83
    __mmu_notifier_release+420
    exit_mmap+965
    mmput+167
    do_exit+2482
    do_group_exit+236

but there can be no parallelism or interleaving here, because the call
to kvm_vcpu_release() is scheduled in exit_files() (and performed in
exit_task_work()). That comes after exit_mm(), where mmput() is called.
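
To make the ordering argument concrete, here is a toy userspace model of it. This is only a sketch and not kernel code: every model_* name is made up for illustration, the single pending_work slot stands in for the task_work list, and the only thing it tries to show is that the file release queued by exit_files() cannot run before exit_mm() has returned.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy userspace model, NOT kernel code; all model_* names are hypothetical. */

#define MAX_EVENTS 8

static const char *events[MAX_EVENTS];
static size_t nr_events;

typedef void (*work_fn)(void);
static work_fn pending_work;	/* stands in for the task_work list */

static void record(const char *what)
{
	assert(nr_events < MAX_EVENTS);
	events[nr_events++] = what;
}

/* Models mmput() -> exit_mmap() -> notifier release -> kvm_mmu_zap_all(). */
static void model_exit_mm(void)
{
	record("kvm_mmu_zap_all");
}

/* Models __fput() -> kvm_vcpu_release() -> ... -> kvm_tdp_mmu_put_root(). */
static void model_fput_work(void)
{
	record("kvm_tdp_mmu_put_root");
}

/* exit_files() only queues the release; it does not run it. */
static void model_exit_files(void)
{
	pending_work = model_fput_work;
}

/* exit_task_work() runs whatever was queued earlier. */
static void model_exit_task_work(void)
{
	if (pending_work) {
		work_fn fn = pending_work;

		pending_work = NULL;
		fn();
	}
}

/* Same relative order as described above: mm first, files next, work last. */
static void model_do_exit(void)
{
	nr_events = 0;
	model_exit_mm();
	model_exit_files();
	model_exit_task_work();
}

/* Returns the position of an event in the log, or -1 if it never happened. */
static int event_index(const char *what)
{
	for (size_t i = 0; i < nr_events; i++)
		if (strcmp(events[i], what) == 0)
			return (int)i;
	return -1;
}
```

After model_do_exit(), event_index("kvm_mmu_zap_all") is lower than event_index("kvm_tdp_mmu_put_root"), which is the strict ordering the two stacks imply for a single exiting task.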
Even if the two could interleave, they go through the same zap_gfn_range
path. That path takes the lock for write and only yields on the 512
top-level page structures. Anything below is handled by
tdp_mmu_set_spte (with mutual recursion between handle_changed_spte
and handle_removed_tdp_mmu_page), and there are no yields on that path.
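
As a rough illustration of that shape (again a toy model, not the kernel's code): the loop over the 512 top-level entries is the only place a yield is considered, and the recursive teardown below each entry runs to completion. The model_* names, the child fanout of 4 and the depth of 4 levels are assumptions chosen to keep the example small.

```c
#include <assert.h>

/* Toy model, NOT kernel code; names and sizes below are hypothetical. */

#define MODEL_TOP_ENTRIES	512	/* entries in the top-level structure */
#define MODEL_CHILD_FANOUT	4	/* kept small so the model runs quickly */
#define MODEL_LEVELS		4	/* top level plus three levels below it */

static int teardown_depth;		/* > 0 while inside a subtree teardown */
static unsigned long yields_at_top;	/* yield points between top-level entries */
static unsigned long yields_in_subtree;	/* nonzero only if the recursion yielded */
static unsigned long sptes_cleared;

/* Stand-in for a "may yield here" check; it just records where it was hit. */
static void model_cond_resched(void)
{
	if (teardown_depth > 0)
		yields_in_subtree++;
	else
		yields_at_top++;
}

/* Clears every entry of a child page, recursing into lower levels, no yield. */
static void model_remove_subtree(int level)
{
	assert(level >= 1);

	teardown_depth++;
	for (int i = 0; i < MODEL_CHILD_FANOUT; i++) {
		sptes_cleared++;
		if (level > 1)
			model_remove_subtree(level - 1);
	}
	teardown_depth--;
}

/* Walks the top-level entries; this loop is the only yield point. */
static void model_zap_root(void)
{
	for (int i = 0; i < MODEL_TOP_ENTRIES; i++) {
		model_cond_resched();
		sptes_cleared++;	/* clear the top-level entry itself */
		model_remove_subtree(MODEL_LEVELS - 1);
	}
}
```

Running model_zap_root() leaves yields_in_subtree at zero: a subtree hanging off a top-level entry is either untouched or fully torn down whenever the walker gives up the CPU.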
Paolo