On Fri, Dec 10, 2021 at 3:05 PM Ignat Korchagin <ignat@xxxxxxxxxxxxxx> wrote:
>
> I've been trying to figure out the difference between "good" runs and
> "bad" runs of gvisor. So, if I've been running the following bpftrace
> one-liner:
>
> $ bpftrace -e 'kprobe:kvm_set_pfn_dirty { @[kstack] = count(); }'
>
> while also executing a single:
>
> $ sudo runsc --platform=kvm --network=none do echo ok
>
> So, for "good" runs the stacks are the following:

The stacks help, thanks for including them. It seems like a race during
do_exit teardown. One thing I notice is that
do_exit->mmput->kvm_mmu_zap_all can interleave with
kvm_vcpu_release->kvm_tdp_mmu_put_root (full call chains omitted), since
the former path allows yielding. But I don't yet see how that could lead
to any issues, let alone cause us to encounter a PFN in the EPT with a
zero refcount.

I'll take a closer look next week.

>
> # bpftrace -e 'kprobe:kvm_set_pfn_dirty { @[kstack] = count(); }'
> Attaching 1 probe...
> ^C
>
> @[
>     kvm_set_pfn_dirty+1
>     __handle_changed_spte+2535
>     __tdp_mmu_set_spte+396
>     zap_gfn_range+2229
>     kvm_tdp_mmu_unmap_gfn_range+331
>     kvm_unmap_gfn_range+774
>     kvm_mmu_notifier_invalidate_range_start+743
>     __mmu_notifier_invalidate_range_start+508
>     unmap_vmas+566
>     unmap_region+494
>     __do_munmap+1172
>     __vm_munmap+226
>     __x64_sys_munmap+98
>     do_syscall_64+64
>     entry_SYSCALL_64_after_hwframe+68
> ]: 1
> @[
>     kvm_set_pfn_dirty+1
>     __handle_changed_spte+2535
>     __tdp_mmu_set_spte+396
>     zap_gfn_range+2229
>     kvm_tdp_mmu_unmap_gfn_range+331
>     kvm_unmap_gfn_range+774
>     kvm_mmu_notifier_invalidate_range_start+743
>     __mmu_notifier_invalidate_range_start+508
>     zap_page_range_single+870
>     unmap_mapping_pages+434
>     shmem_fallocate+2518
>     vfs_fallocate+684
>     __x64_sys_fallocate+181
>     do_syscall_64+64
>     entry_SYSCALL_64_after_hwframe+68
> ]: 32
> @[
>     kvm_set_pfn_dirty+1
>     __handle_changed_spte+2535
>     __handle_changed_spte+1746
>     __handle_changed_spte+1746
>     __handle_changed_spte+1746
>     __tdp_mmu_set_spte+396
>     zap_gfn_range+2229
>     __kvm_tdp_mmu_zap_gfn_range+162
>     kvm_tdp_mmu_zap_all+34
>     kvm_mmu_zap_all+518
>     kvm_mmu_notifier_release+83
>     __mmu_notifier_release+420
>     exit_mmap+965
>     mmput+167
>     do_exit+2482
>     do_group_exit+236
>     get_signal+1000
>     arch_do_signal_or_restart+580
>     exit_to_user_mode_prepare+300
>     syscall_exit_to_user_mode+25
>     do_syscall_64+77
>     entry_SYSCALL_64_after_hwframe+68
> ]: 365
>
> For "bad" runs, when I get the warning - I get this:
>
> # bpftrace -e 'kprobe:kvm_set_pfn_dirty { @[kstack] = count(); }'
> Attaching 1 probe...
> ^C
>
> @[
>     kvm_set_pfn_dirty+1
>     __handle_changed_spte+2535
>     __tdp_mmu_set_spte+396
>     zap_gfn_range+2229
>     kvm_tdp_mmu_unmap_gfn_range+331
>     kvm_unmap_gfn_range+774
>     kvm_mmu_notifier_invalidate_range_start+743
>     __mmu_notifier_invalidate_range_start+508
>     unmap_vmas+566
>     unmap_region+494
>     __do_munmap+1172
>     __vm_munmap+226
>     __x64_sys_munmap+98
>     do_syscall_64+64
>     entry_SYSCALL_64_after_hwframe+68
> ]: 1
> @[
>     kvm_set_pfn_dirty+1
>     __handle_changed_spte+2535
>     __handle_changed_spte+1746
>     __handle_changed_spte+1746
>     __handle_changed_spte+1746
>     __tdp_mmu_set_spte+396
>     zap_gfn_range+2229
>     kvm_tdp_mmu_put_root+465
>     mmu_free_root_page+537
>     kvm_mmu_free_roots+629
>     kvm_mmu_unload+28
>     kvm_arch_destroy_vm+510
>     kvm_put_kvm+1017
>     kvm_vcpu_release+78
>     __fput+516
>     task_work_run+206
>     do_exit+2615
>     do_group_exit+236
>     get_signal+1000
>     arch_do_signal_or_restart+580
>     exit_to_user_mode_prepare+300
>     syscall_exit_to_user_mode+25
>     do_syscall_64+77
>     entry_SYSCALL_64_after_hwframe+68
> ]: 2
> @[
>     kvm_set_pfn_dirty+1
>     __handle_changed_spte+2535
>     __tdp_mmu_set_spte+396
>     zap_gfn_range+2229
>     kvm_tdp_mmu_unmap_gfn_range+331
>     kvm_unmap_gfn_range+774
>     kvm_mmu_notifier_invalidate_range_start+743
>     __mmu_notifier_invalidate_range_start+508
>     zap_page_range_single+870
>     unmap_mapping_pages+434
>     shmem_fallocate+2518
>     vfs_fallocate+684
>     __x64_sys_fallocate+181
>     do_syscall_64+64
>     entry_SYSCALL_64_after_hwframe+68
> ]: 32
> @[
>     kvm_set_pfn_dirty+1
>     __handle_changed_spte+2535
>     __handle_changed_spte+1746
>     __handle_changed_spte+1746
>     __handle_changed_spte+1746
>     __tdp_mmu_set_spte+396
>     zap_gfn_range+2229
>     __kvm_tdp_mmu_zap_gfn_range+162
>     kvm_tdp_mmu_zap_all+34
>     kvm_mmu_zap_all+518
>     kvm_mmu_notifier_release+83
>     __mmu_notifier_release+420
>     exit_mmap+965
>     mmput+167
>     do_exit+2482
>     do_group_exit+236
>     get_signal+1000
>     arch_do_signal_or_restart+580
>     exit_to_user_mode_prepare+300
>     syscall_exit_to_user_mode+25
>     do_syscall_64+77
>     entry_SYSCALL_64_after_hwframe+68
> ]: 344
>
> That is, I never get a stack with
> kvm_tdp_mmu_put_root->..->kvm_set_pfn_dirty with a "good" run.
> Perhaps, this may shed some light onto what is going on.
>
> Ignat
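
If it helps to compare captures mechanically rather than by eye, something
like the sketch below could work. It is only an illustration, not a tool
anyone in this thread used: the helper names parse_stacks and
stacks_only_in are made up, and it assumes the bpftrace map output keeps
the "@[ ... ]: N" shape quoted above, with frames written as symbol+offset.
Offsets are dropped so that stacks stay comparable across kernel builds.

```python
import re

# One bpftrace map entry looks like "@[ frame frame ... ]: count".
ENTRY_RE = re.compile(r"@\[(.*?)\]:\s*(\d+)", re.DOTALL)


def parse_stacks(text):
    """Return {tuple_of_symbols: count} parsed from bpftrace kstack output."""
    stacks = {}
    for body, count in ENTRY_RE.findall(text):
        # Ignore mail quote markers and strip the "+offset" suffix per frame.
        frames = tuple(
            tok.split("+")[0]
            for tok in body.replace(">", " ").split()
        )
        # Stacks that differ only in offsets are merged, so sum their counts.
        stacks[frames] = stacks.get(frames, 0) + int(count)
    return stacks


def stacks_only_in(bad, good):
    """Stacks present in the 'bad' capture but absent from the 'good' one."""
    return {stack: n for stack, n in bad.items() if stack not in good}
```

Feeding it the two captures (saved to files, say) and printing
stacks_only_in(parse_stacks(bad_text), parse_stacks(good_text)) should
leave only the call chains unique to the failing run.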