Hi,

On 04/01/2024 14:42, Friedrich Weber wrote:
> f47e5bbb ("KVM: x86/mmu: Zap only TDP MMU leafs in zap range and
> mmu_notifier unmap")

As this commit mentions tdp_mmu_zap_leafs, I ran the reproducer again on
610a9b8f (6.7-rc8) while tracing >500ms calls to that function (printing a
stack trace) and to task_numa_work, via a bpftrace script [1].

Again, there are several invocations of task_numa_work that take >500ms
(the VM appears to be frozen during that time). For instance, one
invocation that took ~20 seconds:

[1704971602] task_numa_work (tid=52035) took 19995 ms

For this particular thread, in the 20 seconds before that there were 8
invocations of tdp_mmu_zap_leafs that took >500ms each:

[1704971584] tdp_mmu_zap_leafs (tid=52035) took 2291 ms
[1704971586] tdp_mmu_zap_leafs (tid=52035) took 2343 ms
[1704971589] tdp_mmu_zap_leafs (tid=52035) took 2316 ms
[1704971590] tdp_mmu_zap_leafs (tid=52035) took 1663 ms
[1704971591] tdp_mmu_zap_leafs (tid=52035) took 682 ms
[1704971594] tdp_mmu_zap_leafs (tid=52035) took 2706 ms
[1704971597] tdp_mmu_zap_leafs (tid=52035) took 3132 ms
[1704971602] tdp_mmu_zap_leafs (tid=52035) took 4846 ms

They roughly sum up to 20s. The stack trace is the same for all of them:

        bpf_prog_5ca52691cb9e9fbd_tdp_mmu_zap_lea+345
        bpf_prog_5ca52691cb9e9fbd_tdp_mmu_zap_lea+345
        bpf_trampoline_380104735946+208
        tdp_mmu_zap_leafs+5
        kvm_unmap_gfn_range+347
        kvm_mmu_notifier_invalidate_range_start+394
        __mmu_notifier_invalidate_range_start+156
        change_protection+3908
        change_prot_numa+105
        task_numa_work+1029
        bpf_trampoline_6442457341+117
        task_numa_work+9
        xfer_to_guest_mode_handle_work+261
        kvm_arch_vcpu_ioctl_run+1553
        kvm_vcpu_ioctl+667
        __x64_sys_ioctl+164
        do_syscall_64+96
        entry_SYSCALL_64_after_hwframe+110

AFAICT this pattern repeats several times. I uploaded the last 150 KB of
the bpftrace output to [2]. I can provide the full output if needed.

To me, not knowing much about the KVM/KSM/NUMA balancer interplay, this
looks like task_numa_work triggers several invocations of
tdp_mmu_zap_leafs, each of which takes an unusually long time. If anyone
has a hunch why this might happen, or an idea where to look next, that
would be much appreciated.

Best,

Friedrich

[1]

kfunc:tdp_mmu_zap_leafs {
    @start_zap[tid] = nsecs;
}

kretfunc:tdp_mmu_zap_leafs /@start_zap[tid]/ {
    $diff = nsecs - @start_zap[tid];
    if ($diff > 500000000) { // 500ms
        time("[%s] ");
        printf("tdp_mmu_zap_leafs (tid=%d) took %d ms\n",
            tid, $diff / 1000000);
        print(kstack());
    }
    delete(@start_zap[tid]);
}

kfunc:task_numa_work {
    @start_numa[tid] = nsecs;
}

kretfunc:task_numa_work /@start_numa[tid]/ {
    $diff = nsecs - @start_numa[tid];
    if ($diff > 500000000) { // 500ms
        time("[%s] ");
        printf("task_numa_work (tid=%d) took %d ms\n",
            tid, $diff / 1000000);
    }
    delete(@start_numa[tid]);
}

[2] https://paste.debian.net/1303767/
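
P.S. In case it helps to correlate the two more directly, below is an
untested sketch of a variant of the script in [1] that sums up the time
spent in tdp_mmu_zap_leafs within each task_numa_work invocation (the map
names @numa_start/@zap_start/@zap_ns/@zap_cnt are just placeholders I made
up; the probed functions are the same as above):

// untested sketch: per-invocation aggregation of zap time under task_numa_work
kfunc:task_numa_work {
    @numa_start[tid] = nsecs;
}

kfunc:tdp_mmu_zap_leafs /@numa_start[tid]/ {
    @zap_start[tid] = nsecs;
}

kretfunc:tdp_mmu_zap_leafs /@zap_start[tid]/ {
    @zap_ns[tid] += nsecs - @zap_start[tid];
    @zap_cnt[tid] += 1;
    delete(@zap_start[tid]);
}

kretfunc:task_numa_work /@numa_start[tid]/ {
    $total = nsecs - @numa_start[tid];
    if ($total > 500000000) { // 500ms
        time("[%s] ");
        printf("task_numa_work (tid=%d) took %d ms, of which %d ms in %d tdp_mmu_zap_leafs calls\n",
            tid, $total / 1000000, @zap_ns[tid] / 1000000, @zap_cnt[tid]);
    }
    delete(@numa_start[tid]);
    delete(@zap_ns[tid]);
    delete(@zap_cnt[tid]);
}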