Hi! I haven't tested this and might be missing something, but I think the MMU notifier handling for KVM_GUEST_USES_PFN pfncaches is currently a bit broken. Except that nothing seems to actually use KVM_GUEST_USES_PFN, so in practice it's not actually a problem right now?

gfn_to_pfn_cache_invalidate_start() contains the following:

        /*
         * If the OOM reaper is active, then all vCPUs should have
         * been stopped already, so perform the request without
         * KVM_REQUEST_WAIT and be sad if any needed to be IPI'd.
         */
        if (!may_block)
                req &= ~KVM_REQUEST_WAIT;

        called = kvm_make_vcpus_request_mask(kvm, req, vcpu_bitmap);

        WARN_ON_ONCE(called && !may_block);

The comment explains that we rely on OOM reaping only happening when a process is sufficiently far into being stopped that it is no longer executing vCPUs; but from what I can tell, that's not what the caller actually guarantees. In particular, on the path from the process_mrelease() syscall (if we're dealing with a process whose mm is not shared with other processes), we only check that the target process has SIGNAL_GROUP_EXIT set. From what I can tell, that does imply that delivery of a fatal signal has begun, but it doesn't even imply that the CPU running the target process has been IPI'd yet, let alone that the target process has died or anything like that.

But I also don't see any reason why gfn_to_pfn_cache_invalidate_start() actually has to do anything special for non-blocking invalidation: from what I can tell, nothing in there can block, and basically everything runs with preemption disabled. The first half of the function holds a spinlock; the second half is basically a call to kvm_make_vcpus_request_mask(), which disables preemption across the whole function with get_cpu()/put_cpu(). A synchronous IPI spins until the IPI has been acked, but that doesn't count as sleeping. (And the rest of the OOM reaping code will do stuff like synchronous IPIs for its TLB flushes, too.) See the kick-path excerpt at the end of this mail.

So maybe you/I can just rip out the special-casing of non-blocking mode from gfn_to_pfn_cache_invalidate_start() to fix this? (Rough sketch at the end of this mail.)

Relevant call paths for the theoretical race:

sys_kill
  prepare_kill_siginfo
  kill_something_info
    kill_proc_info
      rcu_read_lock
      kill_pid_info
        rcu_read_lock
        group_send_sig_info [PIDTYPE_TGID]
          do_send_sig_info
            lock_task_sighand [task->sighand->siglock]
            send_signal_locked
              __send_signal_locked
                prepare_signal
                legacy_queue
                signalfd_notify
                sigaddset(&pending->signal, sig)
                complete_signal
                  signal->flags = SIGNAL_GROUP_EXIT [mrelease will work starting here]
                  for each thread:
                    sigaddset(&t->pending.signal, SIGKILL)
                    signal_wake_up [IPI happens here]
            unlock_task_sighand [task->sighand->siglock]
        rcu_read_unlock
      rcu_read_unlock

sys_process_mrelease
  find_lock_task_mm
    spin_lock(&p->alloc_lock)
  task_will_free_mem
    SIGNAL_GROUP_EXIT suffices
    PF_EXITING suffices if singlethreaded?
  task_unlock
  mmap_read_lock_killable
  __oom_reap_task_mm
    for each private non-PFNMAP/MIXED VMA:
      tlb_gather_mmu
      mmu_notifier_invalidate_range_start_nonblock
        __mmu_notifier_invalidate_range_start
          mn_hlist_invalidate_range_start
            kvm_mmu_notifier_invalidate_range_start [as ops->invalidate_range_start]
              gfn_to_pfn_cache_invalidate_start
                [loop over gfn_to_pfn_cache instances]
                  if overlap and KVM_GUEST_USES_PFN [UNUSED]:
                    evict_vcpus = true
                [if evict_vcpus]
                  kvm_make_vcpus_request_mask
              __kvm_handle_hva_range
      unmap_page_range
      mmu_notifier_invalidate_range_end
      tlb_finish_mmu
  mmap_read_unlock
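
For reference, the kick path I mentioned above: kvm_make_vcpus_request_mask() does its kicking via kvm_kick_many_cpus() (virt/kvm/kvm_main.c), with the wait argument derived from whether KVM_REQUEST_WAIT is set in the request. Quoting from a current tree from memory, so this might be slightly off:

        static inline bool kvm_kick_many_cpus(struct cpumask *cpus, bool wait)
        {
                if (cpumask_empty(cpus))
                        return false;

                /*
                 * ack_kick() is an empty callback; with wait==true,
                 * smp_call_function_many() spins until all IPIs have
                 * been acked, it never sleeps.
                 */
                smp_call_function_many(cpus, ack_kick, NULL, wait);
                return true;
        }

So as far as I can tell, keeping KVM_REQUEST_WAIT set in non-blocking context wouldn't introduce any sleeping.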
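
And the fix I have in mind is basically just deleting the special case from the snippet quoted at the top, something like this (completely untested sketch against gfn_to_pfn_cache_invalidate_start() in virt/kvm/pfncache.c; the then-unused "called" local, and probably the may_block parameter, could go away too):

        -       /*
        -        * If the OOM reaper is active, then all vCPUs should have
        -        * been stopped already, so perform the request without
        -        * KVM_REQUEST_WAIT and be sad if any needed to be IPI'd.
        -        */
        -       if (!may_block)
        -               req &= ~KVM_REQUEST_WAIT;
        -
        -       called = kvm_make_vcpus_request_mask(kvm, req, vcpu_bitmap);
        -
        -       WARN_ON_ONCE(called && !may_block);
        +       /* Nothing here can sleep, so it's fine to always wait for the IPIs. */
        +       kvm_make_vcpus_request_mask(kvm, req, vcpu_bitmap);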