Hi! I haven't tested this and might be missing something, but I think the MMU notifier handling for KVM_GUEST_USES_PFN pfncaches is currently a bit broken. Except that nothing seems to actually use KVM_GUEST_USES_PFN, so in practice it's not actually a problem right now?

gfn_to_pfn_cache_invalidate_start() contains the following:

        /*
         * If the OOM reaper is active, then all vCPUs should have
         * been stopped already, so perform the request without
         * KVM_REQUEST_WAIT and be sad if any needed to be IPI'd.
         */
        if (!may_block)
                req &= ~KVM_REQUEST_WAIT;

        called = kvm_make_vcpus_request_mask(kvm, req, vcpu_bitmap);

        WARN_ON_ONCE(called && !may_block);

The comment explains that we rely on OOM reaping only happening when a process is sufficiently far into being stopped that it is no longer executing vCPUs; but from what I can tell, that's not what the caller actually guarantees. In particular, on the path from the process_mrelease() syscall (if we're dealing with a process whose mm is not shared with other processes), we only check that the target process has SIGNAL_GROUP_EXIT set. From what I can tell, that does imply that delivery of a fatal signal has begun, but it doesn't even imply that the CPU running the target process has been IPI'd yet, let alone that the target process has died or anything like that.

But I also don't see any reason why gfn_to_pfn_cache_invalidate_start() actually has to do anything special for non-blocking invalidation: from what I can tell, nothing in there can block, and basically everything runs with preemption disabled. The first half of the function holds a spinlock; the second half is basically a call to kvm_make_vcpus_request_mask(), which disables preemption across the whole function with get_cpu()/put_cpu(). A synchronous IPI spins until the IPI has been acked, but that doesn't count as sleeping. (And the rest of the OOM reaping code will do stuff like synchronous IPIs for its TLB flushes, too.) See the kick-path excerpt at the end of this mail.

So maybe you/I can just rip out the special-casing of non-blocking mode from gfn_to_pfn_cache_invalidate_start() to fix this? (Rough sketch at the end of this mail.)

Relevant call paths for the theoretical race:

sys_kill
  prepare_kill_siginfo
  kill_something_info
    kill_proc_info
      rcu_read_lock
      kill_pid_info
        rcu_read_lock
        group_send_sig_info [PIDTYPE_TGID]
          do_send_sig_info
            lock_task_sighand [task->sighand->siglock]
            send_signal_locked
              __send_signal_locked
                prepare_signal
                legacy_queue
                signalfd_notify
                sigaddset(&pending->signal, sig)
                complete_signal
                  signal->flags = SIGNAL_GROUP_EXIT [mrelease will work starting here]
                  for each thread:
                    sigaddset(&t->pending.signal, SIGKILL)
                    signal_wake_up [IPI happens here]
            unlock_task_sighand [task->sighand->siglock]
        rcu_read_unlock
      rcu_read_unlock

sys_process_mrelease
  find_lock_task_mm
    spin_lock(&p->alloc_lock)
  task_will_free_mem
    SIGNAL_GROUP_EXIT suffices
    PF_EXITING suffices if singlethreaded?
  task_unlock
  mmap_read_lock_killable
  __oom_reap_task_mm
    for each private non-PFNMAP/MIXED VMA:
      tlb_gather_mmu
      mmu_notifier_invalidate_range_start_nonblock
        __mmu_notifier_invalidate_range_start
          mn_hlist_invalidate_range_start
            kvm_mmu_notifier_invalidate_range_start [as ops->invalidate_range_start]
              gfn_to_pfn_cache_invalidate_start
                [loop over gfn_to_pfn_cache instances]
                  if overlap and KVM_GUEST_USES_PFN [UNUSED]:
                    evict_vcpus = true
                [if evict_vcpus]
                  kvm_make_vcpus_request_mask
              __kvm_handle_hva_range
      unmap_page_range
      mmu_notifier_invalidate_range_end
      tlb_finish_mmu
  mmap_read_unlock
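
For reference, the kick path I mentioned above: kvm_make_vcpus_request_mask() does its kicking via kvm_kick_many_cpus() (virt/kvm/kvm_main.c), with the wait argument derived from whether KVM_REQUEST_WAIT is set in the request. Quoting from a current tree from memory, so this might be slightly off:

        static inline bool kvm_kick_many_cpus(struct cpumask *cpus, bool wait)
        {
                if (cpumask_empty(cpus))
                        return false;

                /*
                 * ack_kick() is an empty callback; with wait==true,
                 * smp_call_function_many() spins until all IPIs have
                 * been acked, it never sleeps.
                 */
                smp_call_function_many(cpus, ack_kick, NULL, wait);
                return true;
        }

So as far as I can tell, keeping KVM_REQUEST_WAIT set in non-blocking context wouldn't introduce any sleeping.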
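
And the fix I have in mind is basically just deleting the special case from the snippet quoted at the top, something like this (completely untested sketch against gfn_to_pfn_cache_invalidate_start() in virt/kvm/pfncache.c; the then-unused "called" local, and probably the may_block parameter, could go away too):

        -       /*
        -        * If the OOM reaper is active, then all vCPUs should have
        -        * been stopped already, so perform the request without
        -        * KVM_REQUEST_WAIT and be sad if any needed to be IPI'd.
        -        */
        -       if (!may_block)
        -               req &= ~KVM_REQUEST_WAIT;
        -
        -       called = kvm_make_vcpus_request_mask(kvm, req, vcpu_bitmap);
        -
        -       WARN_ON_ONCE(called && !may_block);
        +       /* Nothing here can sleep, so it's fine to always wait for the IPIs. */
        +       kvm_make_vcpus_request_mask(kvm, req, vcpu_bitmap);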