On Wed, 12 Mar 2025 10:51:31 -0400 Peter Xu <peterx@xxxxxxxxxx> wrote:

> This patch should fix a possible userfaultfd release() hang during
> concurrent GUP.
>
> This problem was initially reported by Dimitris Siakavaras in July 2023 [1]
> in a firecracker use case. Firecracker has a separate process handling
> page faults remotely, and when the process releases the userfaultfd it can
> race with a concurrent GUP from KVM trying to fault in a guest page during
> the secondary MMU page fault process.
>
> A similar problem was reported recently again by Jinjiang Tu in March 2025
> [2], even though the race happened this time with a mlockall() operation,
> which does GUP in a similar fashion.
>
> In 2017, commit 656710a60e36 ("userfaultfd: non-cooperative: closing the
> uffd without triggering SIGBUS") was trying to fix this issue. AFAIU, that
> fixes well the fault paths but may not work yet for GUP. In GUP, the issue
> is NOPAGE will be almost treated the same as "page fault resolved" in
> faultin_page(), then the GUP will follow page again, seeing page missing,
> and it'll keep going into a live lock situation as reported.
>
> This change makes core mm return RETRY instead of NOPAGE for both the GUP
> and fault paths, proactively releasing the mmap read lock. This should
> guarantee the other release thread make progress on taking the write lock
> and avoid the live lock even for GUP.
>
> When at it, rearrange the comments to make sure it's uptodate.

It would be good to have a Fixes: target but this bug seems to be so old
that a bare cc:stable should be OK?
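
For anyone following the thread, the live lock described in the quoted
changelog can be illustrated with a small toy model. This is only a sketch
under stated assumptions, not kernel code: ToyMM, toy_fault(),
toy_release_step() and toy_gup() are hypothetical names, the mmap lock is
reduced to a boolean, and the two threads are interleaved deterministically.
The only behaviour taken from the changelog is that NOPAGE is treated like
"fault resolved" with the read lock still held, while RETRY drops the read
lock so the release side can take the write lock.

```python
from dataclasses import dataclass

NOPAGE = "NOPAGE"  # stands in for the old return value on a released uffd
RETRY = "RETRY"    # stands in for the new return value
DONE = "DONE"      # fault resolved normally (toy value, not a kernel code)


@dataclass
class ToyMM:
    """Hypothetical, heavily simplified stand-in for an mm."""
    read_locked: bool = True       # GUP starts out holding the read lock
    uffd_registered: bool = True   # VMA still registered with the released uffd
    page_present: bool = False


def toy_fault(mm: ToyMM, released_result: str) -> str:
    """Fault on a VMA whose uffd context is already being released."""
    if mm.uffd_registered:
        # Nobody will ever resolve this fault through userfaultfd.
        if released_result == RETRY:
            mm.read_locked = False  # RETRY drops the read lock
        return released_result
    # uffd is gone from the VMA, so the fault is handled normally.
    mm.page_present = True
    return DONE


def toy_release_step(mm: ToyMM) -> bool:
    """Release side: needs the write lock to unregister the VMA."""
    if mm.read_locked:
        return False                # blocked behind the reader
    mm.uffd_registered = False
    return True


def toy_gup(released_result: str, max_attempts: int = 1000):
    """Return the attempt on which the page was found, or None if stuck."""
    mm = ToyMM()
    for attempt in range(1, max_attempts + 1):
        if mm.page_present:
            return attempt
        toy_fault(mm, released_result)
        toy_release_step(mm)        # release thread gets a chance to run
        mm.read_locked = True       # GUP holds the read lock for the next round
    return None


if __name__ == "__main__":
    print("NOPAGE:", toy_gup(NOPAGE))
    print("RETRY: ", toy_gup(RETRY))
```

In this model the NOPAGE variant exhausts its attempt budget because the
release step never sees the lock dropped, while the RETRY variant lets the
release step run once and then finds the page.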