On Sat, 22 Jul 2023 00:51:07 +0200 Jann Horn <jannh@xxxxxxxxxx> wrote:

> mm->mm_lock_seq effectively functions as a read/write lock; therefore it
> must be used with acquire/release semantics.
>
> A specific example is the interaction between userfaultfd_register() and
> lock_vma_under_rcu().
> userfaultfd_register() does the following from the point where it changes
> a VMA's flags to the point where concurrent readers are permitted again
> (in a simple scenario where only a single private VMA is accessed and no
> merging/splitting is involved):
>
> userfaultfd_register
>   userfaultfd_set_vm_flags
>     vm_flags_reset
>       vma_start_write
>         down_write(&vma->vm_lock->lock)
>         vma->vm_lock_seq = mm_lock_seq [marks VMA as busy]
>         up_write(&vma->vm_lock->lock)
>       vm_flags_init
>         [sets VM_UFFD_* in __vm_flags]
>   vma->vm_userfaultfd_ctx.ctx = ctx
>   mmap_write_unlock
>     vma_end_write_all
>       WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1) [unlocks VMA]
>
> There are no memory barriers in between the __vm_flags update and the
> mm->mm_lock_seq update that unlocks the VMA, so the unlock can be
> reordered to above the vm_flags_init() call, which means that from the
> perspective of a concurrent reader, a VMA can be marked as a userfaultfd
> VMA while it is not VMA-locked. That's bad, we definitely need a
> store-release for the unlock operation.
>
> The non-atomic write to vma->vm_lock_seq in vma_start_write() is mostly
> fine because all accesses to vma->vm_lock_seq that matter are always
> protected by the VMA lock. There is a racy read in vma_start_read()
> though that can tolerate false-positives, so we should be using
> WRITE_ONCE() to keep things tidy and data-race-free (including for
> KCSAN).
>
> On the other side, lock_vma_under_rcu() works as follows in the relevant
> region for locking and userfaultfd check:
>
> lock_vma_under_rcu
>   vma_start_read
>     vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq) [early bailout]
>     down_read_trylock(&vma->vm_lock->lock)
>     vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq) [main check]
>   userfaultfd_armed
>     checks vma->vm_flags & __VM_UFFD_FLAGS
>
> Here, the interesting aspect is how far down the mm->mm_lock_seq read can
> be reordered - if this read is reordered down below the vma->vm_flags
> access, this could cause lock_vma_under_rcu() to partly operate on
> information that was read while the VMA was supposed to be locked.
> To prevent this kind of downwards bleeding of the mm->mm_lock_seq read,
> we need to read it with a load-acquire.
>
> Some of the comment wording is based on suggestions by Suren.
>
> BACKPORT WARNING: One of the functions changed by this patch (which I've
> written against Linus' tree) is vma_try_start_write(), but this function
> no longer exists in mm/mm-everything. I don't know whether the merged
> version of this patch will be ordered before or after the patch that
> removes vma_try_start_write(). If you're backporting this patch to a
> tree with vma_try_start_write(), make sure this patch changes that
> function.

I staged this patch as a hotfix, ahead of mm-unstable material.

The conflict is with Hugh's "mm: delete mmap_write_trylock() and
vma_try_start_write()"
(https://lkml.kernel.org/r/4e6db3d-e8e-73fb-1f2a-8de2dab2a87c@xxxxxxxxxx)

I fixed the reject in the obvious way (deleted the function anyway), but
there's a possibility that the ordering issue you have addressed will now
be reintroduced by Hugh's series, so please let's review that.
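
For reference, a minimal sketch of the release/acquire pairing being
described, using the helper names from the call traces quoted above
(the bodies here are illustrative; the exact shape of these helpers in
the merged patch may differ):

	/* Writer side: called when dropping the mmap write lock. */
	static inline void vma_end_write_all(struct mm_struct *mm)
	{
		mmap_assert_write_locked(mm);
		/*
		 * Store-release pairs with the load-acquire in
		 * vma_start_read(): every VMA modification made while
		 * write-locked (e.g. the __vm_flags update) must be
		 * visible before a reader can observe the new sequence
		 * count and conclude that the VMA is unlocked.
		 */
		smp_store_release(&mm->mm_lock_seq, mm->mm_lock_seq + 1);
	}

	/* Reader side: simplified from the vma_start_read() flow above. */
	static inline bool vma_start_read(struct vm_area_struct *vma)
	{
		/* Early bailout; a false positive here only costs a fallback. */
		if (vma->vm_lock_seq == smp_load_acquire(&vma->vm_mm->mm_lock_seq))
			return false;

		if (unlikely(!down_read_trylock(&vma->vm_lock->lock)))
			return false;

		/*
		 * Main check: the load-acquire prevents later reads (such
		 * as the vma->vm_flags check in userfaultfd_armed()) from
		 * being performed before this read, i.e. from seeing state
		 * that was observed while the VMA may still have been
		 * write-locked.
		 */
		if (unlikely(vma->vm_lock_seq ==
			     smp_load_acquire(&vma->vm_mm->mm_lock_seq))) {
			up_read(&vma->vm_lock->lock);
			return false;
		}
		return true;
	}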