* Peter Xu <peterx@xxxxxxxxxx> [230516 16:12]: ... > > > > > > Signed-off-by: Peter Xu <peterx@xxxxxxxxxx> > > > --- > > > fs/userfaultfd.c | 27 +++++++++++++++++++++------ > > > 1 file changed, 21 insertions(+), 6 deletions(-) > > > > > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c > > > index 0fd96d6e39ce..7eb88bc74d00 100644 > > > --- a/fs/userfaultfd.c > > > +++ b/fs/userfaultfd.c > > > @@ -1458,10 +1458,17 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx, > > > BUG_ON(!found); > > > > > > vma_iter_set(&vmi, start); > > > - prev = vma_prev(&vmi); > > > + vma = vma_find(&vmi, end); > > > + if (!vma) > > > + goto out_unlock; > > > + > > > + if (vma->vm_start < start) > > > + prev = vma; > > > + else > > > + prev = vma_prev(&vmi); > > > > > > ret = 0; > > > - for_each_vma_range(vmi, vma, end) { > > > + do { > > > > The iterator may be off by one, depending on if vma_prev() is called or > > not. > > > > Perhaps: > > prev = vma_prev(&vmi); /* could be wrong, or null */ > > vma = vma_find(&vmi, end); > > if (!vma) > > goto out_unlock; > > > > if (vma->vm_start < start) > > prev = vma; > > > > now we know we are at vma with the iterator.. > > ret = 0 > > do{ > > ... > > Will do, thanks. > > I think I got trapped similarly when I was looking at xarray months ago > where xarray also had similar side effects to have off-by-one the iterator > behavior. > > Do you think it'll make sense to have something describing such side > effects for maple tree (or the current vma api), or.. maybe there's already > some but I just didn't really know? Well, it's an iterator so I though a position was implied. But I think the documentation is lacking on the vma_iterator interface and I should fix that. ... > > > From: Peter Xu <peterx@xxxxxxxxxx> > > > Date: Tue, 16 May 2023 09:39:38 -0400 > > > Subject: [PATCH 2/2] mm/uffd: Allow vma to merge as much as possible > > > > > > We used to not pass in the pgoff correctly when register/unregister uffd > > > regions, it caused incorrect behavior on vma merging. > > > > > > For example, when we have: > > > > > > vma1(range 0-9, with uffd), vma2(range 10-19, no uffd) > > > > > > Then someone unregisters uffd on range (5-9), it should become: > > > > > > vma1(range 0-4, with uffd), vma2(range 5-19, no uffd) > > > > > > But with current code it's: > > > > > > vma1(range 0-4, with uffd), vma3(range 5-9, no uffd), vma2(range 10-19, no uffd) > > > > > > This patch allows such merge to happen correctly. > > > > > > This behavior seems to have existed since the 1st day of uffd, keep it just > > > as a performance optmization and not copy stable. > > > > > > Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx> > > > Cc: Mike Rapoport (IBM) <rppt@xxxxxxxxxx> > > > Signed-off-by: Peter Xu <peterx@xxxxxxxxxx> > > > --- > > > fs/userfaultfd.c | 8 ++++++-- > > > 1 file changed, 6 insertions(+), 2 deletions(-) > > > > > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c > > > index 7eb88bc74d00..891048b4799f 100644 > > > --- a/fs/userfaultfd.c > > > +++ b/fs/userfaultfd.c > > > @@ -1332,6 +1332,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx, > > > bool basic_ioctls; > > > unsigned long start, end, vma_end; > > > struct vma_iterator vmi; > > > + pgoff_t pgoff; > > > > > > user_uffdio_register = (struct uffdio_register __user *) arg; > > > > > > @@ -1489,8 +1490,9 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx, > > > vma_end = min(end, vma->vm_end); > > > > > > new_flags = (vma->vm_flags & ~__VM_UFFD_FLAGS) | vm_flags; > > > + pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT); > > > > I don't think this is safe. You are telling vma_merge() something that > > is not true and will result in can_vma_merge_before() passing. I mean, > > sure it will become true after you split (unless you can't?), but I > > don't know if you can just merge a VMA that doesn't pass > > can_vma_merge_before(), even for a short period? > > I must admit I'm not really that handy yet to vma codes, so I could miss > something obvious. > > My reasoning comes from two parts that this pgoff looks all fine: > > 1) It's documented in vma_merge() in that: > > * Given a mapping request (addr,end,vm_flags,file,pgoff,anon_name), > * figure out ... > > So fundamentally this pgoff is part of the mapping request paired with > all the rest of the information. AFAICT it means it must match with what > "addr" is describing in VA address space. That's why I think offseting > it makes sense here. > > It also matches my understanding in vma_merge() code on how the pgoff is > used. > > 2) Uffd is nothing special in this regard, namely: > > mbind_range(): > > pgoff = vma->vm_pgoff + ((vmstart - vma->vm_start) >> PAGE_SHIFT); > merged = vma_merge(vmi, vma->vm_mm, *prev, vmstart, vmend, vma->vm_flags, > vma->anon_vma, vma->vm_file, pgoff, new_pol, > vma->vm_userfaultfd_ctx, anon_vma_name(vma)); > > mlock_fixup(): > > pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT); > *prev = vma_merge(vmi, mm, *prev, start, end, newflags, > vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma), > vma->vm_userfaultfd_ctx, anon_vma_name(vma)); > > mprotect_fixup(): > > pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT); > *pprev = vma_merge(vmi, mm, *pprev, start, end, newflags, > vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma), > vma->vm_userfaultfd_ctx, anon_vma_name(vma)); > > I had a feeling that it's just something overlooked in the initial proposal > of uffd, but maybe I missed something important? I think you are correct. It's worth noting that all of these skip splitting if merging succeeds. We know it won't match case 1-4 (we have a current vma). We then pass in vma_end = min(end, vma->vm_end); vma_lookup() will only be called if end == vma->vm_end, so next will not be set (and found) unless it is adjacent to the current vma and the vma in question does not need to be split anyways. I also see that we use pgoff+pglen in the check, which avoids my concern above. > > > > > > prev = vma_merge(&vmi, mm, prev, start, vma_end, new_flags, > > > - vma->anon_vma, vma->vm_file, vma->vm_pgoff, > > > + vma->anon_vma, vma->vm_file, pgoff, > > > vma_policy(vma), > > > ((struct vm_userfaultfd_ctx){ ctx }), > > > anon_vma_name(vma)); > > > @@ -1570,6 +1572,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx, > > > unsigned long start, end, vma_end; > > > const void __user *buf = (void __user *)arg; > > > struct vma_iterator vmi; > > > + pgoff_t pgoff; > > > > > > ret = -EFAULT; > > > if (copy_from_user(&uffdio_unregister, buf, sizeof(uffdio_unregister))) > > > @@ -1677,8 +1680,9 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx, > > > uffd_wp_range(vma, start, vma_end - start, false); > > > > > > new_flags = vma->vm_flags & ~__VM_UFFD_FLAGS; > > > + pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT); > > > prev = vma_merge(&vmi, mm, prev, start, vma_end, new_flags, > > > - vma->anon_vma, vma->vm_file, vma->vm_pgoff, > > > + vma->anon_vma, vma->vm_file, pgoff, > > > vma_policy(vma), > > > NULL_VM_UFFD_CTX, anon_vma_name(vma)); > > > if (prev) { > > > -- > > > 2.39.1 > > > ---8<--- > > > > > > -- > > > Peter Xu > > > > > > > -- > Peter Xu >