On Tue, May 16, 2023 at 06:52:51PM -0400, Liam R. Howlett wrote: > * Peter Xu <peterx@xxxxxxxxxx> [230516 16:12]: > ... > > > > > > > > > Signed-off-by: Peter Xu <peterx@xxxxxxxxxx> > > > > --- > > > > fs/userfaultfd.c | 27 +++++++++++++++++++++------ > > > > 1 file changed, 21 insertions(+), 6 deletions(-) > > > > > > > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c > > > > index 0fd96d6e39ce..7eb88bc74d00 100644 > > > > --- a/fs/userfaultfd.c > > > > +++ b/fs/userfaultfd.c > > > > @@ -1458,10 +1458,17 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx, > > > > BUG_ON(!found); > > > > > > > > vma_iter_set(&vmi, start); > > > > - prev = vma_prev(&vmi); > > > > + vma = vma_find(&vmi, end); > > > > + if (!vma) > > > > + goto out_unlock; > > > > + > > > > + if (vma->vm_start < start) > > > > + prev = vma; > > > > + else > > > > + prev = vma_prev(&vmi); > > > > > > > > ret = 0; > > > > - for_each_vma_range(vmi, vma, end) { > > > > + do { > > > > > > The iterator may be off by one, depending on if vma_prev() is called or > > > not. > > > > > > Perhaps: > > > prev = vma_prev(&vmi); /* could be wrong, or null */ > > > vma = vma_find(&vmi, end); > > > if (!vma) > > > goto out_unlock; > > > > > > if (vma->vm_start < start) > > > prev = vma; > > > > > > now we know we are at vma with the iterator.. > > > ret = 0 > > > do{ > > > ... > > > > Will do, thanks. > > > > I think I got trapped similarly when I was looking at xarray months ago > > where xarray also had similar side effects to have off-by-one the iterator > > behavior. > > > > Do you think it'll make sense to have something describing such side > > effects for maple tree (or the current vma api), or.. maybe there's already > > some but I just didn't really know? > > Well, it's an iterator so I though a position was implied. But I think > the documentation is lacking on the vma_iterator interface and I should > fix that. Thanks. > > ... > > > > > From: Peter Xu <peterx@xxxxxxxxxx> > > > > Date: Tue, 16 May 2023 09:39:38 -0400 > > > > Subject: [PATCH 2/2] mm/uffd: Allow vma to merge as much as possible > > > > > > > > We used to not pass in the pgoff correctly when register/unregister uffd > > > > regions, it caused incorrect behavior on vma merging. > > > > > > > > For example, when we have: > > > > > > > > vma1(range 0-9, with uffd), vma2(range 10-19, no uffd) > > > > > > > > Then someone unregisters uffd on range (5-9), it should become: > > > > > > > > vma1(range 0-4, with uffd), vma2(range 5-19, no uffd) > > > > > > > > But with current code it's: > > > > > > > > vma1(range 0-4, with uffd), vma3(range 5-9, no uffd), vma2(range 10-19, no uffd) > > > > > > > > This patch allows such merge to happen correctly. > > > > > > > > This behavior seems to have existed since the 1st day of uffd, keep it just > > > > as a performance optmization and not copy stable. > > > > > > > > Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx> > > > > Cc: Mike Rapoport (IBM) <rppt@xxxxxxxxxx> > > > > Signed-off-by: Peter Xu <peterx@xxxxxxxxxx> > > > > --- > > > > fs/userfaultfd.c | 8 ++++++-- > > > > 1 file changed, 6 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c > > > > index 7eb88bc74d00..891048b4799f 100644 > > > > --- a/fs/userfaultfd.c > > > > +++ b/fs/userfaultfd.c > > > > @@ -1332,6 +1332,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx, > > > > bool basic_ioctls; > > > > unsigned long start, end, vma_end; > > > > struct vma_iterator vmi; > > > > + pgoff_t pgoff; > > > > > > > > user_uffdio_register = (struct uffdio_register __user *) arg; > > > > > > > > @@ -1489,8 +1490,9 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx, > > > > vma_end = min(end, vma->vm_end); > > > > > > > > new_flags = (vma->vm_flags & ~__VM_UFFD_FLAGS) | vm_flags; > > > > + pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT); > > > > > > I don't think this is safe. You are telling vma_merge() something that > > > is not true and will result in can_vma_merge_before() passing. I mean, > > > sure it will become true after you split (unless you can't?), but I > > > don't know if you can just merge a VMA that doesn't pass > > > can_vma_merge_before(), even for a short period? > > > > I must admit I'm not really that handy yet to vma codes, so I could miss > > something obvious. > > > > My reasoning comes from two parts that this pgoff looks all fine: > > > > 1) It's documented in vma_merge() in that: > > > > * Given a mapping request (addr,end,vm_flags,file,pgoff,anon_name), > > * figure out ... > > > > So fundamentally this pgoff is part of the mapping request paired with > > all the rest of the information. AFAICT it means it must match with what > > "addr" is describing in VA address space. That's why I think offseting > > it makes sense here. > > > > It also matches my understanding in vma_merge() code on how the pgoff is > > used. > > > > 2) Uffd is nothing special in this regard, namely: > > > > mbind_range(): > > > > pgoff = vma->vm_pgoff + ((vmstart - vma->vm_start) >> PAGE_SHIFT); > > merged = vma_merge(vmi, vma->vm_mm, *prev, vmstart, vmend, vma->vm_flags, > > vma->anon_vma, vma->vm_file, pgoff, new_pol, > > vma->vm_userfaultfd_ctx, anon_vma_name(vma)); > > > > mlock_fixup(): > > > > pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT); > > *prev = vma_merge(vmi, mm, *prev, start, end, newflags, > > vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma), > > vma->vm_userfaultfd_ctx, anon_vma_name(vma)); > > > > mprotect_fixup(): > > > > pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT); > > *pprev = vma_merge(vmi, mm, *pprev, start, end, newflags, > > vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma), > > vma->vm_userfaultfd_ctx, anon_vma_name(vma)); > > > > I had a feeling that it's just something overlooked in the initial proposal > > of uffd, but maybe I missed something important? > > I think you are correct. It's worth noting that all of these skip > splitting if merging succeeds. Yes, IIUC that's what we want because vma_merge() just handles everything there (including split, or say, vma range adjustments) if any !NULL returned. > > We know it won't match case 1-4 (we have a current vma). We then pass > in vma_end = min(end, vma->vm_end); Case 4 seems still possible and should be the case that mentioned in the patch 2, iiuc. But yes I think vma_end calculation is needed, afaik it is to cover the last iteration, where that's the only place possible that we may operate on "end" (where < vma->vm_end) rather than "vma->vm_end". It actually pairs with the initial "start" adjustment to me. > > vma_lookup() will only be called if end == vma->vm_end, so next will not > be set (and found) unless it is adjacent to the current vma and the vma > in question does not need to be split anyways. > > I also see that we use pgoff+pglen in the check, which avoids my concern > above. Right. It seems so far all concerns are more or less ruled out. I'll prepare a formal patchset, we can continue the discussion there. Thanks, -- Peter Xu