On Tue, May 16, 2023 at 11:06:40AM -0400, Peter Xu wrote: > On Tue, May 16, 2023 at 12:07:11AM +0100, Lorenzo Stoakes wrote: > > On Mon, May 15, 2023 at 11:04:27PM +0100, Lorenzo Stoakes wrote: > > [snip] > > > > Could you explain a bit why we don't need to merge in this case? > > > > > > > > I'm considering, for example, when we have: > > > > > > > > vma1(range 0-9, with uffd), vma2(range 10-19, no uffd) > > > > > > > > Then someone unregisters uffd on range (5-9), iiuc it should become: > > > > > > > > vma1(range 0-4, with uffd), vma2(range 5-19, no uffd) > > > > > > > > But if no merge here it's: > > > > > > > > vma1(range 0-4, with uffd), vma3(range 5-9, no uffd), vma2(range 10-19, no uffd) > > > > > > > > Maybe I missed something? > > > > > > > > > > There's something really, really wrong with this. It simply isn't valid to > > > invoke vma_merge() over an existing VMA that != prev where you're not > > > specifying addr = vma->vm_start, end == vma->vm_end. > > > > > > This seems like you're relying on:- > > > > > > *** > > > CCCCCNNNNN -> CCNNNNNNNN > > I had a closer look today, I still think this patch is not really the right > one. The split/merge order is something we use everywhere and I am not > convinced it must change as drastic. At least so far it still seems to me > if we do with what current patch proposed we can have vma fragmentations. 'something we use everywhere' is not an argument (speak to Willy about folios), vma_merge() expects valid input, relying on it _happening_ to be ok or to fail in ways that _happen_ not to cause big problems is not right. This is just further evidence that the vma_merge() interface is fundamentally broken. Implicitly assuming you will only get a partial prev overlap merge is far from intuitive. I am definitely going to try to do a series addressing vma_merge() horrors because I feel like we need a generic means of doing this split/merge pattern. > > I think I see what you meant, but here I think it's a legal case where we > should have PPPP rather than CCCC (PPPPPNNNN --> PPNNNNNNNN). > > To be explicit, for register I think it _should_ be the case 0 where we > cannot merge (note: the current code is indeed wrong though, see below): > > **** > PPPPPPNNNNNN > cannot merge > > While for the unregister case here it's case 4: > > **** > PPPPPPNNNNNN > might become > PPNNNNNNNNNN > case 4 below > > Here the problem is not that we need to do split then merge (I think it'll > have the problem of vma defragmentation here), the problem is we simply > passed in the wrong "prev" vma pointer, IMHO. I've patches attached > showing what I meant. Yeah if you do it with prev = vma this form should _probably_ work, that's a good point. This _is_ a case (see https://ljs.io/vma_merge_cases.png for nice diagram of cases btw :), like 5, where we actually do split and merge at the same time. Liam's raised some issues with the safety of your patches, let me look at them when I get a chance, nasty headcold = brain less functional atm. I would say for now this fix resolves the issue in a way that should emphatically avoid invalid input to vma_merge(), the fragmentation existed before so this is not a new issue, so for the time being I think it's ok to stay as-is. > > I checked the original commit from Andrea and I found that it _was_ correct: > > commit 86039bd3b4e6a1129318cbfed4e0a6e001656635 > Author: Andrea Arcangeli <aarcange@xxxxxxxxxx> > Date: Fri Sep 4 15:46:31 2015 -0700 > > userfaultfd: add new syscall to provide memory externalization > > I had a feeling that it's broken during the recent rework on vma (or maybe > even not that close), but I'm not yet sure and need to further check. > > > > > > > By specifying parameters that are compatible with N even though you're only > > > partially spanning C? > > > > > > This is crazy, and isn't how this should be used. vma_merge() is not > > > supposed to do partial merges. If it works (presumably it does) this is not > > > by design unless I've lost my mind and I (and others) have somehow not > > > noticed this?? > > > > > > I think you're right that now we'll end up with more fragmentation, but > > > what you're suggesting is not how vma_merge() is supposed to work. > > > > > > As I said above, giving vma_merge() invalid parameters is very dangerous as > > > you could end up merging over empty ranges in theory (and could otherwise > > > have corruption). > > > > > > I guess we should probably be passing 0 to the last parameter in > > > split_vma() here then to ensure we do a merge pass too. Will experiment > > > with this. > > > > > > I'm confused as to how the remove from case 8 is not proceeding. I'll look > > > into this some more... > > > > > > Happy to be corrected if I'm misconstruing this! > > > > > > > OK, so I wrote a small program to do perform exactly this case [0] and it seems > > that the outcome is the same before and after this patch - vma_merge() is > > clearly rejecting the case 8 merge (phew!) and in both instances you end up with > > 3 VMAs. > > > > So this patch doesn't change this behaviour and everything is as it was > > before. Ideally we'd let it go for another pass, so maybe we should change the > > split to add a new VMA _afterwards_. Will experiment with that, separately. > > > > But looks like the patch is good as it is. > > > > (if you notice something wrong with the repro, etc. do let me know!) > > > > [0]: https://gist.github.com/lorenzo-stoakes/a11a10f5f479e7a977fc456331266e0e > > I think your test case is fine, but... no, this is still not expected. The > vma should just merge. > > So I have another closer look on this specific issue, it seems we have a > long standing bug on pgoff calculation here when passing that to > vma_merge(). I've got another patch attached to show what I meant. > > To summarize.. now I've got two patches attached: > > Patch 1 is something I'd like to propose to replace this patch that fixes > incorrect use of vma_merge() so it should also eliminate the assertion > being triggered (I still think this is a regression but I need to check > which I will do later; I'm not super familiar with maple tree work, maybe > you or Liam can quickly spot). > > Patch 2 fixes the long standing issue of vma not being able to merge in > above case, and with patch 2 applied it should start merging all right. > > Please have a look, thanks. > > ---8<--- > From 6bc39028bba246394bb0bafdaeaab7b8dfd1694f Mon Sep 17 00:00:00 2001 > From: Peter Xu <peterx@xxxxxxxxxx> > Date: Tue, 16 May 2023 09:03:22 -0400 > Subject: [PATCH 1/2] mm/uffd: Fix vma operation where start addr cuts part of > vma > > It seems vma merging with uffd paths is broken with either > register/unregister, where right now we can feed wrong parameters to > vma_merge() and it's found by recent patch which moved asserts upwards in > vma_merge(): > > https://lore.kernel.org/all/ZFunF7DmMdK05MoF@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/ > > The problem is in the current code base we didn't fixup "prev" for the case > where "start" address can be within the "prev" vma section. In that case > we should have "prev" points to the current vma rather than the previous > one when feeding to vma_merge(). > > This will eliminate the report and make sure vma_merge() calls will become > legal again. > > Signed-off-by: Peter Xu <peterx@xxxxxxxxxx> > --- > fs/userfaultfd.c | 27 +++++++++++++++++++++------ > 1 file changed, 21 insertions(+), 6 deletions(-) > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c > index 0fd96d6e39ce..7eb88bc74d00 100644 > --- a/fs/userfaultfd.c > +++ b/fs/userfaultfd.c > @@ -1458,10 +1458,17 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx, > BUG_ON(!found); > > vma_iter_set(&vmi, start); > - prev = vma_prev(&vmi); > + vma = vma_find(&vmi, end); > + if (!vma) > + goto out_unlock; > + > + if (vma->vm_start < start) > + prev = vma; > + else > + prev = vma_prev(&vmi); > > ret = 0; > - for_each_vma_range(vmi, vma, end) { > + do { > cond_resched(); > > BUG_ON(!vma_can_userfault(vma, vm_flags)); > @@ -1517,7 +1524,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx, > skip: > prev = vma; > start = vma->vm_end; > - } > + } for_each_vma_range(vmi, vma, end); > > out_unlock: > mmap_write_unlock(mm); > @@ -1624,9 +1631,17 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx, > BUG_ON(!found); > > vma_iter_set(&vmi, start); > - prev = vma_prev(&vmi); > + vma = vma_find(&vmi, end); > + if (!vma) > + goto out_unlock; > + > + if (vma->vm_start < start) > + prev = vma; > + else > + prev = vma_prev(&vmi); > + > ret = 0; > - for_each_vma_range(vmi, vma, end) { > + do { > cond_resched(); > > BUG_ON(!vma_can_userfault(vma, vma->vm_flags)); > @@ -1692,7 +1707,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx, > skip: > prev = vma; > start = vma->vm_end; > - } > + } for_each_vma_range(vmi, vma, end); > > out_unlock: > mmap_write_unlock(mm); > -- > 2.39.1 > > From bf61f3c937e9e2ab96ab2bed0005cb7dc74db231 Mon Sep 17 00:00:00 2001 > From: Peter Xu <peterx@xxxxxxxxxx> > Date: Tue, 16 May 2023 09:39:38 -0400 > Subject: [PATCH 2/2] mm/uffd: Allow vma to merge as much as possible > > We used to not pass in the pgoff correctly when register/unregister uffd > regions, it caused incorrect behavior on vma merging. > > For example, when we have: > > vma1(range 0-9, with uffd), vma2(range 10-19, no uffd) > > Then someone unregisters uffd on range (5-9), it should become: > > vma1(range 0-4, with uffd), vma2(range 5-19, no uffd) > > But with current code it's: > > vma1(range 0-4, with uffd), vma3(range 5-9, no uffd), vma2(range 10-19, no uffd) > > This patch allows such merge to happen correctly. > > This behavior seems to have existed since the 1st day of uffd, keep it just > as a performance optmization and not copy stable. > > Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx> > Cc: Mike Rapoport (IBM) <rppt@xxxxxxxxxx> > Signed-off-by: Peter Xu <peterx@xxxxxxxxxx> > --- > fs/userfaultfd.c | 8 ++++++-- > 1 file changed, 6 insertions(+), 2 deletions(-) > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c > index 7eb88bc74d00..891048b4799f 100644 > --- a/fs/userfaultfd.c > +++ b/fs/userfaultfd.c > @@ -1332,6 +1332,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx, > bool basic_ioctls; > unsigned long start, end, vma_end; > struct vma_iterator vmi; > + pgoff_t pgoff; > > user_uffdio_register = (struct uffdio_register __user *) arg; > > @@ -1489,8 +1490,9 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx, > vma_end = min(end, vma->vm_end); > > new_flags = (vma->vm_flags & ~__VM_UFFD_FLAGS) | vm_flags; > + pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT); > prev = vma_merge(&vmi, mm, prev, start, vma_end, new_flags, > - vma->anon_vma, vma->vm_file, vma->vm_pgoff, > + vma->anon_vma, vma->vm_file, pgoff, > vma_policy(vma), > ((struct vm_userfaultfd_ctx){ ctx }), > anon_vma_name(vma)); > @@ -1570,6 +1572,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx, > unsigned long start, end, vma_end; > const void __user *buf = (void __user *)arg; > struct vma_iterator vmi; > + pgoff_t pgoff; > > ret = -EFAULT; > if (copy_from_user(&uffdio_unregister, buf, sizeof(uffdio_unregister))) > @@ -1677,8 +1680,9 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx, > uffd_wp_range(vma, start, vma_end - start, false); > > new_flags = vma->vm_flags & ~__VM_UFFD_FLAGS; > + pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT); > prev = vma_merge(&vmi, mm, prev, start, vma_end, new_flags, > - vma->anon_vma, vma->vm_file, vma->vm_pgoff, > + vma->anon_vma, vma->vm_file, pgoff, > vma_policy(vma), > NULL_VM_UFFD_CTX, anon_vma_name(vma)); > if (prev) { > -- > 2.39.1 > ---8<--- > > -- > Peter Xu >