* Peter Xu <peterx@xxxxxxxxxx> [230516 17:40]: > On Tue, May 16, 2023 at 10:01:06PM +0100, Lorenzo Stoakes wrote: > > On Tue, May 16, 2023 at 04:30:11PM -0400, Peter Xu wrote: > > > On Tue, May 16, 2023 at 08:25:13PM +0100, Lorenzo Stoakes wrote: > > > > On Tue, May 16, 2023 at 11:06:40AM -0400, Peter Xu wrote: > > > > > On Tue, May 16, 2023 at 12:07:11AM +0100, Lorenzo Stoakes wrote: > > > > > > On Mon, May 15, 2023 at 11:04:27PM +0100, Lorenzo Stoakes wrote: > > > > > > [snip] > > > > > > > > Could you explain a bit why we don't need to merge in this case? > > > > > > > > > > > > > > > > I'm considering, for example, when we have: > > > > > > > > > > > > > > > > vma1(range 0-9, with uffd), vma2(range 10-19, no uffd) > > > > > > > > > > > > > > > > Then someone unregisters uffd on range (5-9), iiuc it should become: > > > > > > > > > > > > > > > > vma1(range 0-4, with uffd), vma2(range 5-19, no uffd) > > > > > > > > > > > > > > > > But if no merge here it's: > > > > > > > > > > > > > > > > vma1(range 0-4, with uffd), vma3(range 5-9, no uffd), vma2(range 10-19, no uffd) > > > > > > > > > > > > > > > > Maybe I missed something? > > > > > > > > > > > > > > > > > > > > > > There's something really, really wrong with this. It simply isn't valid to > > > > > > > invoke vma_merge() over an existing VMA that != prev where you're not > > > > > > > specifying addr = vma->vm_start, end == vma->vm_end. > > > > > > > > > > > > > > This seems like you're relying on:- > > > > > > > > > > > > > > *** > > > > > > > CCCCCNNNNN -> CCNNNNNNNN > > > > > > > > > > I had a closer look today, I still think this patch is not really the right > > > > > one. The split/merge order is something we use everywhere and I am not > > > > > convinced it must change as drastic. At least so far it still seems to me > > > > > if we do with what current patch proposed we can have vma fragmentations. > > > > > > > > 'something we use everywhere' is not an argument (speak to Willy about > > > > folios), vma_merge() expects valid input, relying on it _happening_ to be > > > > > > I still think it's an argument. > > > > > > I believe Matthew tried hard to justify he's correct in that aspect when > > > changing "something we used everywhere". Blindly referencing it doesn't > > > help much on a technical discussion. > > > > > > If we have similar code that handles similar things, IMHO you need a reason > > > to modify one of them to not like the other. > > > > > > If you think what you proposed is better, please consider (1) justify why > > > it's better, and then (2) also apply it to all the rest places where > > > applicable. Please refer to my reply to Liam on the other use cases of > > > vma_merge(). > > > > > > Said that, I apprecaite a lot on your assert patch that found this problem. > > > > > > > ok or to fail in ways that _happen_ not to cause big problems is not right. > > > > > > > > This is just further evidence that the vma_merge() interface is > > > > fundamentally broken. Implicitly assuming you will only get a partial prev > > > > overlap merge is far from intuitive. > > > > > > > > I am definitely going to try to do a series addressing vma_merge() horrors > > > > because I feel like we need a generic means of doing this split/merge pattern. > > > > > > It'll be great if you can make it better. > > > > > > > > > > > > > > > > > I think I see what you meant, but here I think it's a legal case where we > > > > > should have PPPP rather than CCCC (PPPPPNNNN --> PPNNNNNNNN). > > > > > > > > > > To be explicit, for register I think it _should_ be the case 0 where we > > > > > cannot merge (note: the current code is indeed wrong though, see below): > > > > > > > > > > **** > > > > > PPPPPPNNNNNN > > > > > cannot merge > > > > > > > > > > While for the unregister case here it's case 4: > > > > > > > > > > **** > > > > > PPPPPPNNNNNN > > > > > might become > > > > > PPNNNNNNNNNN > > > > > case 4 below > > > > > > > > > > Here the problem is not that we need to do split then merge (I think it'll > > > > > have the problem of vma defragmentation here), the problem is we simply > > > > > passed in the wrong "prev" vma pointer, IMHO. I've patches attached > > > > > showing what I meant. > > > > > > > > Yeah if you do it with prev = vma this form should _probably_ work, that's > > > > a good point. This _is_ a case (see https://ljs.io/vma_merge_cases.png for > > > > nice diagram of cases btw :), like 5, where we actually do split and merge > > > > at the same time. > > > > > > It's not something I came up with myself, it's just that I started looking > > > back to see what people did and trying to understand why they did it. > > > Normally people did things with good reasons. > > > > > > > Yup, it's an iffy pattern in each case. I can see why you chose to do it, > > it's not a criticism of that, it's just that the incorrect (but > > accidentally ok to be incorrect seemingly) input is made more obvious by > > recent changes. > > > > > So far it seems clearer at least to me to keep this pattern of "merge then > > > split". But I'm happy to be proven wrong anytime. > > > > But you're not, you're doing a vma_merge() and (without it being clear) > > hoping it does a merge AND SPLIT in case in an instance where that might be > > required followed by you doing any further splits required. > > I don't quite get the reason that you seem to keep saying this is > "incorrect input" to vma_merge(). vma_merge() definitely includes splits > too where it can, to be explicit, for case 4 & 5. > > It seems to me what you're trying to explain is we shouldn't handle any > split in vma_merge() so we should move cases 4 & 5 out of vma_merge(). If > we split first then merge, cases 4 & 5 will become case 2 & 3 after split. We don't split in case 4 or 5 - we adjust the existing VMA limits. We don't actually handle any splits in vma_merge(). I think splitting first would change 4 & 5 to 7 & 8? 2 & 3 would require a split and munmap, right? > My question would be: if it worked perfect in the past few years and it > looks all good enough, why bother.. I suspect, but it's not clear (like all of this), that the other arguments to vma_merge() is ruling out this potential hazard I thought existed. > > > > > But it's _not your fault_ that this is the standard approach, it's the > > interface that's wrong absolutely. > > > > To me doing this is quite a bit less clear than simply 'splitting so this > > is mergable first then try to merge' but obviously this current patch does > > not avoid the previously introduced fragmentation. However it does maintain > > existing behaviour. > > > > > > > > > > > > > Liam's raised some issues with the safety of your patches, let me look at > > > > them when I get a chance, nasty headcold = brain less functional atm. > > > > > > Sure, no need to rush. > > > > > > > > > > > I would say for now this fix resolves the issue in a way that should > > > > emphatically avoid invalid input to vma_merge(), the fragmentation existed > > > > before so this is not a new issue, so for the time being I think it's ok to > > > > stay as-is. > > > > > > So far I would still suggest having uffd code the same as the rest if > > > possible. > > > > > > > Absolutely, but in the immediate term, we are seeing VM_WARN_ON()'s here > > and not from other callers (which is actually surprising). > > Not surprising anymore to me, simply because the uffd path was the only one > got overlooked when converting to maple in commit 69dbe6daf104, as Liam > rightfully pointed out. > > For example, mprotect() has similar handling when the start addr of the > range can be in the middle of a vma, then in do_mprotect_pkey() there is: > > prev = vma_prev(&vmi); > if (start > vma->vm_start) > prev = vma; > > IMHO it's the same thing. > > > > > Again, we absolutely do need an abstracted solution for the whole. And > > that's something I'll try to work on! > > > > > I think I'll wait for the other discussion to settle on patch 2 to see > > > whether I should post a formal patchset. > > > > My suggestion is that it's ok to proceed as-is, not because this is the > > final change that should be applied, but rather because it resolves the > > issue while maintaining existing behaviour. > > > > I will be more than happy to see patches land after this one that replace > > it if necessary but I think it's safe for this to land as a mainline fixup, > > even if temporary, is all I am saying :) > > I'd hope we can replace this patch with my patch 1 if possible because I > _think_ this patch is still in hotfixes-unstable (while patch 2 doesn't > need to copy stable in all cases). Andrew may know better on how to > proceed. > > If this lands first, I'll probably propose a full revert otherwise as the > 1st patch of the patchset to post, then make uffd the same as others. > > Before that I'd like to know whether you agree that the new patch 1 (I'll > fixup the vma_prev() side effect) could be a better solution than the > current one, no matter whether we need a full revert or not. > > Thanks, > > > > > > > > > > > > > > > > > > > > I checked the original commit from Andrea and I found that it _was_ correct: > > > > > > > > > > commit 86039bd3b4e6a1129318cbfed4e0a6e001656635 > > > > > Author: Andrea Arcangeli <aarcange@xxxxxxxxxx> > > > > > Date: Fri Sep 4 15:46:31 2015 -0700 > > > > > > > > > > userfaultfd: add new syscall to provide memory externalization > > > > > > > > > > I had a feeling that it's broken during the recent rework on vma (or maybe > > > > > even not that close), but I'm not yet sure and need to further check. > > > > > > > > > > > > > > > > > > > By specifying parameters that are compatible with N even though you're only > > > > > > > partially spanning C? > > > > > > > > > > > > > > This is crazy, and isn't how this should be used. vma_merge() is not > > > > > > > supposed to do partial merges. If it works (presumably it does) this is not > > > > > > > by design unless I've lost my mind and I (and others) have somehow not > > > > > > > noticed this?? > > > > > > > > > > > > > > I think you're right that now we'll end up with more fragmentation, but > > > > > > > what you're suggesting is not how vma_merge() is supposed to work. > > > > > > > > > > > > > > As I said above, giving vma_merge() invalid parameters is very dangerous as > > > > > > > you could end up merging over empty ranges in theory (and could otherwise > > > > > > > have corruption). > > > > > > > > > > > > > > I guess we should probably be passing 0 to the last parameter in > > > > > > > split_vma() here then to ensure we do a merge pass too. Will experiment > > > > > > > with this. > > > > > > > > > > > > > > I'm confused as to how the remove from case 8 is not proceeding. I'll look > > > > > > > into this some more... > > > > > > > > > > > > > > Happy to be corrected if I'm misconstruing this! > > > > > > > > > > > > > > > > > > > OK, so I wrote a small program to do perform exactly this case [0] and it seems > > > > > > that the outcome is the same before and after this patch - vma_merge() is > > > > > > clearly rejecting the case 8 merge (phew!) and in both instances you end up with > > > > > > 3 VMAs. > > > > > > > > > > > > So this patch doesn't change this behaviour and everything is as it was > > > > > > before. Ideally we'd let it go for another pass, so maybe we should change the > > > > > > split to add a new VMA _afterwards_. Will experiment with that, separately. > > > > > > > > > > > > But looks like the patch is good as it is. > > > > > > > > > > > > (if you notice something wrong with the repro, etc. do let me know!) > > > > > > > > > > > > [0]: https://gist.github.com/lorenzo-stoakes/a11a10f5f479e7a977fc456331266e0e > > > > > > > > > > I think your test case is fine, but... no, this is still not expected. The > > > > > vma should just merge. > > > > > > > > > > So I have another closer look on this specific issue, it seems we have a > > > > > long standing bug on pgoff calculation here when passing that to > > > > > vma_merge(). I've got another patch attached to show what I meant. > > > > > > > > > > To summarize.. now I've got two patches attached: > > > > > > > > > > Patch 1 is something I'd like to propose to replace this patch that fixes > > > > > incorrect use of vma_merge() so it should also eliminate the assertion > > > > > being triggered (I still think this is a regression but I need to check > > > > > which I will do later; I'm not super familiar with maple tree work, maybe > > > > > you or Liam can quickly spot). > > > > > > > > > > Patch 2 fixes the long standing issue of vma not being able to merge in > > > > > above case, and with patch 2 applied it should start merging all right. > > > > > > > > > > Please have a look, thanks. > > > > > > > > > > ---8<--- > > > > > From 6bc39028bba246394bb0bafdaeaab7b8dfd1694f Mon Sep 17 00:00:00 2001 > > > > > From: Peter Xu <peterx@xxxxxxxxxx> > > > > > Date: Tue, 16 May 2023 09:03:22 -0400 > > > > > Subject: [PATCH 1/2] mm/uffd: Fix vma operation where start addr cuts part of > > > > > vma > > > > > > > > > > It seems vma merging with uffd paths is broken with either > > > > > register/unregister, where right now we can feed wrong parameters to > > > > > vma_merge() and it's found by recent patch which moved asserts upwards in > > > > > vma_merge(): > > > > > > > > > > https://lore.kernel.org/all/ZFunF7DmMdK05MoF@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/ > > > > > > > > > > The problem is in the current code base we didn't fixup "prev" for the case > > > > > where "start" address can be within the "prev" vma section. In that case > > > > > we should have "prev" points to the current vma rather than the previous > > > > > one when feeding to vma_merge(). > > > > > > > > > > This will eliminate the report and make sure vma_merge() calls will become > > > > > legal again. > > > > > > > > > > Signed-off-by: Peter Xu <peterx@xxxxxxxxxx> > > > > > --- > > > > > fs/userfaultfd.c | 27 +++++++++++++++++++++------ > > > > > 1 file changed, 21 insertions(+), 6 deletions(-) > > > > > > > > > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c > > > > > index 0fd96d6e39ce..7eb88bc74d00 100644 > > > > > --- a/fs/userfaultfd.c > > > > > +++ b/fs/userfaultfd.c > > > > > @@ -1458,10 +1458,17 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx, > > > > > BUG_ON(!found); > > > > > > > > > > vma_iter_set(&vmi, start); > > > > > - prev = vma_prev(&vmi); > > > > > + vma = vma_find(&vmi, end); > > > > > + if (!vma) > > > > > + goto out_unlock; > > > > > + > > > > > + if (vma->vm_start < start) > > > > > + prev = vma; > > > > > + else > > > > > + prev = vma_prev(&vmi); > > > > > > > > > > ret = 0; > > > > > - for_each_vma_range(vmi, vma, end) { > > > > > + do { > > > > > cond_resched(); > > > > > > > > > > BUG_ON(!vma_can_userfault(vma, vm_flags)); > > > > > @@ -1517,7 +1524,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx, > > > > > skip: > > > > > prev = vma; > > > > > start = vma->vm_end; > > > > > - } > > > > > + } for_each_vma_range(vmi, vma, end); > > > > > > > > > > out_unlock: > > > > > mmap_write_unlock(mm); > > > > > @@ -1624,9 +1631,17 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx, > > > > > BUG_ON(!found); > > > > > > > > > > vma_iter_set(&vmi, start); > > > > > - prev = vma_prev(&vmi); > > > > > + vma = vma_find(&vmi, end); > > > > > + if (!vma) > > > > > + goto out_unlock; > > > > > + > > > > > + if (vma->vm_start < start) > > > > > + prev = vma; > > > > > + else > > > > > + prev = vma_prev(&vmi); > > > > > + > > > > > ret = 0; > > > > > - for_each_vma_range(vmi, vma, end) { > > > > > + do { > > > > > cond_resched(); > > > > > > > > > > BUG_ON(!vma_can_userfault(vma, vma->vm_flags)); > > > > > @@ -1692,7 +1707,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx, > > > > > skip: > > > > > prev = vma; > > > > > start = vma->vm_end; > > > > > - } > > > > > + } for_each_vma_range(vmi, vma, end); > > > > > > > > > > out_unlock: > > > > > mmap_write_unlock(mm); > > > > > -- > > > > > 2.39.1 > > > > > > > > > > From bf61f3c937e9e2ab96ab2bed0005cb7dc74db231 Mon Sep 17 00:00:00 2001 > > > > > From: Peter Xu <peterx@xxxxxxxxxx> > > > > > Date: Tue, 16 May 2023 09:39:38 -0400 > > > > > Subject: [PATCH 2/2] mm/uffd: Allow vma to merge as much as possible > > > > > > > > > > We used to not pass in the pgoff correctly when register/unregister uffd > > > > > regions, it caused incorrect behavior on vma merging. > > > > > > > > > > For example, when we have: > > > > >` > > > > > vma1(range 0-9, with uffd), vma2(range 10-19, no uffd) > > > > > > > > > > Then someone unregisters uffd on range (5-9), it should become: > > > > > > > > > > vma1(range 0-4, with uffd), vma2(range 5-19, no uffd) > > > > > > > > > > But with current code it's: > > > > > > > > > > vma1(range 0-4, with uffd), vma3(range 5-9, no uffd), vma2(range 10-19, no uffd) > > > > > > > > > > This patch allows such merge to happen correctly. > > > > > > > > > > This behavior seems to have existed since the 1st day of uffd, keep it just > > > > > as a performance optmization and not copy stable. > > > > > > > > > > Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx> > > > > > Cc: Mike Rapoport (IBM) <rppt@xxxxxxxxxx> > > > > > Signed-off-by: Peter Xu <peterx@xxxxxxxxxx> > > > > > --- > > > > > fs/userfaultfd.c | 8 ++++++-- > > > > > 1 file changed, 6 insertions(+), 2 deletions(-) > > > > > > > > > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c > > > > > index 7eb88bc74d00..891048b4799f 100644 > > > > > --- a/fs/userfaultfd.c > > > > > +++ b/fs/userfaultfd.c > > > > > @@ -1332,6 +1332,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx, > > > > > bool basic_ioctls; > > > > > unsigned long start, end, vma_end; > > > > > struct vma_iterator vmi; > > > > > + pgoff_t pgoff; > > > > > > > > > > user_uffdio_register = (struct uffdio_register __user *) arg; > > > > > > > > > > @@ -1489,8 +1490,9 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx, > > > > > vma_end = min(end, vma->vm_end); > > > > > > > > > > new_flags = (vma->vm_flags & ~__VM_UFFD_FLAGS) | vm_flags; > > > > > + pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT); > > > > > prev = vma_merge(&vmi, mm, prev, start, vma_end, new_flags, > > > > > - vma->anon_vma, vma->vm_file, vma->vm_pgoff, > > > > > + vma->anon_vma, vma->vm_file, pgoff, > > > > > vma_policy(vma), > > > > > ((struct vm_userfaultfd_ctx){ ctx }), > > > > > anon_vma_name(vma)); > > > > > @@ -1570,6 +1572,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx, > > > > > unsigned long start, end, vma_end; > > > > > const void __user *buf = (void __user *)arg; > > > > > struct vma_iterator vmi; > > > > > + pgoff_t pgoff; > > > > > > > > > > ret = -EFAULT; > > > > > if (copy_from_user(&uffdio_unregister, buf, sizeof(uffdio_unregister))) > > > > > @@ -1677,8 +1680,9 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx, > > > > > uffd_wp_range(vma, start, vma_end - start, false); > > > > > > > > > > new_flags = vma->vm_flags & ~__VM_UFFD_FLAGS; > > > > > + pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT); > > > > > prev = vma_merge(&vmi, mm, prev, start, vma_end, new_flags, > > > > > - vma->anon_vma, vma->vm_file, vma->vm_pgoff, > > > > > + vma->anon_vma, vma->vm_file, pgoff, > > > > > vma_policy(vma), > > > > > NULL_VM_UFFD_CTX, anon_vma_name(vma)); > > > > > if (prev) { > > > > > -- > > > > > 2.39.1 > > > > > ---8<--- > > > > > > > > > > -- > > > > > Peter Xu > > > > > > > > > > > > > > > -- > > > Peter Xu > > > > > > > > > > -- > Peter Xu >