Re: [PATCH v2] mm: userfaultfd: avoid passing an invalid range to vma_merge()

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



* Peter Xu <peterx@xxxxxxxxxx> [230516 17:40]:
> On Tue, May 16, 2023 at 10:01:06PM +0100, Lorenzo Stoakes wrote:
> > On Tue, May 16, 2023 at 04:30:11PM -0400, Peter Xu wrote:
> > > On Tue, May 16, 2023 at 08:25:13PM +0100, Lorenzo Stoakes wrote:
> > > > On Tue, May 16, 2023 at 11:06:40AM -0400, Peter Xu wrote:
> > > > > On Tue, May 16, 2023 at 12:07:11AM +0100, Lorenzo Stoakes wrote:
> > > > > > On Mon, May 15, 2023 at 11:04:27PM +0100, Lorenzo Stoakes wrote:
> > > > > > [snip]
> > > > > > > > Could you explain a bit why we don't need to merge in this case?
> > > > > > > >
> > > > > > > > I'm considering, for example, when we have:
> > > > > > > >
> > > > > > > >   vma1(range 0-9, with uffd), vma2(range 10-19, no uffd)
> > > > > > > >
> > > > > > > > Then someone unregisters uffd on range (5-9), iiuc it should become:
> > > > > > > >
> > > > > > > >   vma1(range 0-4, with uffd), vma2(range 5-19, no uffd)
> > > > > > > >
> > > > > > > > But if no merge here it's:
> > > > > > > >
> > > > > > > >   vma1(range 0-4, with uffd), vma3(range 5-9, no uffd), vma2(range 10-19, no uffd)
> > > > > > > >
> > > > > > > > Maybe I missed something?
> > > > > > > >
> > > > > > >
> > > > > > > There's something really, really wrong with this. It simply isn't valid to
> > > > > > > invoke vma_merge() over an existing VMA that != prev where you're not
> > > > > > > specifying addr = vma->vm_start, end == vma->vm_end.
> > > > > > >
> > > > > > > This seems like you're relying on:-
> > > > > > >
> > > > > > >   ***
> > > > > > > CCCCCNNNNN -> CCNNNNNNNN
> > > > >
> > > > > I had a closer look today, I still think this patch is not really the right
> > > > > one.  The split/merge order is something we use everywhere and I am not
> > > > > convinced it must change as drastic.  At least so far it still seems to me
> > > > > if we do with what current patch proposed we can have vma fragmentations.
> > > >
> > > > 'something we use everywhere' is not an argument (speak to Willy about
> > > > folios), vma_merge() expects valid input, relying on it _happening_ to be
> > >
> > > I still think it's an argument.
> > >
> > > I believe Matthew tried hard to justify he's correct in that aspect when
> > > changing "something we used everywhere". Blindly referencing it doesn't
> > > help much on a technical discussion.
> > >
> > > If we have similar code that handles similar things, IMHO you need a reason
> > > to modify one of them to not like the other.
> > >
> > > If you think what you proposed is better, please consider (1) justify why
> > > it's better, and then (2) also apply it to all the rest places where
> > > applicable.  Please refer to my reply to Liam on the other use cases of
> > > vma_merge().
> > >
> > > Said that, I apprecaite a lot on your assert patch that found this problem.
> > >
> > > > ok or to fail in ways that _happen_ not to cause big problems is not right.
> > > >
> > > > This is just further evidence that the vma_merge() interface is
> > > > fundamentally broken. Implicitly assuming you will only get a partial prev
> > > > overlap merge is far from intuitive.
> > > >
> > > > I am definitely going to try to do a series addressing vma_merge() horrors
> > > > because I feel like we need a generic means of doing this split/merge pattern.
> > >
> > > It'll be great if you can make it better.
> > >
> > > >
> > > > >
> > > > > I think I see what you meant, but here I think it's a legal case where we
> > > > > should have PPPP rather than CCCC (PPPPPNNNN --> PPNNNNNNNN).
> > > > >
> > > > > To be explicit, for register I think it _should_ be the case 0 where we
> > > > > cannot merge (note: the current code is indeed wrong though, see below):
> > > > >
> > > > >    ****
> > > > >   PPPPPPNNNNNN
> > > > >   cannot merge
> > > > >
> > > > > While for the unregister case here it's case 4:
> > > > >
> > > > >     ****
> > > > >   PPPPPPNNNNNN
> > > > >   might become
> > > > >   PPNNNNNNNNNN
> > > > >   case 4 below
> > > > >
> > > > > Here the problem is not that we need to do split then merge (I think it'll
> > > > > have the problem of vma defragmentation here), the problem is we simply
> > > > > passed in the wrong "prev" vma pointer, IMHO.  I've patches attached
> > > > > showing what I meant.
> > > >
> > > > Yeah if you do it with prev = vma this form should _probably_ work, that's
> > > > a good point. This _is_ a case (see https://ljs.io/vma_merge_cases.png for
> > > > nice diagram of cases btw :), like 5, where we actually do split and merge
> > > > at the same time.
> > >
> > > It's not something I came up with myself, it's just that I started looking
> > > back to see what people did and trying to understand why they did it.
> > > Normally people did things with good reasons.
> > >
> > 
> > Yup, it's an iffy pattern in each case. I can see why you chose to do it,
> > it's not a criticism of that, it's just that the incorrect (but
> > accidentally ok to be incorrect seemingly) input is made more obvious by
> > recent changes.
> > 
> > > So far it seems clearer at least to me to keep this pattern of "merge then
> > > split".  But I'm happy to be proven wrong anytime.
> > 
> > But you're not, you're doing a vma_merge() and (without it being clear)
> > hoping it does a merge AND SPLIT in case in an instance where that might be
> > required followed by you doing any further splits required.
> 
> I don't quite get the reason that you seem to keep saying this is
> "incorrect input" to vma_merge().  vma_merge() definitely includes splits
> too where it can, to be explicit, for case 4 & 5.
> 
> It seems to me what you're trying to explain is we shouldn't handle any
> split in vma_merge() so we should move cases 4 & 5 out of vma_merge().  If
> we split first then merge, cases 4 & 5 will become case 2 & 3 after split.

We don't split in case 4 or 5 - we adjust the existing VMA limits.  We
don't actually handle any splits in vma_merge().  I think splitting
first would change 4 & 5 to 7 & 8?  2 & 3 would require a split and
munmap, right?

> My question would be: if it worked perfect in the past few years and it
> looks all good enough, why bother..

I suspect, but it's not clear (like all of this), that the other
arguments to vma_merge() is ruling out this potential hazard I thought
existed.

> 
> > 
> > But it's _not your fault_ that this is the standard approach, it's the
> > interface that's wrong absolutely.
> > 
> > To me doing this is quite a bit less clear than simply 'splitting so this
> > is mergable first then try to merge' but obviously this current patch does
> > not avoid the previously introduced fragmentation. However it does maintain
> > existing behaviour.
> > 
> > >
> > > >
> > > > Liam's raised some issues with the safety of your patches, let me look at
> > > > them when I get a chance, nasty headcold = brain less functional atm.
> > >
> > > Sure, no need to rush.
> > >
> > > >
> > > > I would say for now this fix resolves the issue in a way that should
> > > > emphatically avoid invalid input to vma_merge(), the fragmentation existed
> > > > before so this is not a new issue, so for the time being I think it's ok to
> > > > stay as-is.
> > >
> > > So far I would still suggest having uffd code the same as the rest if
> > > possible.
> > >
> > 
> > Absolutely, but in the immediate term, we are seeing VM_WARN_ON()'s here
> > and not from other callers (which is actually surprising).
> 
> Not surprising anymore to me, simply because the uffd path was the only one
> got overlooked when converting to maple in commit 69dbe6daf104, as Liam
> rightfully pointed out.
> 
> For example, mprotect() has similar handling when the start addr of the
> range can be in the middle of a vma, then in do_mprotect_pkey() there is:
> 
> 	prev = vma_prev(&vmi);
> 	if (start > vma->vm_start)
> 		prev = vma;
> 
> IMHO it's the same thing.
> 
> > 
> > Again, we absolutely do need an abstracted solution for the whole. And
> > that's something I'll try to work on!
> > 
> > > I think I'll wait for the other discussion to settle on patch 2 to see
> > > whether I should post a formal patchset.
> > 
> > My suggestion is that it's ok to proceed as-is, not because this is the
> > final change that should be applied, but rather because it resolves the
> > issue while maintaining existing behaviour.
> > 
> > I will be more than happy to see patches land after this one that replace
> > it if necessary but I think it's safe for this to land as a mainline fixup,
> > even if temporary, is all I am saying :)
> 
> I'd hope we can replace this patch with my patch 1 if possible because I
> _think_ this patch is still in hotfixes-unstable (while patch 2 doesn't
> need to copy stable in all cases).  Andrew may know better on how to
> proceed.
> 
> If this lands first, I'll probably propose a full revert otherwise as the
> 1st patch of the patchset to post, then make uffd the same as others.
> 
> Before that I'd like to know whether you agree that the new patch 1 (I'll
> fixup the vma_prev() side effect) could be a better solution than the
> current one, no matter whether we need a full revert or not.
> 
> Thanks,
> 
> > 
> > >
> > > >
> > > > >
> > > > > I checked the original commit from Andrea and I found that it _was_ correct:
> > > > >
> > > > > commit 86039bd3b4e6a1129318cbfed4e0a6e001656635
> > > > > Author: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> > > > > Date:   Fri Sep 4 15:46:31 2015 -0700
> > > > >
> > > > >     userfaultfd: add new syscall to provide memory externalization
> > > > >
> > > > > I had a feeling that it's broken during the recent rework on vma (or maybe
> > > > > even not that close), but I'm not yet sure and need to further check.
> > > > >
> > > > > > >
> > > > > > > By specifying parameters that are compatible with N even though you're only
> > > > > > > partially spanning C?
> > > > > > >
> > > > > > > This is crazy, and isn't how this should be used. vma_merge() is not
> > > > > > > supposed to do partial merges. If it works (presumably it does) this is not
> > > > > > > by design unless I've lost my mind and I (and others) have somehow not
> > > > > > > noticed this??
> > > > > > >
> > > > > > > I think you're right that now we'll end up with more fragmentation, but
> > > > > > > what you're suggesting is not how vma_merge() is supposed to work.
> > > > > > >
> > > > > > > As I said above, giving vma_merge() invalid parameters is very dangerous as
> > > > > > > you could end up merging over empty ranges in theory (and could otherwise
> > > > > > > have corruption).
> > > > > > >
> > > > > > > I guess we should probably be passing 0 to the last parameter in
> > > > > > > split_vma() here then to ensure we do a merge pass too. Will experiment
> > > > > > > with this.
> > > > > > >
> > > > > > > I'm confused as to how the remove from case 8 is not proceeding. I'll look
> > > > > > > into this some more...
> > > > > > >
> > > > > > > Happy to be corrected if I'm misconstruing this!
> > > > > > >
> > > > > >
> > > > > > OK, so I wrote a small program to do perform exactly this case [0] and it seems
> > > > > > that the outcome is the same before and after this patch - vma_merge() is
> > > > > > clearly rejecting the case 8 merge (phew!) and in both instances you end up with
> > > > > > 3 VMAs.
> > > > > >
> > > > > > So this patch doesn't change this behaviour and everything is as it was
> > > > > > before. Ideally we'd let it go for another pass, so maybe we should change the
> > > > > > split to add a new VMA _afterwards_. Will experiment with that, separately.
> > > > > >
> > > > > > But looks like the patch is good as it is.
> > > > > >
> > > > > > (if you notice something wrong with the repro, etc. do let me know!)
> > > > > >
> > > > > > [0]: https://gist.github.com/lorenzo-stoakes/a11a10f5f479e7a977fc456331266e0e
> > > > >
> > > > > I think your test case is fine, but... no, this is still not expected. The
> > > > > vma should just merge.
> > > > >
> > > > > So I have another closer look on this specific issue, it seems we have a
> > > > > long standing bug on pgoff calculation here when passing that to
> > > > > vma_merge().  I've got another patch attached to show what I meant.
> > > > >
> > > > > To summarize.. now I've got two patches attached:
> > > > >
> > > > > Patch 1 is something I'd like to propose to replace this patch that fixes
> > > > > incorrect use of vma_merge() so it should also eliminate the assertion
> > > > > being triggered (I still think this is a regression but I need to check
> > > > > which I will do later; I'm not super familiar with maple tree work, maybe
> > > > > you or Liam can quickly spot).
> > > > >
> > > > > Patch 2 fixes the long standing issue of vma not being able to merge in
> > > > > above case, and with patch 2 applied it should start merging all right.
> > > > >
> > > > > Please have a look, thanks.
> > > > >
> > > > > ---8<---
> > > > > From 6bc39028bba246394bb0bafdaeaab7b8dfd1694f Mon Sep 17 00:00:00 2001
> > > > > From: Peter Xu <peterx@xxxxxxxxxx>
> > > > > Date: Tue, 16 May 2023 09:03:22 -0400
> > > > > Subject: [PATCH 1/2] mm/uffd: Fix vma operation where start addr cuts part of
> > > > >  vma
> > > > >
> > > > > It seems vma merging with uffd paths is broken with either
> > > > > register/unregister, where right now we can feed wrong parameters to
> > > > > vma_merge() and it's found by recent patch which moved asserts upwards in
> > > > > vma_merge():
> > > > >
> > > > > https://lore.kernel.org/all/ZFunF7DmMdK05MoF@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
> > > > >
> > > > > The problem is in the current code base we didn't fixup "prev" for the case
> > > > > where "start" address can be within the "prev" vma section.  In that case
> > > > > we should have "prev" points to the current vma rather than the previous
> > > > > one when feeding to vma_merge().
> > > > >
> > > > > This will eliminate the report and make sure vma_merge() calls will become
> > > > > legal again.
> > > > >
> > > > > Signed-off-by: Peter Xu <peterx@xxxxxxxxxx>
> > > > > ---
> > > > >  fs/userfaultfd.c | 27 +++++++++++++++++++++------
> > > > >  1 file changed, 21 insertions(+), 6 deletions(-)
> > > > >
> > > > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > > > > index 0fd96d6e39ce..7eb88bc74d00 100644
> > > > > --- a/fs/userfaultfd.c
> > > > > +++ b/fs/userfaultfd.c
> > > > > @@ -1458,10 +1458,17 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > > > >  	BUG_ON(!found);
> > > > >
> > > > >  	vma_iter_set(&vmi, start);
> > > > > -	prev = vma_prev(&vmi);
> > > > > +	vma = vma_find(&vmi, end);
> > > > > +	if (!vma)
> > > > > +		goto out_unlock;
> > > > > +
> > > > > +	if (vma->vm_start < start)
> > > > > +		prev = vma;
> > > > > +	else
> > > > > +		prev = vma_prev(&vmi);
> > > > >
> > > > >  	ret = 0;
> > > > > -	for_each_vma_range(vmi, vma, end) {
> > > > > +	do {
> > > > >  		cond_resched();
> > > > >
> > > > >  		BUG_ON(!vma_can_userfault(vma, vm_flags));
> > > > > @@ -1517,7 +1524,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > > > >  	skip:
> > > > >  		prev = vma;
> > > > >  		start = vma->vm_end;
> > > > > -	}
> > > > > +	} for_each_vma_range(vmi, vma, end);
> > > > >
> > > > >  out_unlock:
> > > > >  	mmap_write_unlock(mm);
> > > > > @@ -1624,9 +1631,17 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
> > > > >  	BUG_ON(!found);
> > > > >
> > > > >  	vma_iter_set(&vmi, start);
> > > > > -	prev = vma_prev(&vmi);
> > > > > +	vma = vma_find(&vmi, end);
> > > > > +	if (!vma)
> > > > > +		goto out_unlock;
> > > > > +
> > > > > +	if (vma->vm_start < start)
> > > > > +		prev = vma;
> > > > > +	else
> > > > > +		prev = vma_prev(&vmi);
> > > > > +
> > > > >  	ret = 0;
> > > > > -	for_each_vma_range(vmi, vma, end) {
> > > > > +	do {
> > > > >  		cond_resched();
> > > > >
> > > > >  		BUG_ON(!vma_can_userfault(vma, vma->vm_flags));
> > > > > @@ -1692,7 +1707,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
> > > > >  	skip:
> > > > >  		prev = vma;
> > > > >  		start = vma->vm_end;
> > > > > -	}
> > > > > +	} for_each_vma_range(vmi, vma, end);
> > > > >
> > > > >  out_unlock:
> > > > >  	mmap_write_unlock(mm);
> > > > > --
> > > > > 2.39.1
> > > > >
> > > > > From bf61f3c937e9e2ab96ab2bed0005cb7dc74db231 Mon Sep 17 00:00:00 2001
> > > > > From: Peter Xu <peterx@xxxxxxxxxx>
> > > > > Date: Tue, 16 May 2023 09:39:38 -0400
> > > > > Subject: [PATCH 2/2] mm/uffd: Allow vma to merge as much as possible
> > > > >
> > > > > We used to not pass in the pgoff correctly when register/unregister uffd
> > > > > regions, it caused incorrect behavior on vma merging.
> > > > >
> > > > > For example, when we have:
> > > > >`
> > > > >   vma1(range 0-9, with uffd), vma2(range 10-19, no uffd)
> > > > >
> > > > > Then someone unregisters uffd on range (5-9), it should become:
> > > > >
> > > > >   vma1(range 0-4, with uffd), vma2(range 5-19, no uffd)
> > > > >
> > > > > But with current code it's:
> > > > >
> > > > >   vma1(range 0-4, with uffd), vma3(range 5-9, no uffd), vma2(range 10-19, no uffd)
> > > > >
> > > > > This patch allows such merge to happen correctly.
> > > > >
> > > > > This behavior seems to have existed since the 1st day of uffd, keep it just
> > > > > as a performance optmization and not copy stable.
> > > > >
> > > > > Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> > > > > Cc: Mike Rapoport (IBM) <rppt@xxxxxxxxxx>
> > > > > Signed-off-by: Peter Xu <peterx@xxxxxxxxxx>
> > > > > ---
> > > > >  fs/userfaultfd.c | 8 ++++++--
> > > > >  1 file changed, 6 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > > > > index 7eb88bc74d00..891048b4799f 100644
> > > > > --- a/fs/userfaultfd.c
> > > > > +++ b/fs/userfaultfd.c
> > > > > @@ -1332,6 +1332,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > > > >  	bool basic_ioctls;
> > > > >  	unsigned long start, end, vma_end;
> > > > >  	struct vma_iterator vmi;
> > > > > +	pgoff_t pgoff;
> > > > >
> > > > >  	user_uffdio_register = (struct uffdio_register __user *) arg;
> > > > >
> > > > > @@ -1489,8 +1490,9 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> > > > >  		vma_end = min(end, vma->vm_end);
> > > > >
> > > > >  		new_flags = (vma->vm_flags & ~__VM_UFFD_FLAGS) | vm_flags;
> > > > > +		pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> > > > >  		prev = vma_merge(&vmi, mm, prev, start, vma_end, new_flags,
> > > > > -				 vma->anon_vma, vma->vm_file, vma->vm_pgoff,
> > > > > +				 vma->anon_vma, vma->vm_file, pgoff,
> > > > >  				 vma_policy(vma),
> > > > >  				 ((struct vm_userfaultfd_ctx){ ctx }),
> > > > >  				 anon_vma_name(vma));
> > > > > @@ -1570,6 +1572,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
> > > > >  	unsigned long start, end, vma_end;
> > > > >  	const void __user *buf = (void __user *)arg;
> > > > >  	struct vma_iterator vmi;
> > > > > +	pgoff_t pgoff;
> > > > >
> > > > >  	ret = -EFAULT;
> > > > >  	if (copy_from_user(&uffdio_unregister, buf, sizeof(uffdio_unregister)))
> > > > > @@ -1677,8 +1680,9 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
> > > > >  			uffd_wp_range(vma, start, vma_end - start, false);
> > > > >
> > > > >  		new_flags = vma->vm_flags & ~__VM_UFFD_FLAGS;
> > > > > +		pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> > > > >  		prev = vma_merge(&vmi, mm, prev, start, vma_end, new_flags,
> > > > > -				 vma->anon_vma, vma->vm_file, vma->vm_pgoff,
> > > > > +				 vma->anon_vma, vma->vm_file, pgoff,
> > > > >  				 vma_policy(vma),
> > > > >  				 NULL_VM_UFFD_CTX, anon_vma_name(vma));
> > > > >  		if (prev) {
> > > > > --
> > > > > 2.39.1
> > > > > ---8<---
> > > > >
> > > > > --
> > > > > Peter Xu
> > > > >
> > > >
> > >
> > > --
> > > Peter Xu
> > >
> > >
> > 
> 
> -- 
> Peter Xu
> 




[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux