Re: [PATCH v2] mm: userfaultfd: avoid passing an invalid range to vma_merge()

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



* Peter Xu <peterx@xxxxxxxxxx> [230516 11:06]:

...

> > > This seems like you're relying on:-
> > >
> > >   ***
> > > CCCCCNNNNN -> CCNNNNNNNN
> 
> I had a closer look today, I still think this patch is not really the right
> one.  The split/merge order is something we use everywhere and I am not
> convinced it must change as drastic.  At least so far it still seems to me
> if we do with what current patch proposed we can have vma fragmentations.
> 
> I think I see what you meant, but here I think it's a legal case where we
> should have PPPP rather than CCCC (PPPPPNNNN --> PPNNNNNNNN).
> 
> To be explicit, for register I think it _should_ be the case 0 where we
> cannot merge (note: the current code is indeed wrong though, see below):
> 
>    ****
>   PPPPPPNNNNNN
>   cannot merge
> 
> While for the unregister case here it's case 4:
> 
>     ****
>   PPPPPPNNNNNN
>   might become
>   PPNNNNNNNNNN
>   case 4 below
> 
> Here the problem is not that we need to do split then merge (I think it'll
> have the problem of vma defragmentation here), the problem is we simply
> passed in the wrong "prev" vma pointer, IMHO.  I've patches attached
> showing what I meant.
> 
> I checked the original commit from Andrea and I found that it _was_ correct:
> 
> commit 86039bd3b4e6a1129318cbfed4e0a6e001656635
> Author: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> Date:   Fri Sep 4 15:46:31 2015 -0700
> 
>     userfaultfd: add new syscall to provide memory externalization
> 
> I had a feeling that it's broken during the recent rework on vma (or maybe
> even not that close), but I'm not yet sure and need to further check.

I believe this was 69dbe6daf1041e32e003f966d71f70f20c63af53, which is my
work on this area.  Any patches will need to be backported to before
the vma iterator for 6.1

I think keeping the asserts and ensuring we are calling vma_merge() with
arguments that make sense is safer.

> 
> > >
> > > By specifying parameters that are compatible with N even though you're only
> > > partially spanning C?
> > >
> > > This is crazy, and isn't how this should be used. vma_merge() is not
> > > supposed to do partial merges. If it works (presumably it does) this is not
> > > by design unless I've lost my mind and I (and others) have somehow not
> > > noticed this??
> > >
> > > I think you're right that now we'll end up with more fragmentation, but
> > > what you're suggesting is not how vma_merge() is supposed to work.
> > >
> > > As I said above, giving vma_merge() invalid parameters is very dangerous as
> > > you could end up merging over empty ranges in theory (and could otherwise
> > > have corruption).
> > >
> > > I guess we should probably be passing 0 to the last parameter in
> > > split_vma() here then to ensure we do a merge pass too. Will experiment
> > > with this.
> > >
> > > I'm confused as to how the remove from case 8 is not proceeding. I'll look
> > > into this some more...
> > >
> > > Happy to be corrected if I'm misconstruing this!
> > >
> > 
> > OK, so I wrote a small program to do perform exactly this case [0] and it seems
> > that the outcome is the same before and after this patch - vma_merge() is
> > clearly rejecting the case 8 merge (phew!) and in both instances you end up with
> > 3 VMAs.
> > 
> > So this patch doesn't change this behaviour and everything is as it was
> > before. Ideally we'd let it go for another pass, so maybe we should change the
> > split to add a new VMA _afterwards_. Will experiment with that, separately.
> > 
> > But looks like the patch is good as it is.
> > 
> > (if you notice something wrong with the repro, etc. do let me know!)
> > 
> > [0]: https://gist.github.com/lorenzo-stoakes/a11a10f5f479e7a977fc456331266e0e
> 
> I think your test case is fine, but... no, this is still not expected. The
> vma should just merge.
> 
> So I have another closer look on this specific issue, it seems we have a
> long standing bug on pgoff calculation here when passing that to
> vma_merge().  I've got another patch attached to show what I meant.
> 
> To summarize.. now I've got two patches attached:
> 
> Patch 1 is something I'd like to propose to replace this patch that fixes
> incorrect use of vma_merge() so it should also eliminate the assertion
> being triggered (I still think this is a regression but I need to check
> which I will do later; I'm not super familiar with maple tree work, maybe
> you or Liam can quickly spot).
> 
> Patch 2 fixes the long standing issue of vma not being able to merge in
> above case, and with patch 2 applied it should start merging all right.
> 
> Please have a look, thanks.
> 
> ---8<---
> From 6bc39028bba246394bb0bafdaeaab7b8dfd1694f Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@xxxxxxxxxx>
> Date: Tue, 16 May 2023 09:03:22 -0400
> Subject: [PATCH 1/2] mm/uffd: Fix vma operation where start addr cuts part of
>  vma
> 
> It seems vma merging with uffd paths is broken with either
> register/unregister, where right now we can feed wrong parameters to
> vma_merge() and it's found by recent patch which moved asserts upwards in
> vma_merge():
> 
> https://lore.kernel.org/all/ZFunF7DmMdK05MoF@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
> 
> The problem is in the current code base we didn't fixup "prev" for the case
> where "start" address can be within the "prev" vma section.  In that case
> we should have "prev" points to the current vma rather than the previous
> one when feeding to vma_merge().
> 
> This will eliminate the report and make sure vma_merge() calls will become
> legal again.
> 
> Signed-off-by: Peter Xu <peterx@xxxxxxxxxx>
> ---
>  fs/userfaultfd.c | 27 +++++++++++++++++++++------
>  1 file changed, 21 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 0fd96d6e39ce..7eb88bc74d00 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1458,10 +1458,17 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
>  	BUG_ON(!found);
>  
>  	vma_iter_set(&vmi, start);
> -	prev = vma_prev(&vmi);
> +	vma = vma_find(&vmi, end);
> +	if (!vma)
> +		goto out_unlock;
> +
> +	if (vma->vm_start < start)
> +		prev = vma;
> +	else
> +		prev = vma_prev(&vmi);
>  
>  	ret = 0;
> -	for_each_vma_range(vmi, vma, end) {
> +	do {

The iterator may be off by one, depending on if vma_prev() is called or
not.

Perhaps:
	prev = vma_prev(&vmi); /* could be wrong, or null */
	vma = vma_find(&vmi, end);
	if (!vma)
		goto out_unlock;

	if (vma->vm_start < start)
		prev = vma;

now we know we are at vma with the iterator..
	ret = 0
	do{
	...


>  		cond_resched();
>  
>  		BUG_ON(!vma_can_userfault(vma, vm_flags));
> @@ -1517,7 +1524,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
>  	skip:
>  		prev = vma;
>  		start = vma->vm_end;
> -	}
> +	} for_each_vma_range(vmi, vma, end);
>  
>  out_unlock:
>  	mmap_write_unlock(mm);
> @@ -1624,9 +1631,17 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
>  	BUG_ON(!found);
>  
>  	vma_iter_set(&vmi, start);
> -	prev = vma_prev(&vmi);
> +	vma = vma_find(&vmi, end);
> +	if (!vma)
> +		goto out_unlock;
> +
> +	if (vma->vm_start < start)
> +		prev = vma;
> +	else
> +		prev = vma_prev(&vmi);
> +
>  	ret = 0;
> -	for_each_vma_range(vmi, vma, end) {
> +	do {
>  		cond_resched();
>  
>  		BUG_ON(!vma_can_userfault(vma, vma->vm_flags));
> @@ -1692,7 +1707,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
>  	skip:
>  		prev = vma;
>  		start = vma->vm_end;
> -	}
> +	} for_each_vma_range(vmi, vma, end);
>  
>  out_unlock:
>  	mmap_write_unlock(mm);
> -- 
> 2.39.1
> 
> From bf61f3c937e9e2ab96ab2bed0005cb7dc74db231 Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@xxxxxxxxxx>
> Date: Tue, 16 May 2023 09:39:38 -0400
> Subject: [PATCH 2/2] mm/uffd: Allow vma to merge as much as possible
> 
> We used to not pass in the pgoff correctly when register/unregister uffd
> regions, it caused incorrect behavior on vma merging.
> 
> For example, when we have:
> 
>   vma1(range 0-9, with uffd), vma2(range 10-19, no uffd)
> 
> Then someone unregisters uffd on range (5-9), it should become:
> 
>   vma1(range 0-4, with uffd), vma2(range 5-19, no uffd)
> 
> But with current code it's:
> 
>   vma1(range 0-4, with uffd), vma3(range 5-9, no uffd), vma2(range 10-19, no uffd)
> 
> This patch allows such merge to happen correctly.
> 
> This behavior seems to have existed since the 1st day of uffd, keep it just
> as a performance optmization and not copy stable.
> 
> Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> Cc: Mike Rapoport (IBM) <rppt@xxxxxxxxxx>
> Signed-off-by: Peter Xu <peterx@xxxxxxxxxx>
> ---
>  fs/userfaultfd.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 7eb88bc74d00..891048b4799f 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1332,6 +1332,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
>  	bool basic_ioctls;
>  	unsigned long start, end, vma_end;
>  	struct vma_iterator vmi;
> +	pgoff_t pgoff;
>  
>  	user_uffdio_register = (struct uffdio_register __user *) arg;
>  
> @@ -1489,8 +1490,9 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
>  		vma_end = min(end, vma->vm_end);
>  
>  		new_flags = (vma->vm_flags & ~__VM_UFFD_FLAGS) | vm_flags;
> +		pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);

I don't think this is safe.  You are telling vma_merge() something that
is not true and will result in can_vma_merge_before() passing.  I mean,
sure it will become true after you split (unless you can't?), but I
don't know if you can just merge a VMA that doesn't pass
can_vma_merge_before(), even for a short period?

>  		prev = vma_merge(&vmi, mm, prev, start, vma_end, new_flags,
> -				 vma->anon_vma, vma->vm_file, vma->vm_pgoff,
> +				 vma->anon_vma, vma->vm_file, pgoff,
>  				 vma_policy(vma),
>  				 ((struct vm_userfaultfd_ctx){ ctx }),
>  				 anon_vma_name(vma));
> @@ -1570,6 +1572,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
>  	unsigned long start, end, vma_end;
>  	const void __user *buf = (void __user *)arg;
>  	struct vma_iterator vmi;
> +	pgoff_t pgoff;
>  
>  	ret = -EFAULT;
>  	if (copy_from_user(&uffdio_unregister, buf, sizeof(uffdio_unregister)))
> @@ -1677,8 +1680,9 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
>  			uffd_wp_range(vma, start, vma_end - start, false);
>  
>  		new_flags = vma->vm_flags & ~__VM_UFFD_FLAGS;
> +		pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
>  		prev = vma_merge(&vmi, mm, prev, start, vma_end, new_flags,
> -				 vma->anon_vma, vma->vm_file, vma->vm_pgoff,
> +				 vma->anon_vma, vma->vm_file, pgoff,
>  				 vma_policy(vma),
>  				 NULL_VM_UFFD_CTX, anon_vma_name(vma));
>  		if (prev) {
> -- 
> 2.39.1
> ---8<---
> 
> -- 
> Peter Xu
> 




[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux