Re: [PATCH hotfix 6.12 4/8] mm: resolve faulty mmap_region() error path behaviour

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Oct 23, 2024 at 10:20:50AM -0400, Liam R. Howlett wrote:
> * Vlastimil Babka <vbabka@xxxxxxx> [241023 08:59]:
> > On 10/22/24 22:40, Lorenzo Stoakes wrote:
> > > The mmap_region() function is somewhat terrifying, with spaghetti-like
> > > control flow and numerous means by which issues can arise and incomplete
> > > state, memory leaks and other unpleasantness can occur.
> > >
> > > A large amount of the complexity arises from trying to handle errors late
> > > in the process of mapping a VMA, which forms the basis of recently observed
> > > issues with resource leaks and observable inconsistent state.
> > >
> > > Taking advantage of previous patches in this series we move a number of
> > > checks earlier in the code, simplifying things by moving the core of the
> > > logic into a static internal function __mmap_region().
> > >
> > > Doing this allows us to perform a number of checks up front before we do
> > > any real work, and allows us to unwind the writable unmap check
> > > unconditionally as required and to perform a CONFIG_DEBUG_VM_MAPLE_TREE
> > > validation unconditionally also.
> > >
> > > We move a number of things here:
> > >
> > > 1. We preallocate memory for the iterator before we call the file-backed
> > >    memory hook, allowing us to exit early and avoid having to perform
> > >    complicated and error-prone close/free logic. We carefully free
> > >    iterator state on both success and error paths.
> > >
> > > 2. The enclosing mmap_region() function handles the mapping_map_writable()
> > >    logic early. Previously the logic had the mapping_map_writable() at the
> > >    point of mapping a newly allocated file-backed VMA, and a matching
> > >    mapping_unmap_writable() on success and error paths.
> > >
> > >    We now do this unconditionally if this is a file-backed, shared writable
> > >    mapping. If a driver changes the flags to eliminate VM_MAYWRITE, however
> > >    doing so does not invalidate the seal check we just performed, and we in
> > >    any case always decrement the counter in the wrapper.
> > >
> > >    We perform a debug assert to ensure a driver does not attempt to do the
> > >    opposite.
> > >
> > > 3. We also move arch_validate_flags() up into the mmap_region()
> > >    function. This is only relevant on arm64 and sparc64, and the check is
> > >    only meaningful for SPARC with ADI enabled. We explicitly add a warning
> > >    for this arch if a driver invalidates this check, though the code ought
> > >    eventually to be fixed to eliminate the need for this.
> > >
> > > With all of these measures in place, we no longer need to explicitly close
> > > the VMA on error paths, as we place all checks which might fail prior to a
> > > call to any driver mmap hook.
> > >
> > > This eliminates an entire class of errors, makes the code easier to reason
> > > about and more robust.
> > >
> > > Reported-by: Jann Horn <jannh@xxxxxxxxxx>
> > > Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
> > > Cc: stable <stable@xxxxxxxxxx>
> > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>
> >
> > Reviewed-by: Vlastimil Babka <vbabka@xxxxxxx>

Thanks!

> >
> > some nits below
> >
> > > ---
> > >  mm/mmap.c | 120 ++++++++++++++++++++++++++++++------------------------
> > >  1 file changed, 66 insertions(+), 54 deletions(-)
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 66edf0ebba94..7d02b47a1895 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -1361,20 +1361,18 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
> > >  	return do_vmi_munmap(&vmi, mm, start, len, uf, false);
> > >  }
> > >
> > > -unsigned long mmap_region(struct file *file, unsigned long addr,
> > > +static unsigned long __mmap_region(struct file *file, unsigned long addr,
> > >  		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
> > >  		struct list_head *uf)
> > >  {
> > >  	struct mm_struct *mm = current->mm;
> > >  	struct vm_area_struct *vma = NULL;
> > >  	pgoff_t pglen = PHYS_PFN(len);
> > > -	struct vm_area_struct *merge;
> > >  	unsigned long charged = 0;
> > >  	struct vma_munmap_struct vms;
> > >  	struct ma_state mas_detach;
> > >  	struct maple_tree mt_detach;
> > >  	unsigned long end = addr + len;
> > > -	bool writable_file_mapping = false;
> > >  	int error;
> > >  	VMA_ITERATOR(vmi, mm, addr);
> > >  	VMG_STATE(vmg, mm, &vmi, addr, end, vm_flags, pgoff);
> > > @@ -1448,28 +1446,26 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > >  	vm_flags_init(vma, vm_flags);
> > >  	vma->vm_page_prot = vm_get_page_prot(vm_flags);
> > >
> > > +	if (vma_iter_prealloc(&vmi, vma)) {
> > > +		error = -ENOMEM;
> > > +		goto free_vma;
> > > +	}
> > > +
> > >  	if (file) {
> > >  		vma->vm_file = get_file(file);
> > >  		error = mmap_file(file, vma);
> > >  		if (error)
> > > -			goto unmap_and_free_vma;
> > > -
> > > -		if (vma_is_shared_maywrite(vma)) {
> > > -			error = mapping_map_writable(file->f_mapping);
> > > -			if (error)
> > > -				goto close_and_free_vma;
> > > -
> > > -			writable_file_mapping = true;
> > > -		}
> > > +			goto unmap_and_free_file_vma;
> > >
> > > +		/* Drivers cannot alter the address of the VMA. */
> > > +		WARN_ON_ONCE(addr != vma->vm_start);
> > >  		/*
> > > -		 * Expansion is handled above, merging is handled below.
> > > -		 * Drivers should not alter the address of the VMA.
> > > +		 * Drivers should not permit writability when previously it was
> > > +		 * disallowed.
> > >  		 */
> > > -		if (WARN_ON((addr != vma->vm_start))) {
> > > -			error = -EINVAL;
> > > -			goto close_and_free_vma;
> > > -		}
> > > +		VM_WARN_ON_ONCE(vm_flags != vma->vm_flags &&
> > > +				!(vm_flags & VM_MAYWRITE) &&
> > > +				(vma->vm_flags & VM_MAYWRITE));
> > >
> > >  		vma_iter_config(&vmi, addr, end);
> >
> > I wonder if this one could be removed, earlier above we did the same config
> > and neither parameters changed? But it was true before this patch as well,
> > and maybe it's further refactored away later in the series, just noting.
>
> Yes, this was here in case the vma changed address, so it's probably not
> necessary.

Hmm, but this was what we already did so I'd rather leave it in for now and
we can perhaps address it later?

>
> >
> > >  		/*
> > > @@ -1477,6 +1473,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > >  		 * vma again as we may succeed this time.
> > >  		 */
> > >  		if (unlikely(vm_flags != vma->vm_flags && vmg.prev)) {
> > > +			struct vm_area_struct *merge;
> > > +
> > >  			vmg.flags = vma->vm_flags;
> > >  			/* If this fails, state is reset ready for a reattempt. */
> > >  			merge = vma_merge_new_range(&vmg);
> > > @@ -1491,10 +1489,11 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > >  				 */
> > >  				fput(vma->vm_file);
> > >  				vm_area_free(vma);
> > > +				vma_iter_free(&vmi);
> >
> > If we merged successfully, I think this is not necessary? But doesn't hurt?
>
> Yes, it will use the allocations (and re-allocate more if necessary)
> then free the unused allocations once this call path reaches
> commit_merge() with the same vmi, which is nice.
>
> And yes, it is safe to do regardless.

I will remove if this isn't necessary actually, I did think it would be as
I thought maybe we'd preallocate _twice_ here otherwise? But nice that it
all gets cleaned up.

>
> To be honest, this whole block is so rare that I want to delete it
> anyways.

Yeah I mean I'm inclined to agree... but that last commit is somewhat
contentious it seems :)

>
> >
> > >  				vma = merge;
> > >  				/* Update vm_flags to pick up the change. */
> > >  				vm_flags = vma->vm_flags;
> > > -				goto unmap_writable;
> > > +				goto file_expanded;
> > >  			}
> > >  			vma_iter_config(&vmi, addr, end);
> > >  		}
> > > @@ -1503,26 +1502,15 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > >  	} else if (vm_flags & VM_SHARED) {
> > >  		error = shmem_zero_setup(vma);
> > >  		if (error)
> > > -			goto free_vma;
> > > +			goto free_iter_vma;
> > >  	} else {
> > >  		vma_set_anonymous(vma);
> > >  	}
> > >
> > > -	if (map_deny_write_exec(vma->vm_flags, vma->vm_flags)) {
> > > -		error = -EACCES;
> > > -		goto close_and_free_vma;
> > > -	}
> > > -
> > > -	/* Allow architectures to sanity-check the vm_flags */
> > > -	if (!arch_validate_flags(vma->vm_flags)) {
> > > -		error = -EINVAL;
> > > -		goto close_and_free_vma;
> > > -	}
> > > -
> > > -	if (vma_iter_prealloc(&vmi, vma)) {
> > > -		error = -ENOMEM;
> > > -		goto close_and_free_vma;
> > > -	}
> > > +#ifdef CONFIG_SPARC64
> > > +	/* TODO: Fix SPARC ADI! */
> > > +	WARN_ON_ONCE(!arch_validate_flags(vm_flags));
> > > +#endif
> > >
> > >  	/* Lock the VMA since it is modified after insertion into VMA tree */
> > >  	vma_start_write(vma);
> > > @@ -1536,10 +1524,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > >  	 */
> > >  	khugepaged_enter_vma(vma, vma->vm_flags);
> > >
> > > -	/* Once vma denies write, undo our temporary denial count */
> > > -unmap_writable:
> > > -	if (writable_file_mapping)
> > > -		mapping_unmap_writable(file->f_mapping);
> > > +file_expanded:
> > >  	file = vma->vm_file;
> > >  	ksm_add_vma(vma);
> > >  expanded:
> > > @@ -1572,23 +1557,17 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > >
> > >  	vma_set_page_prot(vma);
> > >
> > > -	validate_mm(mm);
> > >  	return addr;
> > >
> > > -close_and_free_vma:
> > > -	vma_close(vma);
> > > -
> > > -	if (file || vma->vm_file) {
> > > -unmap_and_free_vma:
> > > -		fput(vma->vm_file);
> > > -		vma->vm_file = NULL;
> > > +unmap_and_free_file_vma:
> > > +	fput(vma->vm_file);
> > > +	vma->vm_file = NULL;
> > >
> > > -		vma_iter_set(&vmi, vma->vm_end);
> > > -		/* Undo any partial mapping done by a device driver. */
> > > -		unmap_region(&vmi.mas, vma, vmg.prev, vmg.next);
> > > -	}
> > > -	if (writable_file_mapping)
> > > -		mapping_unmap_writable(file->f_mapping);
> > > +	vma_iter_set(&vmi, vma->vm_end);
> > > +	/* Undo any partial mapping done by a device driver. */
> > > +	unmap_region(&vmi.mas, vma, vmg.prev, vmg.next);
> > > +free_iter_vma:
> > > +	vma_iter_free(&vmi);
> > >  free_vma:
> > >  	vm_area_free(vma);
> > >  unacct_error:
> > > @@ -1598,10 +1577,43 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > >  abort_munmap:
> > >  	vms_abort_munmap_vmas(&vms, &mas_detach);
> > >  gather_failed:
> > > -	validate_mm(mm);
> > >  	return error;
> > >  }
> > >
> > > +unsigned long mmap_region(struct file *file, unsigned long addr,
> > > +			  unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
> > > +			  struct list_head *uf)
> > > +{
> > > +	unsigned long ret;
> > > +	bool writable_file_mapping = false;
> > > +
> > > +	/* Allow architectures to sanity-check the vm_flags. */
> > > +	if (!arch_validate_flags(vm_flags))
> > > +		return -EINVAL;
> > > +
> > > +	/* Check to see if MDWE is applicable. */
> > > +	if (map_deny_write_exec(vm_flags, vm_flags))
> > > +		return -EACCES;
> >
> > The two checks above used to be in the opposite order. Can we keep that just
> > to be sure we don't change user observable behavior unnecessarily?

Ack will do

> >
> > > +	/* Map writable and ensure this isn't a sealed memfd. */
> > > +	if (file && is_shared_maywrite(vm_flags)) {
> > > +		int error = mapping_map_writable(file->f_mapping);
> > > +
> > > +		if (error)
> > > +			return error;
> > > +		writable_file_mapping = true;
> > > +	}
> > > +
> > > +	ret = __mmap_region(file, addr, len, vm_flags, pgoff, uf);
> > > +
> > > +	/* Clear our write mapping regardless of error. */
> > > +	if (writable_file_mapping)
> > > +		mapping_unmap_writable(file->f_mapping);
> > > +
> > > +	validate_mm(current->mm);
> > > +	return ret;
> > > +}
> > > +
> > >  static int __vm_munmap(unsigned long start, size_t len, bool unlock)
> > >  {
> > >  	int ret;
> > > --
> > > 2.47.0
> >




[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux