Re: [PATCH 2/3] hugetlbfs: close race between MADV_DONTNEED and page fault

Mike Kravetz <mike.kravetz@xxxxxxxxxx> · Thu, 5 Oct 2023 15:59:39 -0700

On 10/05/23 09:23, Rik van Riel wrote:
> On Wed, 2023-10-04 at 20:19 -0700, Mike Kravetz wrote:
> > On 10/03/23 23:25, riel@xxxxxxxxxxx wrote:
> > > 
> > > @@ -5457,11 +5460,12 @@ void __unmap_hugepage_range_final(struct
> > > mmu_gather *tlb,
> > >                  * someone else.
> > >                  */
> > >                 __hugetlb_vma_unlock_write_free(vma);
> > > -               i_mmap_unlock_write(vma->vm_file->f_mapping);
> > >         } else {
> > > -               i_mmap_unlock_write(vma->vm_file->f_mapping);
> > >                 hugetlb_vma_unlock_write(vma);
> > >         }
> > > +
> > > +       if (vma->vm_file)
> > > +               i_mmap_unlock_write(vma->vm_file->f_mapping);
> > >  }
> > 
> > In the case of a mmap(hugetlbfs_file_mmap) error, the per-vma hugetlb
> > lock will not be setup.  The hugetlb_vma_lock/unlock routines do not
> > check for this as they were previously always called after the lock
> > was
> > set up.  So, we can now get:
> 
> Wait, the hugetlb_vma_(un)lock_{read,write} functions do
> have checks for the presence of the lock:
> 
> void hugetlb_vma_lock_read(struct vm_area_struct *vma)
> {
>         if (__vma_shareable_lock(vma)) {
>                 struct hugetlb_vma_lock *vma_lock = vma-
> >vm_private_data;
> 
>                 down_read(&vma_lock->rw_sema);
>         } else if (__vma_private_lock(vma)) {
>                 struct resv_map *resv_map = vma_resv_map(vma);
> 
>                 down_read(&resv_map->rw_sema);
>         }
> }
> 
> Both __vma_shareable_lock and __vma_private_lock check that
> vma->vm_private_data points at something.
> 
> Exactly what corner case am I missing here?
> 
> What leaves vma->vm_private_data pointing at something
> invalid?

You are correct.  The checks in hugetlb_vma_(un)lock_{read,write} functions
should be sufficient.

Here is the corner case that needs to be addressed:

mmap
	hugetlbfs_file_mmap
		hugetlb_reserve_pages {
			!VM_MAYSHARE so
			resv_map = resv_map_alloc();
			set_vma_resv_map(vma, resv_map);
			set_vma_resv_flags(vma, HPAGE_RESV_OWNER);
			...
			some error in hugetlb_reserve_pages,
			goto out_err
			out_error:
			if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
				kref_put(&resv_map->refs, resv_map_release);
			return false;

Note that we free resv_map but do not clear vm_private_data.  So, at
unmap time we try to use the lock in the freed resv_map.

Leaving the dangling pointer to a freed structure is BAD.  However, no
code accessed the freed structure until this patch.

I would suggest incorporating this into this patch, or even as a stand
alone patch before this series.

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d25db18c9526..a3ae13d0f8fe 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1171,8 +1171,7 @@ static void set_vma_resv_map(struct vm_area_struct *vma, struct resv_map *map)
 	VM_BUG_ON_VMA(!is_vm_hugetlb_page(vma), vma);
 	VM_BUG_ON_VMA(vma->vm_flags & VM_MAYSHARE, vma);
 
-	set_vma_private_data(vma, (get_vma_private_data(vma) &
-				HPAGE_RESV_MASK) | (unsigned long)map);
+	set_vma_private_data(vma, (unsigned long)map);
 }
 
 static void set_vma_resv_flags(struct vm_area_struct *vma, unsigned long flags)
@@ -6910,8 +6909,10 @@ bool hugetlb_reserve_pages(struct inode *inode,
 		 */
 		if (chg >= 0 && add < 0)
 			region_abort(resv_map, from, to, regions_needed);
-	if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
+	if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
 		kref_put(&resv_map->refs, resv_map_release);
+		set_vma_resv_map(vma, NULL);
+	}
 	return false;
 }

At one time set_vma_resv_map must have wanted to preserve flags.
However, that is no longer the case so we can remove that code.

Of course, feel free to address some other way if you would like.
 

> > +++ b/mm/hugetlb.c
> > @@ -5503,10 +5503,12 @@ void __unmap_hugepage_range(struct mmu_gather
> > *tlb, struct vm_area_struct *vma,
> >  void __hugetlb_zap_begin(struct vm_area_struct *vma,
> >                          unsigned long *start, unsigned long *end)
> >  {
> > +       if (!vma->vm_file)      /* hugetlbfs_file_mmap error */
> > +               return;
> > +
> 
> This does not seem quite correct, because the locking is needed to
> avoid the race between MADV_DONTNEED and the page fault path.
> 
> 

Note that vma->vm_file is always set for hugetlb vmas except in the mmap
error case.  The first line of code in hugetlb_fault is:

	mapping = vma->vm_file->f_mapping;

So, bailing out early like this should not be an issue.


> > Another way to resolve would be to fix up the hugetlb_vma_lock/unlock
> > routines
> > to check for and handle a null lock.
> 
> I thought I had that already. 
> 

You did.  My bad!  I saw the NULL deref and jumped to conclusions.

> Does __vma_shareable_lock need to check for !vma->vm_file ?

I do not think there is a need.  It has the check for vma->vm_private_data
which should be sufficient.  We only got in trouble in the !VM_MAYSHARE
case because of the stale/invalid pointer.

However, I do still think we could have that early return in the case of
!vma->vm_file I suggested earlier.  There is no need to take the locks
in this case we will not be calling the hugetlb unmap code.
-- 
Mike Kravetz