On Mon, Feb 03, 2025 at 08:52:06AM +0100, Ricardo Cañuelo Navarro wrote: > Add a sanity check to madvise_dontneed_free() to address a corner case > in madvise where a race condition causes the current vma being processed > to be backed by a different page size. > > During a madvise(MADV_DONTNEED) call on a memory region registered with > a userfaultfd, there's a period of time where the process mm lock is > temporarily released in order to send a UFFD_EVENT_REMOVE and let > userspace handle the event. During this time, the vma covering the > current address range may change due to an explicit mmap done > concurrently by another thread. > > If, after that change, the memory region, which was originally backed by > 4KB pages, is now backed by hugepages, the end address is rounded down > to a hugepage boundary to avoid data loss (see "Fixes" below). This > rounding may cause the end address to be truncated to the same address > as the start. > > Make this corner case follow the same semantics as in other similar > cases where the requested region has zero length (ie. return 0). > > This will make madvise_walk_vmas() continue to the next vma in the > range (this time holding the process mm lock) which, due to the prev > pointer becoming stale because of the vma change, will be the same > hugepage-backed vma that was just checked before. The next time > madvise_dontneed_free() runs for this vma, if the start address isn't > aligned to a hugepage boundary, it'll return -EINVAL, which is also in > line with the madvise api. > > From userspace perspective, madvise() will return EINVAL because the > start address isn't aligned according to the new vma alignment > requirements (hugepage), even though it was correctly page-aligned when > the call was issued. > > Fixes: 8ebe0a5eaaeb ("mm,madvise,hugetlb: fix unexpected data loss with MADV_DONTNEED on hugetlbfs") > Cc: stable@xxxxxxxxxxxxxxx > Signed-off-by: Ricardo Cañuelo Navarro <rcn@xxxxxxxxxx> Reviewed-by: Oscar Salvador <osalvador@xxxxxxx> > --- > Changes in v2: > - Added documentation in the code to tell the user how this situation > can happen. (Andrew) > --- > mm/madvise.c | 11 ++++++++++- > 1 file changed, 10 insertions(+), 1 deletion(-) > > diff --git a/mm/madvise.c b/mm/madvise.c > index 49f3a75046f6..08b207f8e61e 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -933,7 +933,16 @@ static long madvise_dontneed_free(struct vm_area_struct *vma, > */ > end = vma->vm_end; > } > - VM_WARN_ON(start >= end); > + /* > + * If the memory region between start and end was > + * originally backed by 4kB pages and then remapped to > + * be backed by hugepages while mmap_lock was dropped, > + * the adjustment for hugetlb vma above may have rounded > + * end down to the start address. > + */ > + if (start == end) > + return 0; > + VM_WARN_ON(start > end); The change itself looks fine to me, although I am wondering whether it would make more sense to place the check right after the call to madvise_dontneed_free_valid_vma(). It looks kind of more logical to me, but not a big deal. -- Oscar Salvador SUSE Labs