Re: [PATCH v2] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing

Mike Kravetz <mike.kravetz@xxxxxxxxxx> · Sat, 29 Oct 2022 17:15:55 -0700

On 10/28/22 19:20, Peter Xu wrote:
> On Fri, Oct 28, 2022 at 02:17:01PM -0700, Mike Kravetz wrote:
> > On 10/28/22 12:13, Peter Xu wrote:
> > > On Fri, Oct 28, 2022 at 08:23:25AM -0700, Mike Kravetz wrote:
> > > > On 10/26/22 21:12, Peter Xu wrote:
> > > > > On Wed, Oct 26, 2022 at 04:54:01PM -0700, Mike Kravetz wrote:
> > > > > > On 10/26/22 17:42, Peter Xu wrote:
> > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > index c7105ec6d08c..d8b4d7e56939 100644
> > > > --- a/mm/madvise.c
> > > > +++ b/mm/madvise.c
> > > > @@ -790,7 +790,10 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
> > > >  static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
> > > >  					unsigned long start, unsigned long end)
> > > >  {
> > > > -	zap_page_range(vma, start, end - start);
> > > > +	if (!is_vm_hugetlb_page(vma))
> > > > +		zap_page_range(vma, start, end - start);
> > > > +	else
> > > > +		clear_hugetlb_page_range(vma, start, end);
> > > 
> > > With the new ZAP_FLAG_UNMAP flag, clear_hugetlb_page_range() can be dropped
> > > completely?  As zap_page_range() won't be with ZAP_FLAG_UNMAP so we can
> > > identify things?
> > > 
> > > IIUC that's the major reason why I thought the zap flag could be helpful..
> > 
> > Argh.  I went to drop clear_hugetlb_page_range() but there is one issue.
> > In zap_page_range() the MMU_NOTIFY_CLEAR notifier is certainly called.
> > However, we really need to have a 'adjust_range_if_pmd_sharing_possible'
> > call in there because the 'range' may be part of a shared pmd.  :(
> > 
> > I think we need to either have a separate routine like clear_hugetlb_page_range
> > that sets up the appropriate range, or special case hugetlb in zap_page_range.
> > What do you think?
> > I think clear_hugetlb_page_range is the least bad of the two options.
> 
> How about special case hugetlb as you mentioned?  If I'm not wrong, it
> should be 3 lines change:
> 
> ---8<---
> diff --git a/mm/memory.c b/mm/memory.c
> index c5599a9279b1..0a1632e44571 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1706,11 +1706,13 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
>         lru_add_drain();
>         mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
>                                 start, start + size);
> +       if (is_vm_hugetlb_page(vma))
> +               adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
>         tlb_gather_mmu(&tlb, vma->vm_mm);
>         update_hiwater_rss(vma->vm_mm);
>         mmu_notifier_invalidate_range_start(&range);
>         do {
> -               unmap_single_vma(&tlb, vma, start, range.end, NULL);
> +               unmap_single_vma(&tlb, vma, start, start + size, NULL);
>         } while ((vma = mas_find(&mas, end - 1)) != NULL);
>         mmu_notifier_invalidate_range_end(&range);
>         tlb_finish_mmu(&tlb);
> ---8<---
> 
> As zap_page_range() is already vma-oriented anyway.  But maybe I missed
> something important?

zap_page_range is a bit confusing.  It appears that the passed range can
span multiple vmas.  Otherwise, there would be no do while loop.  Yet, there
is only one mmu_notifier_range_init call specifying the passed vma.

It appears all callers pass a range entirely within a single vma.

The modifications above would work for a range within a single vma.  However,
things would be more complicated if the range can indeed span multiple vmas.
For multiple vmas, we would need to check the first and last vmas for
pmd sharing.

Anyone know more about this seeming confusing behavior?  Perhaps, range
spanning multiple vmas was left over earlier code?
-- 
Mike Kravetz