Re: [PATCH v5 0/3] hugetlbfs: close race between MADV_DONTNEED and page fault

Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> · Sat, 30 Sep 2023 19:54:34 -0700

On Sat, 30 Sep 2023 20:55:47 -0400 riel@xxxxxxxxxxx wrote:

> v5: somehow a __vma_private_lock(vma) test failed to make it from my tree into the v4 series, fix that
> v4: fix unmap_vmas locking issue pointed out by Mike Kravetz, and resulting lockdep fallout
> v3: fix compile error w/ lockdep and test case errors with patch 3
> v2: fix the locking bug found with the libhugetlbfs tests.
> 
> Malloc libraries, like jemalloc and tcalloc, take decisions on when
> to call madvise independently from the code in the main application.
> 
> This sometimes results in the application page faulting on an address,
> right after the malloc library has shot down the backing memory with
> MADV_DONTNEED.
> 
> Usually this is harmless, because we always have some 4kB pages
> sitting around to satisfy a page fault. However, with hugetlbfs
> systems often allocate only the exact number of huge pages that
> the application wants.
> 
> Due to TLB batching, hugetlbfs MADV_DONTNEED will free pages outside of
> any lock taken on the page fault path, which can open up the following
> race condition:
> 
>        CPU 1                            CPU 2
> 
>        MADV_DONTNEED
>        unmap page
>        shoot down TLB entry
>                                        page fault
>                                        fail to allocate a huge page
>                                        killed with SIGBUS
>        free page
> 
> Fix that race by extending the hugetlb_vma_lock locking scheme to also
> cover private hugetlb mappings (with resv_map), and pulling the locking 
> from __unmap_hugepage_final_range into helper functions called from
> zap_page_range_single. This ensures page faults stay locked out of
> the MADV_DONTNEED VMA until the huge pages have actually been freed.

Didn't we decide that [1/3] and [2/3] should be cc:stable?

> The third patch in the series is more of an RFC. Using the
> invalidate_lock instead of the hugetlb_vma_lock greatly simplifies
> the code, but at the cost of turning a per-VMA lock into a lock
> per backing hugetlbfs file, which could slow things down when
> multiple processes are mapping the same hugetlbfs file.

"could slow things down" is testable-for?

This third one I'd queue up for testing for a 6.7-rc1 merge, so I'll split
the series apart.  Not a problem, but it would be a little better if
things were originally packaged that way.