On Tue, 2023-09-26 at 14:15 -0700, Andrew Morton wrote:
> On Mon, 25 Sep 2023 23:10:51 -0400 riel@xxxxxxxxxxx wrote:
>
> > From: Rik van Riel <riel@xxxxxxxxxxx>
> >
> > Malloc libraries, like jemalloc and tcmalloc, take decisions on when
> > to call madvise independently from the code in the main
> > application.
> >
> > This sometimes results in the application page faulting on an
> > address, right after the malloc library has shot down the backing
> > memory with MADV_DONTNEED.
> >
> > Usually this is harmless, because we always have some 4kB pages
> > sitting around to satisfy a page fault. However, with hugetlbfs,
> > systems often allocate only the exact number of huge pages that
> > the application wants.
> >
> > Due to TLB batching, hugetlbfs MADV_DONTNEED will free pages
> > outside of any lock taken on the page fault path, which can open
> > up the following race condition:
> >
> >        CPU 1                            CPU 2
> >
> >        MADV_DONTNEED
> >        unmap page
> >        shoot down TLB entry
> >                                         page fault
> >                                         fail to allocate a huge page
> >                                         killed with SIGBUS
> >        free page
> >
> > Fix that race by pulling the locking from
> > __unmap_hugepage_final_range into helper functions called from
> > zap_page_range_single. This ensures page faults stay locked out of
> > the MADV_DONTNEED VMA until the huge pages have actually been freed.
>
> Was a -stable backport considered?
>

That's a good idea. I'll have to see how far back the
hugetlb_vma_*_lock stuff exists. We probably don't want
to backport all the required infrastructure everywhere.

-- 
All Rights Reversed.
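
For anyone following along, the "usually harmless" case from the quoted
changelog can be observed from userspace. What follows is only a minimal
sketch, assuming Linux and Python 3.8 or later (for mmap.madvise). The
helper name touch_after_dontneed is made up for illustration and is not
part of the patch. It uses ordinary private anonymous memory, not
hugetlbfs, so it shows the refault succeeding with a fresh zero-filled
page; it does not reproduce the SIGBUS race or the locking fix.

```python
import mmap


def touch_after_dontneed(payload=b"hello"):
    """Write payload, drop the page with MADV_DONTNEED, then read it back.

    Hypothetical helper for illustration only. Returns (before, after):
    the bytes read before and after the madvise call.
    """
    size = mmap.PAGESIZE
    # Private anonymous mapping backed by ordinary base pages (not hugetlbfs).
    mem = mmap.mmap(
        -1,
        size,
        flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
        prot=mmap.PROT_READ | mmap.PROT_WRITE,
    )
    try:
        mem[: len(payload)] = payload
        before = mem[: len(payload)]
        # This is what a malloc library might do behind the application's back.
        mem.madvise(mmap.MADV_DONTNEED)
        # The next access page faults; the kernel supplies a zero-filled page.
        after = mem[: len(payload)]
    finally:
        mem.close()
    return before, after


if __name__ == "__main__":
    before, after = touch_after_dontneed()
    print("before:", before)
    print("after: ", after)
```

On a hugetlbfs mapping the same access pattern needs a free huge page at
fault time, which is where the race described in the changelog comes in.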