On 10/25/22 20:07, Mike Kravetz wrote:
> On 10/25/22 16:37, Rik van Riel wrote:
> > Hi Mike,
> >
> > After getting promising results initially, we discovered there is yet
> > another bug left with hugetlbfs MADV_DONTNEED.
> >
> > This one involves a page fault on a hugetlbfs address, while another
> > thread in the same process is in the middle of MADV_DONTNEED on that
> > same memory address.
> >
> > The code in __unmap_hugepage_range() will clear the page table entry,
> > and then at some point later the lazy TLB code will actually free the
> > huge page back into the hugetlbfs free page pool.
> >
> > Meanwhile, hugetlb_no_page will call alloc_huge_page, and that will
> > fail because the code calling __unmap_hugepage_range() has not
> > actually returned the page to the free list yet.
> >
> > The result is that the process gets killed with SIGBUS.
>
> Thanks for the details Rik. That makes perfect sense.
>
> > I have thought of a few different solutions to this problem, but none
> > of them look good:
> > - Make MADV_DONTNEED take a write lock on mmap_sem, to exclude page
> >   faults. This could make MADV_DONTNEED on VMAs with 4kB pages
> >   unacceptably slow.
> > - Some sort of atomic counter, kept by __unmap_hugepage_range(),
> >   indicating that huge pages may be getting placed in the tlb gather
> >   and freed later by tlb_finish_mmu(). This would involve changes to
> >   the MMU gather code, outside of hugetlbfs.
> > - Some sort of generation counter that tracks tlb_gather_mmu cycles
> >   in progress, with the alloc_huge_page failure path waiting until
> >   all mmu gather operations that started before it have finished,
> >   before retrying the allocation. This requires changes to the
> >   generic code, outside of hugetlbfs.
> >
> > What are the reasonable alternatives here?
> >
> > Should we see if anybody can come up with a simple solution to the
> > problem, or would it be better to just disable MADV_DONTNEED on
> > hugetlbfs for now?
>
> I am thinking that we could use the vma_lock to prevent faults on the
> vma until the MADV_DONTNEED processing is totally complete. IIUC, it
> is the tlb_finish_mmu call that actually drops the ref count on the
> pages and returns them to the free list. Currently, we hold the
> vma_lock in write mode during unmap, and acquire it in read mode
> during faults. However, we drop it in the MADV_DONTNEED path (just a
> bit) before calling tlb_finish_mmu. So, holding it a bit longer may
> address this issue rather simply.

Sorry, that is wrong! The vma_lock only applies to sharable mappings,
and this is certainly about non-sharable mappings. Please disregard.

I will continue to think about this and try to come up with a solution.
If not, we may need to disable hugetlb MADV_DONTNEED. I really don't
want to modify too much (if any) non-hugetlb code to accommodate this.
--
Mike Kravetz
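To make the failure mode Rik describes concrete, here is a minimal userspace
reproducer sketch: one thread keeps faulting on a hugetlb page while another
keeps discarding it with MADV_DONTNEED. It is an illustration written for
this discussion, not a test case from the original thread; the 2 MB page
size and the assumption of a single-page free pool (vm.nr_hugepages=1) are
assumptions.

    /*
     * Sketch: fault on a hugetlb page in one thread while another thread
     * runs MADV_DONTNEED on the same page.  Assumes a 2 MB default huge
     * page size and only one huge page in the pool, so the fault can only
     * succeed once the discarded page is actually back on the free list.
     */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define HPAGE_SIZE (2UL * 1024 * 1024)	/* assumed huge page size */

    static char *map;

    /* Thread A: keep touching the page, driving hugetlb_no_page(). */
    static void *faulter(void *arg)
    {
    	for (;;)
    		((volatile char *)map)[0] = 1;
    	return NULL;
    }

    /* Thread B: keep discarding the same page. */
    static void *discarder(void *arg)
    {
    	for (;;)
    		madvise(map, HPAGE_SIZE, MADV_DONTNEED);
    	return NULL;
    }

    int main(void)
    {
    	pthread_t a, b;

    	map = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
    		   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    	if (map == MAP_FAILED) {
    		perror("mmap");	/* needs huge pages reserved via nr_hugepages */
    		return 1;
    	}

    	pthread_create(&a, NULL, faulter, NULL);
    	pthread_create(&b, NULL, discarder, NULL);

    	/*
    	 * If the fault loses the race, alloc_huge_page() fails and the
    	 * faulting thread is killed with SIGBUS, taking the process down.
    	 */
    	pause();
    	return 0;
    }

If the race is hit, the process dies with SIGBUS rather than simply
refaulting the page, which is exactly the symptom reported above.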
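For the third option Rik lists, the rough shape of the generation-counter
idea can be sketched in userspace. Every name here (gather_begin,
gather_end, wait_for_prior_gathers) is hypothetical, and the real change
would have to live in the generic mmu_gather code, which is precisely the
downside Rik points out.

    /*
     * Sketch of the generation-counter idea: count gathers as they start
     * and finish; a failed huge page allocation waits until as many
     * gathers have finished as had started when the failure was seen,
     * then retries.  Simple counting only approximates "every gather that
     * started before us has finished", since completions are not matched
     * to individual gathers.
     */
    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_ulong gathers_started;	/* where tlb_gather_mmu() begins */
    static atomic_ulong gathers_finished;	/* after tlb_finish_mmu() frees pages */

    static void gather_begin(void)
    {
    	atomic_fetch_add(&gathers_started, 1);
    }

    static void gather_end(void)
    {
    	atomic_fetch_add(&gathers_finished, 1);
    }

    /*
     * Called from the alloc_huge_page() failure path: once this returns,
     * any huge page that was sitting in an already-running gather should
     * have been returned to the pool, so a retry is worthwhile.
     */
    static bool wait_for_prior_gathers(void)
    {
    	unsigned long target = atomic_load(&gathers_started);

    	while (atomic_load(&gathers_finished) < target)
    		;	/* real code would sleep on a waitqueue, not spin */

    	return true;
    }

Either counter variant puts the bookkeeping in the common mmu_gather path,
which is the non-hugetlb churn the thread is trying to avoid.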