On 10/25/22 16:37, Rik van Riel wrote: > Hi Mike, > > After getting promising results initially, we discovered there > is yet another bug left with hugetlbfs MADV_DONTNEED. > > This one involves a page fault on a hugetlbfs address, while > another thread in the same process is in the middle of MADV_DONTNEED > on that same memory address. > > The code in __unmap_hugepage_range() will clear the page table > entry, and then at some point later the lazy TLB code will > actually free the huge page back into the hugetlbfs free page > pool. > > Meanwhile, hugetlb_no_page will call alloc_huge_page, and that > will fail because the code calling __unmap_hugepage_range() has > not actually returned the page to the free list yet. > > The result is that the process gets killed with SIGBUS. > > I have thought of a few different solutions to this problem, but > none of them look good: > - Make MADV_DONTNEED take a write lock on mmap_sem, to exclude > page faults. This could make MADV_DONTNEED on VMAs with 4kB > pages unacceptably slow. > - Some sort of atomic counter kept by __unmap_hugepage_range() > that huge pages may be getting placed in the tlb gather, and > freed later by tlb_finish_mmu(). This would involve changes > to the MMU gather code, outside of hugetlbfs. > - Some sort of generation counter that tracks tlb_gather_mmu > cycles in progress, with the alloc_huge_page failure path > waiting until all mmu gather operations that started before > it to finish, before retrying the allocation. This requires > changes to the generic code, outside of hugetlbfs. I've had a little more time to think about this. Sorry about the noise yesterday evening! My assumption is that you are still working with library code that gets passed an address range and is not aware it is a hugetlb mapping. Correct? Part of the issue here is that hugetlb pages are a limited resource. As such we have this somewhat ugly 'kill with SIGBUS' behavior if we are unable to allocate a hugetlb page for a page fault. I would expect applications that know they are dealing with hugetlb mappings and call MADV_DONTNEED to at least be prepared for this situation. I know that may sound crazy, but let me explain why. The first sentence in the description of MADV_DONTNEED says: "Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources associated with it.)" As mentioned, hugetlb pages are a limited resource. So, I would expect an application to MADV_DONTNEED part of a hugetlb range so that the pages can be used somewhere else. If the application has need for pages in that range later, it should not expect that those pages are available. They would only be available if anyone that may be using those pages have made them available. Some type of coordination by application code is required to manage the use the limited hugetlb pages. In hugetlbfs there is the concept of 'reservations' to somewhat manage the expectation that hugetlb pages are available for page faults. I suspect most people know that reservations are created at mmap time for the size of the mapping. The corresponding number of hugetlb pages are reserved for the mapping. As the mapping is populated, the number of reservations is decremented. Now consider a populated hugetlb mapping, and someone does a MADV_DONTNEED on a page of that mapping. When the page was originally faulted in, the reservation associated with that address was consumed. So, before the MADV_DONTNEED there is no reservation for that specific address within the mapping. In addition, when the MADV_DONTNEED is performed, there is no effort made to restore or create a reservation for the address. I 'believe' this is the desired behavior. If a reservation would be created as the result of MADV_DONTNEED, the freed page is not really 'free' as it is still reserved for this address/mapping. If we believe that MADV_DONTNEED truly frees the underlying page and does not create a reservation, then getting a SIGBUS as the result of a subsequent fault is certainly a possibility. The situation you describe above results in SIGBUS because the fault happens before the page hits the free list. However, even if we hold off faults until the page hits the free list, there is nothing to prevent someone else from taking the page off the free list before the fault happens. This would also result in SIGBUS. Do note that all this assumes there are no free hugetlb pages before the MADV_DONTNEED. This is why I think an application that knows it is performing an MADV_DONTNEED on a hugetlb mapping should can not expect with 100%certainty that subsequent faults on that range will succeed. Of course, that does not help in this specific situation. Since the application/library code does not know that it is dealing with a hugetlb mapping, the best thing would be for the MADV_DONTNEED not to succeed. However, since MADV_DONTNEED for hugetlbfs was introduced there has been some interest and desire for use. Mostly this is around userfaultfd and live migration. So, I would really hate to just remove hugetlb MADV_DONTNEED support. I have cc'ed some of those people here. > What are the reasonable alternatives here? > > Should we see if anybody can come up with a simple solution > to the problem, or would it be better to just disable > MADV_DONTNEED on hugetlbfs for now? Here is one thought (perhaps crazy). Do not allow MADV_DONTNEED on hugetlb mappings. This would help with the library code passed hugetlb mapping. For the use cases where MADV_DONTNEED on hugetlb is desirable, create ?MADV_DONTNEED_HUGETLB? that only operates on hugetlb mappings. In this way the caller must be aware they are operating on a hugetlb mapping. -- Mike Kravetz