Re: [BUG] hugetlbfs_no_page vs MADV_DONTNEED race leading to SIGBUS

Mike Kravetz <mike.kravetz@xxxxxxxxxx> · Wed, 26 Oct 2022 12:38:43 -0700

On 10/25/22 16:37, Rik van Riel wrote:
> Hi Mike,
> 
> After getting promising results initially, we discovered there
> is yet another bug left with hugetlbfs MADV_DONTNEED.
> 
> This one involves a page fault on a hugetlbfs address, while
> another thread in the same process is in the middle of MADV_DONTNEED
> on that same memory address.
> 
> The code in __unmap_hugepage_range() will clear the page table
> entry, and then at some point later the lazy TLB code will 
> actually free the huge page back into the hugetlbfs free page
> pool.
> 
> Meanwhile, hugetlb_no_page will call alloc_huge_page, and that
> will fail because the code calling __unmap_hugepage_range() has
> not actually returned the page to the free list yet.
> 
> The result is that the process gets killed with SIGBUS.
> 
> I have thought of a few different solutions to this problem, but
> none of them look good:
> - Make MADV_DONTNEED take a write lock on mmap_sem, to exclude
>   page faults. This could make MADV_DONTNEED on VMAs with 4kB
>   pages unacceptably slow.
> - Some sort of atomic counter kept by __unmap_hugepage_range()
>   that huge pages may be getting placed in the tlb gather, and
>   freed later by tlb_finish_mmu().  This would involve changes
>   to the MMU gather code, outside of hugetlbfs.
> - Some sort of generation counter that tracks tlb_gather_mmu
>   cycles in progress, with the alloc_huge_page failure path
>   waiting until all mmu gather operations that started before
>   it to finish, before retrying the allocation. This requires
>   changes to the generic code, outside of hugetlbfs.

I've had a little more time to think about this.  Sorry about the noise
yesterday evening!

My assumption is that you are still working with library code that gets
passed an address range and is not aware it is a hugetlb mapping.  Correct?

Part of the issue here is that hugetlb pages are a limited resource.  As
such we have this somewhat ugly 'kill with SIGBUS' behavior if we are
unable to allocate a hugetlb page for a page fault.  I would expect
applications that know they are dealing with hugetlb mappings and call
MADV_DONTNEED to at least be prepared for this situation.  I know that may
sound crazy, but let me explain why.

The first sentence in the description of MADV_DONTNEED says:
"Do  not  expect access in the near future.  (For the time being, the
 application is finished with the given range, so the  kernel can free
 resources associated with it.)"
As mentioned, hugetlb pages are a limited resource.  So, I would expect an
application to MADV_DONTNEED part of a hugetlb range so that the pages can
be used somewhere else.  If the application has need for pages in that range
later, it should not expect that those pages are available.  They would only
be available if anyone that may be using those pages have made them available.
Some type of coordination by application code is required to manage the
use the limited hugetlb pages.

In hugetlbfs there is the concept of 'reservations' to somewhat manage
the expectation that hugetlb pages are available for page faults.  I
suspect most people know that reservations are created at mmap time for
the size of the mapping.  The corresponding number of hugetlb pages are
reserved for the mapping.  As the mapping is populated, the number of
reservations is decremented.

Now consider a populated hugetlb mapping, and someone does a MADV_DONTNEED
on a page of that mapping.  When the page was originally faulted in, the
reservation associated with that address was consumed.  So, before the
MADV_DONTNEED there is no reservation for that specific address within
the mapping.  In addition, when the MADV_DONTNEED is performed, there is no
effort made to restore or create a reservation for the address.  I 'believe'
this is the desired behavior.  If a reservation would be created as the result
of MADV_DONTNEED, the freed page is not really 'free' as it is still reserved
for this address/mapping.

If we believe that MADV_DONTNEED truly frees the underlying page and does not
create a reservation, then getting a SIGBUS as the result of a subsequent
fault is certainly a possibility.  The situation you describe above results
in SIGBUS because the fault happens before the page hits the free list.
However, even if we hold off faults until the page hits the free list,
there is nothing to prevent someone else from taking the page off the free
list before the fault happens.  This would also result in SIGBUS.  Do
note that all this assumes there are no free hugetlb pages before the
MADV_DONTNEED.  This is why I think an application that knows it is
performing an MADV_DONTNEED on a hugetlb mapping should can not expect
with 100%certainty that subsequent faults on that range will succeed.

Of course, that does not help in this specific situation.  Since the
application/library code does not know that it is dealing with a hugetlb
mapping, the best thing would be for the MADV_DONTNEED not to succeed.
However, since MADV_DONTNEED for hugetlbfs was introduced there has been
some interest and desire for use.  Mostly this is around userfaultfd and live
migration.  So, I would really hate to just remove hugetlb MADV_DONTNEED
support.  I have cc'ed some of those people here.

> What are the reasonable alternatives here?
> 
> Should we see if anybody can come up with a simple solution
> to the problem, or would it be better to just disable
> MADV_DONTNEED on hugetlbfs for now?

Here is one thought (perhaps crazy).  Do not allow MADV_DONTNEED on
hugetlb mappings.  This would help with the library code passed hugetlb
mapping.  For the use cases where MADV_DONTNEED on hugetlb is desirable,
create ?MADV_DONTNEED_HUGETLB? that only operates on hugetlb mappings.
In this way the caller must be aware they are operating on a hugetlb
mapping.
-- 
Mike Kravetz