Re: [PATCH] mremap: add MREMAP_NOHOLE flag --resend

Daniel Micay <danielmicay@xxxxxxxxx> · Wed, 25 Mar 2015 20:24:42 -0400

On 25/03/15 08:19 PM, David Rientjes wrote:
> On Wed, 25 Mar 2015, Daniel Micay wrote:
> 
>>> I'm not sure I get your description right. The problem I know about is
>>> where "purging" means madvise(MADV_DONTNEED) and khugepaged later
>>> collapses a new hugepage that will repopulate the purged parts,
>>> increasing the memory usage. One can limit this via
>>> /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none . That
>>> setting doesn't affect the page fault THP allocations, which however
>>> happen only in newly accessed hugepage-sized areas and not partially
>>> purged ones, though.
>>
>> Since jemalloc doesn't unmap memory but instead does recycling itself in
>> userspace, it ends up with large spans of free virtual memory and gets
>> *lots* of huge pages from the page fault heuristic. It keeps track of
>> active vs. dirty (not purged) vs. clean (purged / untouched) ranges
>> everywhere, and will purge dirty ranges as they build up.
>>
>> The THP allocation on page faults means it ends up with memory that's
>> supposed to be clean but is really not.
>>
>> A worst case example with the (up until recently) default chunk size of
>> 4M is allocating a bunch of 2.1M allocations. Chunks are naturally
>> aligned, so each one can be represented as 2 huge pages. It increases
>> memory usage by nearly *50%*. The allocator thinks the tail is clean
>> memory, but it's not. When the allocations are freed, it will purge the
>> 2.1M at the head (once enough dirty memory builds up) but all of the
>> tail memory will be leaked until something else is allocated there and
>> then freed.
>>
> 
> With tcmalloc, it's simple to always expand the heap by mmaping 2MB ranges 
> for size classes <= 2MB, allocate its own metadata from an arena that is 
> also expanded in 2MB ranges, and always do madvise(MADV_DONTNEED) for the 
> longest span on the freelist when it does periodic memory freeing back to 
> the kernel, and even better if the freed memory splits at most one 
> hugepage.  When memory is pulled from the freelist of memory that has 
> already been returned to the kernel, you can return a span that will make 
> it eligible to be collapsed into a hugepage based on your setting of 
> max_ptes_none, trying to consolidate the memory as much as possible.  If 
> your malloc is implemented in a way to understand the benefit of 
> hugepages, and how much memory you're willing to sacrifice (max_ptes_none) 
> for it, then you should _never_ be increasing memory usage by 50%.

If khugepaged were the only source of huge pages, sure. But the primary
source of huge pages is the page fault heuristic, which hands out an
entire 2M page on the first page fault in a 2M range.
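
To put rough numbers on the 2.1M worst case quoted above, here is a
minimal sketch of the arithmetic. It assumes 2M huge pages, 4K base
pages, an allocation placed at the start of a huge-page-aligned chunk,
and that every 2M range the allocation touches gets backed by a full
huge page on first fault. The helper names are made up for illustration
and are not jemalloc's.

```python
HUGE_PAGE = 2 * 1024 * 1024   # assumed 2M transparent huge page size
BASE_PAGE = 4 * 1024          # assumed 4K base page size


def round_up(n, unit):
    """Round n up to the next multiple of unit."""
    return -(-n // unit) * unit


def fault_in_cost(alloc_bytes):
    """Return (used, resident, hidden_tail) in bytes for one fully touched
    allocation at the start of a huge-page-aligned chunk.

    used        -- what the allocator tracks as dirty (4K granularity)
    resident    -- what is actually backed if every touched 2M range
                   is faulted in as a whole huge page
    hidden_tail -- memory the allocator believes is clean but is not
    """
    used = round_up(alloc_bytes, BASE_PAGE)
    resident = round_up(alloc_bytes, HUGE_PAGE)
    return used, resident, resident - used


if __name__ == "__main__":
    used, resident, tail = fault_in_cost(int(2.1 * 1024 * 1024))
    print(f"used={used} resident={resident} tail={tail} "
          f"({100 * tail / resident:.1f}% of resident)")
```

Under those assumptions a 2.1M allocation ends up with two huge pages
resident, and a bit under half of that is tail the allocator never sees
as dirty, so purging the 2.1M head leaves the tail behind.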
