Re: [PATCH] mremap: add MREMAP_NOHOLE flag --resend

David Rientjes <rientjes@xxxxxxxxxx> · Wed, 25 Mar 2015 19:31:13 -0700 (PDT)

On Wed, 25 Mar 2015, Daniel Micay wrote:

> > With tcmalloc, it's simple to always expand the heap by mmaping 2MB ranges 
> > for size classes <= 2MB, allocate its own metadata from an arena that is 
> > also expanded in 2MB range, and always do madvise(MADV_DONTNEED) for the 
> > longest span on the freelist when it does periodic memory freeing back to 
> > the kernel, and even better if the freed memory splits at most one 
> > hugepage.  When memory is pulled from the freelist of memory that has 
> > already been returned to the kernel, you can return a span that will make 
> > it eligible to be collapsed into a hugepage based on your setting of 
> > max_ptes_none, trying to consolidate the memory as much as possible.  If 
> > your malloc is implemented in a way to understand the benefit of 
> > hugepages, and how much memory you're willing to sacrifice (max_ptes_none) 
> > for it, then you should _never_ be increasing memory usage by 50%.
> 
> If khugepaged was the only source of huge pages, sure. The primary
> source of huge pages is the heuristic handing out an entire 2M page on
> the first page fault in a 2M range.
> 

The behavior is a property of what you brk() or mmap() to expand your 
heap: you can intentionally make it fault hugepages or not fault 
hugepages without any special madvise().

With the example above, the implementation I wrote specifically tries to 
sbrk() in 2MB regions and hands out allocator metadata via a memory arena 
that does the same thing.  That memory is treated as being on the normal 
freelist, so it is considered resident, i.e. the same as memory that was 
faulted as 4KB pages and freed but not yet returned by tcmalloc with 
madvise(MADV_DONTNEED), and we naturally prefer to hand that out before 
going to the returned freelist or to mmap() as a fallback.  
There will always be fragmentation in your normal freelist spans, so 
there's always wasted memory (with or without thp).  There should never be 
a case where you're always mapping 2MB aligned regions and then only 
touching a small portion of each.  For >2MB size classes you could easily 
map only the size required, and you would never get an excess of memory 
due to thp at fault.
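
As a rough sketch of the mapping side (this is not tcmalloc's code; 
map_hugepage_aligned() and release_span() are hypothetical names, and 2MB 
hugepages are assumed), expanding the heap in 2MB aligned regions and 
returning a free span to the kernel could look something like this:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Assumption: 2MB hugepages, as on x86-64. */
#define HUGEPAGE_SIZE ((size_t)2 << 20)

/*
 * Hypothetical helper: map len bytes (a multiple of 2MB) at a 2MB aligned
 * address by over-mapping and trimming the slack.  Returns NULL on failure.
 */
static void *map_hugepage_aligned(size_t len)
{
	if (len == 0 || len % HUGEPAGE_SIZE != 0)
		return NULL;

	size_t padded = len + HUGEPAGE_SIZE;
	char *raw = mmap(NULL, padded, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return NULL;

	/* Round the start up to the next 2MB boundary. */
	uintptr_t start = ((uintptr_t)raw + HUGEPAGE_SIZE - 1) &
			  ~(uintptr_t)(HUGEPAGE_SIZE - 1);
	size_t head = start - (uintptr_t)raw;
	size_t tail = padded - head - len;

	/* Unmap the unaligned slack before and after the aligned range. */
	if (head)
		munmap(raw, head);
	if (tail)
		munmap((char *)start + len, tail);

	assert(start % HUGEPAGE_SIZE == 0);
	return (void *)start;
}

/*
 * Hypothetical helper: return a free span to the kernel.  The mapping
 * stays valid; for private anonymous memory the next touch faults in
 * zero-filled pages.
 */
static int release_span(void *addr, size_t len)
{
	return madvise(addr, len, MADV_DONTNEED);
}
```

Whether the first touch of such a region gets a hugepage still depends on 
the system's thp settings; the sketch only controls alignment and size.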

I think this may be tangential to the thread, though, since this has 
nothing to do with mremap() or any new mremap() flag.

If the thp faulting behavior is going to be changed, then it would need to 
be something that is explicitly opted into, and not through any system 
tunable or madvise() flag.  It would probably need to be a prctl(), like 
PR_SET_THP_DISABLE, that would control only fault behavior.
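
For reference, this is roughly how the existing per-process opt-out is 
used today.  Note that PR_SET_THP_DISABLE turns thp off for the whole 
process rather than only at fault, so this is just the existing interface 
and not the fault-only control suggested above; set_thp_disabled() and 
get_thp_disabled() are hypothetical wrapper names, and the fallback 
defines are only there for older headers:

```c
#include <assert.h>
#include <sys/prctl.h>

/* Fallback values for headers that predate Linux 3.15. */
#ifndef PR_SET_THP_DISABLE
#define PR_SET_THP_DISABLE 41
#endif
#ifndef PR_GET_THP_DISABLE
#define PR_GET_THP_DISABLE 42
#endif

/*
 * Hypothetical wrapper: set or clear the per-process "thp disabled" flag.
 * Returns 0 on success, -1 on error (e.g. kernel without this prctl).
 */
static int set_thp_disabled(int disable)
{
	return prctl(PR_SET_THP_DISABLE, disable ? 1UL : 0UL, 0UL, 0UL, 0UL);
}

/*
 * Hypothetical wrapper: returns 1 if thp is disabled for this process,
 * 0 if not, -1 on error.
 */
static int get_thp_disabled(void)
{
	return prctl(PR_GET_THP_DISABLE, 0UL, 0UL, 0UL, 0UL);
}
```

The flag is inherited across fork() and preserved across execve(), which 
is what makes a prctl() a reasonable place for this kind of opt-in.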