Re: [PATCH] mremap: add MREMAP_NOHOLE flag --resend

On 03/22/2015 08:22 AM, Daniel Micay wrote:
BTW, THP currently interacts very poorly with the jemalloc/tcmalloc
madvise purging. The part where khugepaged assigns huge pages to dense
spans of pages is *great*. The part where the kernel hands out a huge
page for a fault in a 2M span can be awful. It causes the model
inside the allocator of uncommitted vs. committed pages to break down.

For example, the allocator might use 1M of a huge page and then start
purging. The purging will split it into 4k pages, so there will be 1M of
zeroed 4k pages that are considered purged by the allocator. Over time,
this can cripple purging. Search for "jemalloc huge pages" and you'll
find lots of horror stories about this.

I'm not sure I get your description right. The problem I know about is where "purging" means madvise(MADV_DONTNEED) and khugepaged later collapses a new hugepage that will repopulate the purged parts, increasing the memory usage. One can limit this via /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none. That setting doesn't affect the page fault THP allocations; those, however, happen only in newly accessed hugepage-sized areas, not in partially purged ones.
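
For an allocator that wants to check the khugepaged limit mentioned above before deciding how aggressively to purge, a minimal sketch follows. The helper name `read_max_ptes_none` and the `-1` error convention are hypothetical; only the sysfs path comes from the message. Changing the value means writing to the same file, which normally requires root.

```c
#include <stdio.h>

/* Path taken from the message above. */
#define KHUGEPAGED_MAX_PTES_NONE \
    "/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none"

/*
 * Hypothetical helper: read a single integer tunable from a sysfs file.
 * Returns the value, or -1 if the file is missing or unparsable
 * (for example on a kernel built without THP).
 */
static long read_max_ptes_none(const char *path)
{
    long value = -1;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

A caller would use `read_max_ptes_none(KHUGEPAGED_MAX_PTES_NONE)` and treat `-1` as "unknown".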

I think a THP implementation that played well with purging would
need to drop the page fault heuristic and rely on a significantly better
khugepaged.

See here http://lwn.net/Articles/636162/ (the "Compaction" part)

The objection is that some short-lived workloads like gcc have to map hugepages immediately if they are to benefit from them. I still plan to improve khugepaged and allow admins to say that they don't want THP page faults (relying solely on khugepaged, which has more information to judge additional memory usage), but I'm not sure it would be an acceptable default behavior. One workaround in the current state for jemalloc and friends could be to use madvise(MADV_NOHUGEPAGE) on hugepage-sized/aligned areas where they want to purge parts via madvise(MADV_DONTNEED). That would cost another syscall, plus tracking of where this was applied and when it makes sense to undo it and allow THP to be collapsed again, and it would also split VMAs.
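
The workaround described above could be sketched roughly as follows. This is an illustration, not jemalloc's or tcmalloc's actual code: the helper names are hypothetical, the 2 MiB huge page size is an assumption (it matches the "2M span" in the quoted text and x86-64), and the allocator is assumed to own the whole huge-page-aligned span around the range it purges.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Assumption: 2 MiB huge pages, as on x86-64. */
#define HPAGE_SIZE ((size_t)2 << 20)

/* Round a purge range out to the huge-page-aligned span that contains it. */
static void enclosing_hpage_span(uintptr_t addr, size_t len,
                                 uintptr_t *start, size_t *span_len)
{
    uintptr_t mask = ~(uintptr_t)(HPAGE_SIZE - 1);
    uintptr_t lo = addr & mask;
    uintptr_t hi = (addr + len + HPAGE_SIZE - 1) & mask;

    *start = lo;
    *span_len = hi - lo;
}

/*
 * Hypothetical helper: opt the enclosing span out of THP, then purge.
 * addr and len must be page-aligned. The MADV_NOHUGEPAGE call is treated
 * as a best-effort hint, so its failure (e.g. EINVAL on a kernel without
 * THP) is deliberately ignored. Returns the result of MADV_DONTNEED.
 */
static int purge_without_thp(void *addr, size_t len)
{
    uintptr_t start;
    size_t span_len;

    enclosing_hpage_span((uintptr_t)addr, len, &start, &span_len);
    (void)madvise((void *)start, span_len, MADV_NOHUGEPAGE);
    return madvise(addr, len, MADV_DONTNEED);
}
```

This is the extra syscall and VMA split mentioned above in concrete form. The sketch does not track where the hint was applied; an allocator that later wants the span collapsed again would need that bookkeeping, and madvise(MADV_HUGEPAGE) would clear the hint only by replacing it with an explicit opt-in rather than restoring the default.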

This would mean faulting in a span of memory would no longer
be faster. Having a flag to populate a range with madvise would help a

If it's newly mapped memory, there's mmap(MAP_POPULATE). There is also madvise(MADV_WILLNEED), which sounds like what you want, but I don't know what the implementation does exactly. It was apparently added for paging in ahead, and maybe it ignores unpopulated anonymous areas, but it would probably be well within the spirit of the flag to make it prepopulate those.
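
A minimal sketch of the mmap(MAP_POPULATE) option, with a mincore()-based counter for observing what actually became resident. The helper names `map_anon` and `resident_pages` are hypothetical. MAP_POPULATE is best effort, so the sketch makes no claim about how many pages end up resident.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map private anonymous memory, optionally asking the kernel to prefault it. */
static void *map_anon(size_t len, int populate)
{
    int flags = MAP_PRIVATE | MAP_ANONYMOUS | (populate ? MAP_POPULATE : 0);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}

/*
 * Count the pages in [addr, addr + len) that mincore() reports as resident.
 * addr must be page-aligned. Returns -1 on error.
 */
static long resident_pages(void *addr, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t n = (len + page - 1) / page;
    unsigned char *vec = malloc(n);
    long count = 0;

    if (vec == NULL)
        return -1;
    if (mincore(addr, len, vec) != 0) {
        free(vec);
        return -1;
    }
    for (size_t i = 0; i < n; i++)
        count += vec[i] & 1;   /* low bit set means the page is resident */
    free(vec);
    return count;
}
```

The same counter could be used to check the open question about MADV_WILLNEED empirically: map with `map_anon(len, 0)`, call madvise(MADV_WILLNEED) on the range, and compare `resident_pages()` before and after on the kernel in question.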

lot though, since the allocator knows exactly how much it's going to
clobber with the memcpy. There will still be a threshold where mremap
gets significantly faster, but it would move it higher.

--
To unsubscribe from this list: send the line "unsubscribe linux-api" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html