> Yes, that might be a useful feature. (Assuming I understood it
> correctly.) I believe tcmalloc would likely use:
>
>     mremap(old_ptr, move_size, move_size,
>            MREMAP_MAYMOVE | MREMAP_FIXED | MREMAP_NOHOLE,
>            new_ptr);
>
> as an optimized equivalent of:
>
>     memcpy(new_ptr, old_ptr, move_size);
>     madvise(old_ptr, move_size, MADV_DONTNEED);

Yeah, it's essentially an optimized memcpy for when you don't need the
source allocation anymore.

> a) what is the smallest size where mremap is going to be faster?

There are probably a lot of variables here like the CPU design and the
speed of system calls (syscall auditing makes them much slower!) in
addition to the stuff you've pointed out.

> My initial thinking was that we'd likely use mremap in all cases where
> we know that touching the destination would cause minor page faults
> (i.e. when the destination chunk was MADV_DONTNEED-ed or is a brand new
> mapping). And then also always when the size is large enough, because
> "teleporting" a large count of pages is likely to be faster than
> copying them.
>
> But now I realize that it is more interesting than that, because as
> Daniel pointed out, mremap holds mmap_sem exclusively, while page
> faults hold it for read. That could be optimized of course, either by a
> separate "teleport ptes" syscall (again, as noted by Daniel), or by
> having mremap drop mmap_sem for write and retake it for read for the
> "moving pages" part of the work. Not being really familiar with kernel
> code I have no idea if that's doable or not, but it looks like it might
> be quite important.

I think it's doable, but it would pessimize the case where the dest VMA
isn't reusable: it would need to optimistically take the reader lock to
find out and then drop it. However, userspace knows when this is surely
going to work and could give it a hint.

I have a good idea about what the *ideal* API for the jemalloc/tcmalloc
case would be. It would be extremely specific though... they want the
kernel to move pages from a source VMA to a destination VMA where both
are anon/private with identical flags, so only the reader lock is
necessary. On top of that, they really want to keep around as many
destination pages as possible, maybe by swapping as many as possible
back to the source.

That's *extremely* specific though, and I now think the best way to get
there is by landing this feature and then extending it as necessary down
the road. An allocator may actually want to manage other kinds of
mappings itself, and it would want the mmap_sem optimization to be an
optional hint.

> And I confirm that with all default settings tcmalloc and jemalloc lose
> to glibc. Also, notably, a recent dev build of jemalloc (what is going
> to be 4.0 AFAIK) actually matches or exceeds glibc speed, despite still
> not doing mremap. Apparently it is smarter about avoiding moving the
> allocation for those realloc-s. And it was even able to resist my
> attempt to force it to move the allocation. I haven't investigated why.
> Note that I built it a couple weeks or so ago from the dev branch, so
> it might simply have bugs.

I submitted patches teaching jemalloc to expand/shrink huge allocations
in-place, so it's hitting the in-place resize path after the initial
iteration on a repeated reallocation benchmark that's not doing any
other allocations.
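Roughly, the pattern looks like this -- a sketch only, with made-up
sizes; anything past the huge-allocation threshold behaves the same way:

/* Repeated reallocation of a single huge allocation with no other
 * allocation activity. With in-place huge resize, only the first
 * iteration pays for a copy; without it, every iteration is a memcpy
 * (or an mremap, with the proposed flag). */
#include <stdlib.h>
#include <string.h>

int main(void) {
    size_t size = 8 * 1024 * 1024;      /* well past the huge threshold */
    char *p = malloc(size);
    if (!p)
        return 1;
    memset(p, 0xa5, size);              /* touch the pages so there is data to move */

    for (int i = 0; i < 1000; i++) {
        /* Alternate between two huge sizes. */
        size = (i & 1) ? 12 * 1024 * 1024 : 8 * 1024 * 1024;
        char *q = realloc(p, size);
        if (!q) {
            free(p);
            return 1;
        }
        p = q;
    }
    free(p);
    return 0;
}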
In jemalloc, everything is allocated via naturally aligned chunks (4M
before, recently down to 256k in master), so if you want to block
in-place huge reallocation you'll either need to force a new non-huge
chunk to be allocated or make one that's at least as large as the chunk
size.

I don't think in-place reallocation is very common in long-running
programs. It's probably more common now that jemalloc is experimenting
with first-fit for chunk/huge allocation rather than address-ordered
best-fit. The best-fit algorithm is designed to keep the opportunity for
in-place reallocation to a minimum, although address ordering does
counter it :).

> NOTE: TCMALLOC_AGGRESSIVE_DECOMMIT=t (and default since 2.4) makes
> tcmalloc MADV_DONTNEED large free blocks immediately, as opposed to
> much less eagerly with a setting of "false". And it makes a big
> difference on page fault counts and thus on runtime.
>
> Another notable thing is how mlock effectively disables MADV_DONTNEED
> for jemalloc{1,2} and tcmalloc, lowers page fault counts and thus
> improves runtime. It can be seen that tcmalloc+mlock on a thp-less
> configuration is slightly better on runtime than glibc. The latter
> spends a ton of time in the kernel, probably handling minor page
> faults, and the former burns cpu in user space doing memcpy-s. So "tons
> of memcpys" seems to be competitive with what glibc is doing in this
> benchmark.

When I taught jemalloc to use the MREMAP_RETAIN flag it was getting
significant wins over glibc, so this might be caused by the time spent
managing metadata, etc.

> THP changes things however: apparently minor page faults become a lot
> cheaper, which makes the glibc case a lot faster than even the
> tcmalloc+mlock case. So in the THP case, the cost of page faults is
> smaller than the cost of a large memcpy.
>
> So results are somewhat mixed, but overall I'm not sure that I'm able
> to see a very convincing story for MREMAP_HOLE yet. However:
>
> 1) it is possible that I am missing something. If so, please, educate
> me.
>
> 2) if the kernel implements this API, I'm going to use it in tcmalloc.
>
> P.S. benchmark results also seem to indicate that tcmalloc could do
> something to explicitly enable THP and maybe better adapt to its
> presence. Perhaps with some collaboration with the kernel, i.e. to
> prevent that famous delay-ful-ness which causes people to disable THP.

BTW, THP currently interacts very poorly with the jemalloc/tcmalloc
madvise purging. The part where khugepaged assigns huge pages to dense
spans of pages is *great*. The part where the kernel hands out a huge
page for a fault in a 2M span can be awful. It causes the allocator's
internal model of uncommitted vs. committed pages to break down.

For example, the allocator might use 1M of a huge page and then start
purging. The purging will split it into 4k pages, so there will be 1M of
zeroed 4k pages that are considered purged by the allocator. Over time,
this can cripple purging. Search for "jemalloc huge pages" and you'll
find lots of horror stories about this.

I think a THP implementation that played well with purging would need to
drop the page fault heuristic and rely on a significantly better
khugepaged. This would mean faulting in a span of memory would no longer
be faster. Having a flag to populate a range with madvise would help a
lot though, since the allocator knows exactly how much it's going to
clobber with the memcpy. There will still be a threshold where mremap
gets significantly faster, but it would move it higher.
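To make the purging breakdown a couple of paragraphs up concrete, here's
a sketch of the interaction at the syscall level. It assumes 2M huge
pages and THP enabled for anonymous memory; whether a huge page actually
gets handed out depends on the THP settings and on alignment:

/* Fault in a 2M-aligned anonymous span, which THP may back with a single
 * huge page, then purge half of it the way an allocator would with
 * MADV_DONTNEED. The partial discard forces the kernel to split the huge
 * page back into 4k pages: the kept half becomes ordinary small pages
 * and the discarded half reads back as zero-filled 4k pages, even though
 * the allocator's bookkeeping says that range was purged. AnonHugePages
 * in /proc/self/smaps drops back to 0 kB for this mapping. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define HPAGE (2UL * 1024 * 1024)

int main(void) {
    /* Over-allocate so we can pick a 2M-aligned start inside the mapping. */
    size_t len = 2 * HPAGE;
    char *raw = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return 1;
    char *p = (char *)(((uintptr_t)raw + HPAGE - 1) & ~(HPAGE - 1));

    memset(p, 1, HPAGE);                               /* fault; may get a huge page */
    madvise(p + HPAGE / 2, HPAGE / 2, MADV_DONTNEED);  /* "purge" the top 1M */

    /* Pause here and compare AnonHugePages for this range in
     * /proc/self/smaps before and after the madvise call. */
    getchar();

    munmap(raw, len);
    return 0;
}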
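And on the populate-with-madvise point: this is roughly the call the
allocator would make right before clobbering the destination with the
memcpy. It's only a sketch; MADV_POPULATE_WRITE is a much later Linux
addition along these lines, so the snippet assumes a kernel and headers
that provide it and falls back to faulting during the copy otherwise.
The move_allocation() helper is hypothetical, and dst/src are assumed to
be page-aligned allocator chunks:

#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23          /* missing from older libc headers */
#endif

/* Hypothetical realloc slow-path helper: pre-fault exactly the range the
 * copy will touch, instead of taking one minor fault per 4k page in the
 * middle of the memcpy. */
static void move_allocation(void *dst, const void *src, size_t move_size) {
    /* Best effort: an older kernel rejects the advice and we simply fall
     * back to faulting the destination in during the copy. */
    (void)madvise(dst, move_size, MADV_POPULATE_WRITE);

    memcpy(dst, src, move_size);

    /* Hand the source pages back, as tcmalloc/jemalloc purging does.
     * The cast is only because madvise takes a non-const pointer. */
    (void)madvise((void *)src, move_size, MADV_DONTNEED);
}

int main(void) {
    size_t sz = 4UL * 1024 * 1024;
    void *src = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    void *dst = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (src == MAP_FAILED || dst == MAP_FAILED)
        return 1;
    memset(src, 0xa5, sz);              /* data worth moving */
    move_allocation(dst, src, sz);
    return 0;
}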