Re: [PATCH] mremap: add MREMAP_NOHOLE flag --resend

On 24/03/15 01:25 AM, Aliaksey Kandratsenka wrote:
> 
> Well, I don't have any workloads. I'm just maintaining a library that
> others run various workloads on. Part of the problem is the lack of good
> and varied malloc benchmarks which would allow us to prevent
> regressions. So this makes me a bit more cautious on performance
> matters.
> 
> But I see your point. Indeed I have no evidence at all that exclusive
> locking might cause an observable performance difference.

I'm sure it matters, but I expect you'd need *many* cores running many
threads before it started to outweigh the benefit of copying pages
instead of data.

Thinking about it a bit more, it would probably make sense for mremap to
start with the optimistic assumption that the reader lock is enough here
when using MREMAP_NOHOLE|MREMAP_FIXED. It only needs the writer lock if
the destination mapping is incomplete or doesn't match, which is an edge
case as holes would mean thread unsafety.

An ideal allocator will toggle PROT_NONE when overcommit is disabled, so
this assumption would be wrong there. The heuristic could just be adjusted
to assume the dest VMA will match with MREMAP_NOHOLE|MREMAP_FIXED when
full memory accounting isn't enabled. The fallback would never end up
being needed in any existing use case that I'm aware of, and would just
add the overhead of a quick lock, O(log n) check and unlock with the
reader lock held anyway. Another flag isn't really necessary.
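
For reference, here's roughly the userspace call sequence I have in mind.
It's only a sketch: MREMAP_NOHOLE is the flag proposed in this patch, so
the constant below is a placeholder and the program only does something
useful on a patched kernel. The destination is an existing anonymous
mapping owned by the allocator, which is the case the reader-lock fast
path would cover.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MREMAP_NOHOLE
#define MREMAP_NOHOLE 4   /* placeholder for the proposed flag's value */
#endif

int main(void)
{
    size_t len = 2 * 1024 * 1024;

    /* Both regions are plain anonymous mappings owned by the allocator. */
    char *src  = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    char *dest = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (src == MAP_FAILED || dest == MAP_FAILED)
        return 1;

    memset(src, 0xaa, len);   /* dirty the source pages */

    /* Move the page tables rather than copying the data; with NOHOLE the
     * source range stays mapped (faulting in fresh zero pages on the next
     * touch) instead of leaving a hole other threads could trip over. */
    if (mremap(src, len, len,
               MREMAP_MAYMOVE | MREMAP_FIXED | MREMAP_NOHOLE,
               dest) == MAP_FAILED) {
        perror("mremap");   /* fails on a kernel without the patch */
        return 1;
    }

    printf("dest[0] = %#x\n", (unsigned char)dest[0]);
    return 0;
}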

>>> Another notable thing is how mlock effectively disables MADV_DONTNEED for
>>> jemalloc{1,2} and tcmalloc, lowers the page fault count and thus improves
>>> runtime. It can be seen that tcmalloc+mlock on a thp-less configuration is
>>> slightly better on runtime than glibc. The latter spends a ton of time in
>>> the kernel, probably handling minor page faults, and the former burns cpu
>>> in user space doing memcpy-s. So "tons of memcpys" seems to be competitive
>>> with what glibc is doing in this benchmark.
>>
>> mlock disables MADV_DONTNEED, so this is an unfair comparison. With it,
>> the allocator will use more memory than expected.
> 
> I do not agree that it's unfair. I'm actually hoping for MADV_FREE to
> provide most if not all of the benefits of mlock in this benchmark. I
> believe that's not too unreasonable an expectation.

MADV_FREE will still result in as many page faults, just no zeroing.
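
To make that concrete, this is the per-purge choice an allocator ends up
making, sketched here with the proposed MADV_FREE (the constant is just a
placeholder since it isn't in mainline headers at the time of writing):

#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8   /* placeholder for the proposed advice value */
#endif

/* Purge a run of free pages: MADV_DONTNEED drops them immediately, so the
 * next touch faults in a zeroed page; MADV_FREE (as proposed) only marks
 * them for lazy reclaim under memory pressure. */
static int purge_run(void *addr, size_t len, int lazy)
{
    return madvise(addr, len, lazy ? MADV_FREE : MADV_DONTNEED);
}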

I get ~20k requests/s with jemalloc on the ebizzy benchmark with this
dual-core Ivy Bridge laptop. It jumps to ~60k requests/s with MADV_FREE
IIRC, but disabling purging via MALLOC_CONF=lg_dirty_mult:-1 leads to
3.5 *million* requests/s. It has a similar impact with TCMalloc.

>> I'm kind of confused about why we talk about THP and mlock here. When an
>> application uses an allocator, it doesn't need to be forced to use THP or
>> mlock. Can we focus on the normal case?
> 
> See my note on mlock above.
> 
> THP actually is "normal". I know for certain that many production
> workloads are run on boxes with THP enabled. Red Hat famously ships
> its distros with THP set to "always". And I also know that many other
> production workloads are run on boxes with THP disabled. Also, as
> seen above, "teleporting" pages is more efficient with THP due to the
> much smaller overhead of moving those pages. So I felt it was important
> not to omit THP in my runs.

Yeah, it's quite normal for it to be enabled. Allocators might as well
give up on fine-grained purging when it is though :P. I think it only
really makes sense to purge at 2M boundaries in multiples of 2M if THP
is going to end up breaking any finer-grained purging over the long term
anyway.
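
Concretely, the coarser policy I mean would look something like this
(just a sketch; the 2M constant assumes x86-64 THP):

#include <stdint.h>
#include <sys/mman.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)

/* Only hand back whole 2M-aligned chunks, so a huge page is either
 * released entirely or left alone, never split by a partial purge. */
static void purge_aligned(char *start, char *end)
{
    uintptr_t lo = ((uintptr_t)start + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
    uintptr_t hi = (uintptr_t)end & ~(HPAGE_SIZE - 1);

    if (hi > lo)
        madvise((void *)lo, hi - lo, MADV_DONTNEED);
    /* anything smaller than a full huge page is simply kept */
}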

I was originally only testing with THP since Arch uses "always", but I
realized it had an enormous impact and started testing without it too.
