Hello Daniel,

On Tue, Mar 24, 2015 at 10:39:32AM -0400, Daniel Micay wrote:
> On 24/03/15 01:25 AM, Aliaksey Kandratsenka wrote:
> >
> > Well, I don't have any workloads. I'm just maintaining a library that
> > others run various workloads on. Part of the problem is the lack of
> > good and varied malloc benchmarks which could allow us to prevent
> > regressions. So this makes me a bit more cautious on performance
> > matters.
> >
> > But I see your point. Indeed I have no evidence at all that exclusive
> > locking might cause an observable performance difference.
>
> I'm sure it matters, but I expect you'd need *many* cores running many
> threads before it started to outweigh the benefit of copying pages
> instead of data.
>
> Thinking about it a bit more, it would probably make sense for mremap
> to start with the optimistic assumption that the reader lock is enough
> here when using MREMAP_NOHOLE|MREMAP_FIXED. It only needs the writer
> lock if the destination mapping is incomplete or doesn't match, which
> is an edge case as holes would mean thread unsafety.
>
> An ideal allocator will toggle on PROT_NONE when overcommit is
> disabled, so this assumption would be wrong. The heuristic could just
> be adjusted to assume the dest VMA will match with
> MREMAP_NOHOLE|MREMAP_FIXED when full memory accounting isn't enabled.
> The fallback would never end up being needed in existing use cases
> that I'm aware of, and would just add the overhead of a quick lock,
> O(log n) check and unlock with the reader lock held anyway. Another
> flag isn't really necessary.
>
> >>> Another notable thing is how mlock effectively disables
> >>> MADV_DONTNEED for jemalloc{1,2} and tcmalloc, lowers the page fault
> >>> count and thus improves runtime. It can be seen that tcmalloc+mlock
> >>> on a THP-less configuration is slightly better on runtime than
> >>> glibc. The latter spends a ton of time in the kernel, probably
> >>> handling minor page faults, and the former burns CPU in user space
> >>> doing memcpys. So "tons of memcpys" seems to be competitive with
> >>> what glibc is doing in this benchmark.
> >>
> >> mlock disables MADV_DONTNEED, so this is an unfair comparison. With
> >> it, the allocator will use more memory than expected.
> >
> > Do not agree with unfair. I'm actually hoping MADV_FREE to provide
> > most if not all of the benefits of mlock in this benchmark. I believe
> > it's not too unreasonable an expectation.
>
> MADV_FREE will still result in as many page faults, just no zeroing.

I haven't followed this thread closely, but since you mentioned that
MADV_FREE will still cause many page faults, let me jump in here. One of
the benefits of MADV_FREE in the current implementation is that it avoids
the page faults as well as the zeroing. Why did you see many page faults?

> I get ~20k requests/s with jemalloc on the ebizzy benchmark with this
> dual core ivy bridge laptop. It jumps to ~60k requests/s with MADV_FREE
> IIRC, but disabling purging via MALLOC_CONF=lg_dirty_mult:-1 leads to
> 3.5 *million* requests/s. It has a similar impact with TCMalloc.

When I tested MADV_FREE with ebizzy, I saw a similar result: two or three
times faster than MADV_DONTNEED. But it isn't free. MADV_FREE has a cost
of its own (i.e., enumerating all of the page table entries in the range,
clearing the dirty bits and flushing the TLB), and of course it takes
mmap_sem with the read-side lock. If you see such a big improvement when
you disable purging, I guess it's mainly because mmap_sem is no longer
taken, so some threads can allocate while other threads handle page
faults.
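To make the comparison concrete, here is a minimal user-space sketch of
the three purging strategies being discussed. This is not code from
jemalloc or tcmalloc: the purge() helper, the purge_mode enum, the 64MB
buffer and the hard-coded MADV_FREE value are assumptions for
illustration only, since MADV_FREE is not in released kernel headers yet.

#include <string.h>
#include <sys/mman.h>

/* MADV_FREE is not in the released uapi headers yet; 8 is the value used
 * by the proposed patches, so treat this define as an assumption. */
#ifndef MADV_FREE
#define MADV_FREE 8
#endif

enum purge_mode { PURGE_NONE, PURGE_DONTNEED, PURGE_FREE };

/*
 * Hypothetical purge step of an allocator returning a run of dirty pages.
 *
 * PURGE_DONTNEED: pages are dropped immediately, so the next touch takes
 *                 a minor fault and gets a freshly zeroed page.
 * PURGE_FREE:     no refault and no zeroing on reuse (as long as reclaim
 *                 has not freed the pages in the meantime), but madvise()
 *                 still walks every page table entry in the range, clears
 *                 the dirty bit and flushes the TLB with mmap_sem held
 *                 for read.
 * PURGE_NONE:     roughly what MALLOC_CONF=lg_dirty_mult:-1 does in
 *                 jemalloc: skip madvise() entirely, keeping the memory
 *                 resident and never touching mmap_sem on this path.
 */
static void purge(void *addr, size_t len, enum purge_mode mode)
{
    switch (mode) {
    case PURGE_DONTNEED:
        madvise(addr, len, MADV_DONTNEED);
        break;
    case PURGE_FREE:
        /* Fall back to MADV_DONTNEED on kernels without MADV_FREE. */
        if (madvise(addr, len, MADV_FREE) != 0)
            madvise(addr, len, MADV_DONTNEED);
        break;
    case PURGE_NONE:
        break;
    }
}

int main(void)
{
    size_t len = 64UL << 20;       /* 64MB, arbitrary for illustration */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    memset(p, 1, len);             /* dirty the pages */
    purge(p, len, PURGE_FREE);     /* "free" them back to the kernel */
    memset(p, 1, len);             /* reuse: refaults only with DONTNEED */

    munmap(p, len);
    return 0;
}

Counting minor faults around the second memset() (e.g. with getrusage()
and ru_minflt) is an easy way to see the refault difference between
MADV_DONTNEED and MADV_FREE on a kernel with the patches applied.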
The reason I suspect mmap_sem is the main factor is that I saw a similar
result when I implemented the vrange syscall, which holds the mmap_sem
read-side lock for only a very short time (i.e., just marking the VMA as
volatile, which is O(1)), whereas MADV_FREE holds the lock while it
enumerates all of the pages in the range, which is O(N).

--
Kind regards,
Minchan Kim