On 2/3/14, 5:31 PM, "Minchan Kim" <minchan@xxxxxxxxxx> wrote:
>While discussing this with Johannes, I've been leaning toward implementing
>MADV_FREE for Linux instead of the vrange syscall for allocators.
>The reason I preferred the vrange syscall over MADV_FREE is that vrange
>is almost O(1), so it's a really lightweight system call, although it needs
>one more syscall to unmark volatility, while MADV_FREE is O(#pages). But
>as Johannes pointed out, the kernel trend these days is toward huge pages
>(e.g. 2M), so I guess the overhead is really big.
>
>(Another topic: If an application wants to use huge pages on Linux,
>it should mmap a region aligned to the huge page size, but when
>I read the jemalloc source code, it seems not to. Do you have any reason?)

jemalloc uses 4 MiB naturally aligned chunks by default (chunk size can be
any power of 2 that is at least two pages), so by default jemalloc does
align its mappings to huge page boundaries. However, chunks have embedded
metadata headers, which means that in practice, only the second half of
each chunk can be madvise()d away if only huge pages are in use.
Additionally, the overhead of using even one huge page per size class
would be unacceptable for most applications (2 MiB * ~30 size classes *
number of active arenas), so adjusting the allocator's layout algorithms
to use huge pages would require a very different strategy than is
currently used, and the likelihood of having huge pages completely drain
of allocations would be quite low. On top of that, the implicit nature of
transparent huge pages makes them difficult to reliably account for in
userland. In other words, huge pages and explicit dirty page purging are
for most practical purposes incompatible.

>As a bonus point, many allocators already have logic to use MADV_FREE,
>so it's really easy to use it if Linux starts to support it.

MADV_FREE is certainly an easy interface to use, and as long as there
aren't any serious scalability issues in the implementation (e.g.
concurrent madvise() calls for disjoint virtual addresses from multiple
threads should be contention-free), I think it's perfectly adequate.

>Do you see any other point where the lightweight vrange syscall is
>superior to MADV_FREE of a big chunk all at once?

Other than system call overhead, volatile ranges and MADV_FREE are both
great for jemalloc's purposes. MADV_FREE is a bit easier to deal with,
mainly because volatile ranges are distinct from dirty pages, and virtual
memory coalescing in jemalloc will require some additional work to
logically treat adjacent volatile/dirty ranges as contiguous, but that's a
solvable problem.

Thanks,
Jason
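
To put rough numbers on the huge page points above, here is a minimal
Python sketch of the arithmetic. The 2 MiB huge page size, 4 MiB chunk
size, and ~30 size classes come from the message. The function names, the
arena count, and the header size are hypothetical values chosen only for
illustration, and the header is assumed to sit at the start of a
huge-page-aligned chunk.

```python
MIB = 1024 * 1024
HUGE_PAGE = 2 * MIB      # huge page size cited in the thread
CHUNK = 4 * MIB          # default jemalloc chunk size per the message
SIZE_CLASSES = 30        # "~30 size classes" from the message (approximate)


def per_class_overhead(arenas, size_classes=SIZE_CLASSES, huge_page=HUGE_PAGE):
    """Bytes held if every size class in every arena keeps one huge page."""
    return huge_page * size_classes * arenas


def purgeable_bytes(header_bytes, chunk=CHUNK, huge_page=HUGE_PAGE):
    """Bytes of one chunk lying in whole huge pages that contain no header.

    Assumes the chunk is huge-page aligned and the header starts at offset 0.
    """
    # Round the header size up to the next huge page boundary.
    first_free = -(-header_bytes // huge_page) * huge_page
    return max(chunk - first_free, 0)
```

With 4 active arenas (an arbitrary example count), per_class_overhead(4)
comes to 240 MiB. With a hypothetical 4 KiB header, purgeable_bytes(4096)
gives 2 MiB, i.e. only the second half of a 4 MiB chunk.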