First of all, let's define the term. From now on, I'd like to call it vrange (a.k.a. volatile range) for anonymous pages. If you have a better name in mind, please suggest it.

This version is still *RFC* because it's just a quick prototype, so it doesn't support THP/HugeTLB/KSM and doesn't even build on !x86. Before sorting out further issues, I'd like to post the current direction and discuss it. Of course, I'd like to extend this discussion at the coming LSF/MM.

In this version, I changed lots of things. In particular, I removed the vma-based approach because it needs the write-side lock of mmap_sem, which hurts performance on multi-threaded big SMP systems, as KOSAKI pointed out. The vma-based approach also makes it hard to meet the requirements of the new system call semantics that John Stultz suggested for consistent purged handling.
(http://linux-kernel.2935.n7.nabble.com/RFC-v5-0-8-Support-volatile-for-anonymous-range-tt575773.html#none)

I tested this patchset with a modified jemalloc allocator. That work was led by Jason Evans (the jemalloc author), who was interested in this feature and was happy to port his allocator to the new system call. Super thanks, Jason!

The benchmark is ebizzy. It has been used for testing allocator performance, so it's a good fit for me. Again, thanks for recommending the benchmark, Jason.
(http://people.freebsd.org/~kris/scaling/ebizzy.html)

The result is good on my machine (12 CPU, 1.2GHz, DRAM 2G), about 3.9x without and about 4.3x with background memory pressure:

	ebizzy -S 20

	jemalloc-vanilla: 52389 records/sec
	jemalloc-vrange: 203414 records/sec

	ebizzy -S 20 with background memory pressure

	jemalloc-vanilla: 40746 records/sec
	jemalloc-vrange: 174910 records/sec

And it's much improved on a KVM virtual machine.

This patchset is based on v3.9-rc2.

- What's sys_vrange(addr, length, mode, behavior)?

  It's a hint that the user delivers to the kernel so the kernel can *discard* pages in a range anytime. mode is one of VRANGE_VOLATILE and VRANGE_NOVOLATILE.
  VRANGE_NOVOLATILE is a memory pin operation, so the kernel can't discard any pages in the range anymore, while VRANGE_VOLATILE is a memory unpin operation, so the kernel can discard pages in the vrange anytime. At the moment, behavior is one of VRANGE_FULL and VRANGE_PARTIAL. VRANGE_FULL tells the kernel that once it decides to discard a page in a vrange, it should discard all of the pages in the vrange selected as the victim. VRANGE_PARTIAL tells the kernel to discard only some of the pages in a vrange. VRANGE_PARTIAL handling isn't implemented yet.

- What happens if the user accesses a page (i.e., virtual address) discarded by the kernel?

  The user can encounter SIGBUS.

- What should the user do to avoid SIGBUS?

  He should call vrange(addr, length, VRANGE_NOVOLATILE, behavior) before accessing a range on which vrange(addr, length, VRANGE_VOLATILE, behavior) was called.

- What happens if the user accesses a page (i.e., virtual address) that wasn't discarded by the kernel?

  The user can see the valid data which was there before calling vrange(., VRANGE_VOLATILE), without a page fault.

- What's the difference from madvise(DONTNEED)?

  System call semantic

  DONTNEED makes sure the user always sees zero-filled pages after calling madvise, while with vrange the user can see the old data or encounter SIGBUS.

  Internal implementation

  madvise(DONTNEED) has to zap all mapped pages in the range, so its overhead increases linearly with the number of mapped pages. Moreover, if the user then writes to the zapped pages, page fault + page allocation + memset have to happen. vrange just registers an address range instead of zapping all of the ptes in the vma, so it doesn't touch ptes at all.

- What's the benefit compared to DONTNEED?

  1. The system call overhead is smaller because vrange just registers a range in an interval tree instead of zapping all the pages in the range, so the overhead should be really cheap.

  2. It has a chance to eliminate overheads (e.g., zapping pte + page fault + page allocation + memset(PAGE_SIZE)) if memory pressure isn't severe.

  3. It has the potential to zap all ptes and free the pages if memory pressure is severe, so the discard scanning overhead could be smaller. - TODO

- What's the target?

  Firstly, user-space allocators like ptmalloc and jemalloc, or the heap management of virtual machines like Dalvik. It also comes in handy for embedded systems which don't have a swap device and so can't reclaim anonymous pages. By discarding instead of swapping out, it can be used on systems without swap.

Changelog from v6 - There are many changes.
 * Remove vma-based approach
 * Change system call semantic
 * Add more meaningful experiment

Changelog from v5 - There are many changes.
 * Support CONFIG_VOLATILE_PAGE
 * Working with THP/KSM
 * Remove vma hacking logic in m[no]volatile system call
 * Discard page without swap cache
 * Kswapd discards volatile pages, so we can discard volatile pages even though we don't have swap.

Changelog from v4
 * Add new system call mvolatile/mnovolatile
 * Add sigbus when user tries to access volatile range
 * Rebased on v3.7
 * Applied bug fix from John Stultz, Thanks!

Changelog from v3
 * Removing madvise(addr, length, MADV_NOVOLATILE).
 * add vmstat about the number of discarded volatile pages
 * discard volatile pages without promotion in reclaim path

Minchan Kim (11):
  vrange: enable generic interval tree
  add vrange basic data structure and functions
  add new system call vrange(2)
  add proc/pid/vrange information
  Add purge operation
  send SIGBUS when user try to access purged page
  keep mm_struct to vrange when system call context
  add LRU handling for victim vrange
  Get rid of depenceny that all pages is from a zone in shrink_page_list
  Purging vrange pages without swap
  add purged page information in vmstat

 arch/x86/include/asm/pgtable_types.h   |   2 +
 arch/x86/syscalls/syscall_64.tbl       |   1 +
 fs/proc/base.c                         |   1 +
 fs/proc/internal.h                     |   6 +
 fs/proc/task_mmu.c                     | 129 ++++++
 include/asm-generic/pgtable.h          |  11 +
 include/linux/mm_types.h               |   5 +
 include/linux/rmap.h                   |  15 +-
 include/linux/swap.h                   |   1 +
 include/linux/vm_event_item.h          |   4 +
 include/linux/vrange.h                 |  59 +++
 include/uapi/asm-generic/mman-common.h |   5 +
 init/main.c                            |   2 +
 kernel/fork.c                          |   3 +
 lib/Makefile                           |   2 +-
 mm/Makefile                            |   2 +-
 mm/ksm.c                               |   2 +-
 mm/memory.c                            |  24 +-
 mm/rmap.c                              |  23 +-
 mm/swapfile.c                          |  36 ++
 mm/vmscan.c                            |  74 +++-
 mm/vmstat.c                            |   4 +
 mm/vrange.c                            | 754 +++++++++++++++++++++++++++++++++
 23 files changed, 1143 insertions(+), 22 deletions(-)
 create mode 100644 include/linux/vrange.h
 create mode 100644 mm/vrange.c

-- 
1.8.1.1
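
For reference, the calling pattern described in the vrange Q&A above (mark a range volatile when it is freed, mark it non-volatile again before touching it) could look roughly like the sketch below from an allocator's point of view. This is only an illustration under assumptions: the numeric flag values, the user-space vrange() stub, struct chunk and the chunk_release()/chunk_reuse() helpers are all hypothetical and are not taken from the patchset. The stub never enters the kernel; it only lets the pattern compile and run on an unpatched system.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical numeric values: the cover letter only names these flags. */
#define VRANGE_VOLATILE   0
#define VRANGE_NOVOLATILE 1
#define VRANGE_FULL       0

/*
 * Stand-in for the proposed vrange(2). A real call would enter the kernel;
 * this stub accepts any non-empty range and reports success so that the
 * calling pattern can be exercised without the patchset applied.
 */
static int vrange(void *addr, size_t length, int mode, int behavior)
{
    (void)mode;
    (void)behavior;
    return (addr != NULL && length > 0) ? 0 : -1;
}

/* A free chunk held by a hypothetical allocator cache. */
struct chunk {
    void  *addr;
    size_t length;
    bool   is_volatile;   /* true while the kernel may discard the pages */
};

/* On free: unpin the chunk instead of zapping it with madvise(DONTNEED). */
static int chunk_release(struct chunk *c)
{
    assert(!c->is_volatile);
    if (vrange(c->addr, c->length, VRANGE_VOLATILE, VRANGE_FULL) != 0)
        return -1;
    c->is_volatile = true;
    return 0;
}

/* Before reuse: pin the chunk again, as the Q&A asks, to avoid SIGBUS. */
static int chunk_reuse(struct chunk *c)
{
    assert(c->is_volatile);
    if (vrange(c->addr, c->length, VRANGE_NOVOLATILE, VRANGE_FULL) != 0)
        return -1;
    c->is_volatile = false;
    return 0;
}
```

An allocator would call chunk_release() where it calls madvise(DONTNEED) today and chunk_reuse() right before handing the chunk out again. Since the chunk is free memory, it does not matter whether the old contents survived in between.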