Re: [RFC v7 00/11] Support vrange for anonymous page

On Tue, Mar 12, 2013 at 12:38 AM, Minchan Kim <minchan@xxxxxxxxxx> wrote:
> First of all, let's define the term.
> From now on, I'd like to call it vrange (a.k.a. volatile range)
> for anonymous pages. If you have a better name in mind, please suggest it.
>
> This version is still an *RFC* because it's just a quick prototype, so
> it doesn't support THP/HugeTLB/KSM and doesn't even build on !x86.
> Before sorting out further issues, I'd like to post the current direction
> and discuss it. Of course, I'd like to extend this discussion at the
> upcoming LSF/MM.
>
> In this version, I changed lots of things. In particular, I removed the
> vma-based approach because it needs the write-side lock of mmap_sem, which
> hurts performance on multi-threaded big SMP systems, as KOSAKI pointed out.
> The vma-based approach also makes it hard to meet the requirements of the
> new system call semantics that John Stultz suggested for consistent purged
> handling.
> (http://linux-kernel.2935.n7.nabble.com/RFC-v5-0-8-Support-volatile-for-anonymous-range-tt575773.html#none)
>
> I tested this patchset with a modified jemalloc allocator. The port was
> led by Jason Evans (jemalloc author), who was interested in this feature
> and was happy to port his allocator to use the new system call.
> Super thanks, Jason!
>
> The benchmark for the test is ebizzy. It has been used for testing
> allocator performance, so it suits my purpose. Again, thanks for
> recommending the benchmark, Jason.
> (http://people.freebsd.org/~kris/scaling/ebizzy.html)
>
> The result is good on my machine (12 CPU, 1.2GHz, DRAM 2G)
>
>         ebizzy -S 20
>
> jemalloc-vanilla: 52389 records/sec
> jemalloc-vrange: 203414 records/sec
>
>         ebizzy -S 20 with background memory pressure
>
> jemalloc-vanilla: 40746 records/sec
> jemalloc-vrange: 174910 records/sec
>
> And it's much improved on a KVM virtual machine, too.
>
> This patchset is based on v3.9-rc2
>
> - What is sys_vrange(addr, length, mode, behavior)?
>
>   It's a hint that the user delivers to the kernel so the kernel can
>   *discard* pages in a range at any time. mode is one of VRANGE_VOLATILE
>   and VRANGE_NOVOLATILE. VRANGE_NOVOLATILE is a memory pin operation, so
>   the kernel can no longer discard any pages, while VRANGE_VOLATILE
>   is a memory unpin operation, so the kernel can discard pages in the
>   vrange at any time. At the moment, behavior is one of VRANGE_FULL and
>   VRANGE_PARTIAL. VRANGE_FULL tells the kernel that once it decides to
>   discard a page in a vrange, it should discard all of the pages in the
>   vrange selected as the victim. VRANGE_PARTIAL tells the kernel
>   that it may discard only some of the pages in a vrange. I haven't
>   implemented VRANGE_PARTIAL handling yet.
>
> - What happens if the user accesses a page (i.e., virtual address)
>   discarded by the kernel?
>
>   The user can encounter SIGBUS.
>
> - What should the user do to avoid SIGBUS?
>   He should call vrange(addr, length, VRANGE_NOVOLATILE, behavior) before
>   accessing a range on which
>   vrange(addr, length, VRANGE_VOLATILE, behavior) was called.
>
> - What happens if the user accesses a page (i.e., virtual address) that
>   has not been discarded by the kernel?
>
>   The user can see the valid data that was there before calling
>   vrange(., VRANGE_VOLATILE), without a page fault.
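
The access rules in the three questions above can be summarised with a small
userspace toy model. This is only a sketch and not the patchset's code: the
mode names come from the cover letter, but their numeric values, the toy_*
helpers, and the outcome of touching a range that was purged and then marked
VRANGE_NOVOLATILE (modelled here as "contents lost, no SIGBUS") are
assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Mode names come from the cover letter; the numeric values are made up. */
enum toy_mode { VRANGE_VOLATILE = 0, VRANGE_NOVOLATILE = 1 };

/* What a user access to the range observes in this toy model. */
enum toy_access { ACCESS_OLD_DATA, ACCESS_SIGBUS, ACCESS_AFTER_PURGE };

/* One address range; a stand-in for kernel state, not kernel code. */
struct toy_range {
    bool is_volatile; /* unpinned: the kernel may discard it at any time */
    bool purged;      /* the kernel has discarded its pages */
};

/* Models vrange(addr, length, mode, behavior) for a single range. */
static void toy_vrange(struct toy_range *r, enum toy_mode mode)
{
    r->is_volatile = (mode == VRANGE_VOLATILE);
}

/* Models reclaim under memory pressure: only volatile ranges are discarded. */
static void toy_memory_pressure(struct toy_range *r)
{
    if (r->is_volatile)
        r->purged = true;
}

/* Models a user access to the range. */
static enum toy_access toy_touch(const struct toy_range *r)
{
    if (!r->purged)
        return ACCESS_OLD_DATA;    /* data still there, no page fault */
    if (r->is_volatile)
        return ACCESS_SIGBUS;      /* purged and still marked volatile */
    return ACCESS_AFTER_PURGE;     /* assumption: contents lost, no SIGBUS */
}
```

In this model a range marked volatile keeps returning its old data until
toy_memory_pressure() runs; only after that does an access report SIGBUS,
and marking the range VRANGE_NOVOLATILE first avoids the signal.
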
>
> - How does it differ from madvise(DONTNEED)?
>
>   System call semantics
>
>   DONTNEED makes sure the user always sees zero-fill pages after
>   he calls madvise, while with vrange the user can see the old data or
>   encounter SIGBUS.
>
>   Internal implementation
>
>   madvise(DONTNEED) has to zap all mapped pages in the range, so
>   overhead increases linearly with the number of mapped pages.
>   Moreover, if the user writes to a zapped page, a page fault +
>   page allocation + memset has to happen.
>
>   vrange just registers an address range instead of zapping all of the
>   ptes in the vma, so it doesn't touch the ptes at all.
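
For comparison, the existing madvise(MADV_DONTNEED) behaviour described above
can be observed from userspace on Linux. The sketch below is only an
illustration: byte_after_dontneed() is a hypothetical helper name, and it
exercises a private anonymous mapping, which is the case the cover letter is
about.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Writes 0xAB to one private anonymous page, drops it with
 * madvise(MADV_DONTNEED), and returns the first byte read afterwards.
 * Returns -1 on any failure.
 */
int byte_after_dontneed(void)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    size_t len = (size_t)page;
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAB, len);               /* populate the page */

    if (madvise(p, len, MADV_DONTNEED) != 0) {
        munmap(p, len);
        return -1;
    }

    int value = p[0];                   /* page fault; old contents are gone */
    munmap(p, len);
    return value;
}
```

On Linux the returned byte is 0 rather than 0xAB: the old contents are
dropped at madvise time and the next access has to fault in a zero-filled
page, which is the cost vrange tries to avoid when memory pressure is low.
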
>
> - What's the benefit compared to DONTNEED?
>
>   1. The system call overhead is smaller because vrange just registers
>      a range in an interval tree instead of zapping all the pages in the
>      range, so the overhead should be really small.
>
>   2. It has a chance to eliminate overheads (e.g., zapping pte + page fault
>      + page allocation + memset(PAGE_SIZE)) if memory pressure isn't
>      severe.
>
>   3. It has the potential to zap all ptes and free the pages if memory
>      pressure is severe, so discard scanning overhead could be smaller - TODO
>
> - What is it targeting?
>
>   Firstly, user-space allocators like ptmalloc and jemalloc, or the heap
>   management of virtual machines like Dalvik. It also comes in handy for
>   embedded systems that don't have a swap device and so can't reclaim
>   anonymous pages. By discarding instead of swapping out, it can be used
>   on systems without swap.

I think that another potentially useful use-case would be using this
-- or a similar API -- to opportunistically return deep user stack
frames.

This is another place where we strongly care about the time-to-free as
well as the time-to-reallocate in the case of relatively immediate
re-use.

>
> Changelog from v6 - There are many changes.
>  * Remove vma-based approach
>  * Change system call semantic
>  * Add more meaningful experiment
>
> Changelog from v5 - There are many changes.
>
>  * Support CONFIG_VOLATILE_PAGE
>  * Working with THP/KSM
>  * Remove vma hacking logic in m[no]volatile system call
>  * Discard page without swap cache
>  * Kswapd discards volatile pages, so we can discard volatile pages
>    even when we don't have swap.
>
> Changelog from v4
>
>  * Add new system call mvolatile/mnovolatile
>  * Add SIGBUS when the user tries to access a volatile range
>  * Rebased on v3.7
>  * Applied bug fix from John Stultz, Thanks!
>
> Changelog from v3
>
>  * Removing madvise(addr, length, MADV_NOVOLATILE).
>  * add vmstat about the number of discarded volatile pages
>  * discard volatile pages without promotion in reclaim path
>
> Minchan Kim (11):
>   vrange: enable generic interval tree
>   add vrange basic data structure and functions
>   add new system call vrange(2)
>   add proc/pid/vrange information
>   Add purge operation
>   send SIGBUS when user try to access purged page
>   keep mm_struct to vrange when system call context
>   add LRU handling for victim vrange
>   Get rid of depenceny that all pages is from a zone in shrink_page_list
>   Purging vrange pages without swap
>   add purged page information in vmstat
>
>  arch/x86/include/asm/pgtable_types.h   |   2 +
>  arch/x86/syscalls/syscall_64.tbl       |   1 +
>  fs/proc/base.c                         |   1 +
>  fs/proc/internal.h                     |   6 +
>  fs/proc/task_mmu.c                     | 129 ++++++
>  include/asm-generic/pgtable.h          |  11 +
>  include/linux/mm_types.h               |   5 +
>  include/linux/rmap.h                   |  15 +-
>  include/linux/swap.h                   |   1 +
>  include/linux/vm_event_item.h          |   4 +
>  include/linux/vrange.h                 |  59 +++
>  include/uapi/asm-generic/mman-common.h |   5 +
>  init/main.c                            |   2 +
>  kernel/fork.c                          |   3 +
>  lib/Makefile                           |   2 +-
>  mm/Makefile                            |   2 +-
>  mm/ksm.c                               |   2 +-
>  mm/memory.c                            |  24 +-
>  mm/rmap.c                              |  23 +-
>  mm/swapfile.c                          |  36 ++
>  mm/vmscan.c                            |  74 +++-
>  mm/vmstat.c                            |   4 +
>  mm/vrange.c                            | 754 +++++++++++++++++++++++++++++++++
>  23 files changed, 1143 insertions(+), 22 deletions(-)
>  create mode 100644 include/linux/vrange.h
>  create mode 100644 mm/vrange.c
>
> --
> 1.8.1.1
>
