Hi,

On Thu, Jan 03, 2013 at 10:35:24AM -0800, Taras Glek wrote:
> On 1/2/2013 8:27 PM, Minchan Kim wrote:
> >This patch adds the new system calls m[no]volatile.
> >If someone asks for an is_volatile system call, it could be added, too.
> >
> >The reason I introduced new system calls instead of madvise is that
> >m[no]volatile's vma handling is totally different from madvise's vma
> >handling.
> >
> >1) m[no]volatile should succeed even if the range includes unmapped
> >   or non-volatile ranges. It just skips such ranges without stopping
> >   and without returning an error when it encounters an invalid range.
> >   This is convenient for users because it avoids several system calls
> >   over small ranges - Suggested by John Stultz
> >
> >2) The purged state of a volatile range should be propagated out to
> >   userspace even if the range was merged with an adjacent
> >   non-volatile range when the user calls mnovolatile.
> >
> >3) mvolatile's interface could diverge from madvise's in a future
> >   discussion. For example, I feel we may need
> >   mvolatile(start, len, mode).
> >   'mode' means FULL_VOLATILE or PARTIAL_VOLATILE.
> >   FULL_VOLATILE means that if the VM decides to reclaim the range, it
> >   reclaims all of the pages in the range, while in the
> >   PARTIAL_VOLATILE case the VM may reclaim just a small number of
> >   pages in the range.
> >   In the tmpfs-volatile case, the user may have to regenerate all of
> >   the image data once any page in the range is discarded, so it is
> >   pointless for the VM to discard a single page of the range when
> >   memory pressure is severe.
> >   In the anon-volatile case, too much discarding causes too many
> >   minor faults for the allocator, so it would be better to discard
> >   only part of the range.

> I don't understand point 3).
> Are you saying that using mvolatile in conjunction with madvise could
> allow mvolatile behavior to be tweaked in the future? Or are you
> suggesting adding an extra parameter in the future (what would that
> have to do with madvise)?

I meant I might want to expand mvolatile's interface like below during
the discussion:

	int mvolatile(start, len, mode);

> 4) Having a new system call makes it easier for userspace apps to
> detect kernels without this functionality.

I couldn't understand your claim. Currently, mvolatile just returns
EINVAL on a !CONFIG_VOLATILE_PAGE system. Why is that easier than
getting EINVAL when we call madvise(VOLATILE) on a
!CONFIG_VOLATILE_PAGE system?

> I really like the proposed interface.

Thanks.

> I like the suggestion of having explicit FULL|PARTIAL_VOLATILE. Why
> not include PARTIAL_VOLATILE as a required 3rd param in the first
> version, with the expectation that FULL_VOLATILE will be added later
> (returning some not-supported error in the meantime)?

I just wanted to discuss whether we need it. The reason I need
PARTIAL_VOLATILE is that it avoids many minor faults for the allocator.
Is it useful for tmpfs-volatile, too?

Thanks for the feedback, Taras.
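To make the detection question concrete, here is a minimal userspace
sketch (not part of the patch; it assumes the x86_64 syscall numbers
this patch adds, and kernel_has_mvolatile() is a made-up name for the
example):

	#define _GNU_SOURCE
	#include <errno.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/* Syscall numbers as proposed by this patch (x86_64 table only) */
	#define __NR_mvolatile		313
	#define __NR_mnovolatile	314

	static long mvolatile(unsigned long start, size_t len)
	{
		return syscall(__NR_mvolatile, start, len);
	}

	/*
	 * A kernel without the system call fails with ENOSYS. A kernel
	 * built with !CONFIG_VOLATILE_PAGE still has the entry point but
	 * returns EINVAL, which a bad-argument failure also returns -
	 * the same ambiguity madvise(VOLATILE) would have.
	 */
	static int kernel_has_mvolatile(void)
	{
		return !(mvolatile(0, 0) == -1 && errno == ENOSYS);
	}

So an old kernel is cleanly detectable via ENOSYS either way; the case
a new system call does not help with is distinguishing
!CONFIG_VOLATILE_PAGE from an ordinary EINVAL failure.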
> >4) The mvolatile system call's return value is quite different from
> >madvise's. Look at the semantic explanation below.
> >
> >So I want to separate mvolatile from madvise.
> >
> >mvolatile(start, len)'s semantics
> >
> >1) It makes the range (start, len) volatile even if the range
> >includes unmapped areas, special mappings and mlocked areas, which
> >are just skipped.
> >
> >Returns -EINVAL if the range doesn't include a valid vma at all.
> >Returns -ENOMEM, interrupting the range operation, if memory is not
> >enough to merge/split vmas. In this case, some ranges would be
> >volatile and others not, so the user may call mvolatile again after
> >cancelling the whole range with mnovolatile.
> >Returns 0 if the range consists of only proper vmas.
> >Returns 1 if part of the range includes a hole/huge/ksm/mlock/special
> >area.
> >
> >2) If the user calls mvolatile on a range which is already a volatile
> >VMA, even one in the purged state, the VOLATILE attribute still
> >remains but the purged state is reset. I expect some users will want
> >to split a volatile vma into smaller ranges. Although they could do
> >it with mnovolatile(whole range) and several calls of
> >mvolatile(smaller range), this function can avoid the mnovolatile
> >step if they don't care about the purged state. I'm not sure we
> >really need this behavior, so I hope to hear opinions. Unfortunately,
> >the current implementation doesn't split a volatile VMA with a new
> >range in this case. I forgot to implement it in this version but
> >decided to send the patch anyway to gather opinions, since
> >implementing it is rather trivial once we decide.
> >
> >mnovolatile(start, len)'s semantics are as follows.
> >
> >1) It makes the range (start, len) non-volatile even if the range
> >includes unmapped areas, special mappings and non-volatile ranges,
> >which are just skipped.
> >
> >2) If the range was purged, it returns 1 regardless of whether it
> >includes an invalid range.

> If I understand this correctly:
> 	mvolatile(0, 10);
> 	// then range [9,10] is purged by the kernel
> 	mnovolatile(0, 4) will fail?
> That seems counterintuitive.
>
> One of the uses for mnovolatile is to atomically lock the pages (vs a
> racy proposed is_volatile syscall). The above situation would make it
> less effective.

> >3) It returns -ENOMEM if the system doesn't have enough memory for
> >the vma operation.
> >
> >4) It returns -EINVAL if the range doesn't include a valid vma at
> >all.
> >
> >5) If the user tries to access a purged range without calling
> >mnovolatile, the access gets a SIGBUS, which will show up in the next
> >patch.
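For illustration, here are the semantics above as a self-contained
userspace sketch (not part of the patch; it hard-codes the proposed
x86_64 syscall numbers and assumes a CONFIG_VOLATILE_PAGE=y kernel):

	#define _GNU_SOURCE
	#include <errno.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	#define __NR_mvolatile		313	/* proposed, x86_64 only */
	#define __NR_mnovolatile	314

	int main(void)
	{
		size_t page = sysconf(_SC_PAGESIZE);
		char *buf = mmap(NULL, 10 * page, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (buf == MAP_FAILED)
			return 1;

		/* 0: all proper vmas, 1: part of the range was skipped */
		long ret = syscall(__NR_mvolatile, (unsigned long)buf,
				   10 * page);
		if (ret < 0)
			return 1;

		/* ... the kernel may now discard pages in the range ... */

		/*
		 * Take back the first four pages. This returns 1 if the
		 * range was purged - per the thread above, possibly even
		 * when only pages elsewhere in the merged volatile vma
		 * were discarded, which is the case Taras raises.
		 */
		ret = syscall(__NR_mnovolatile, (unsigned long)buf,
			      4 * page);
		if (ret == 1)
			printf("purged: regenerate contents before reuse\n");
		return 0;
	}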
> >Cc: Michael Kerrisk <mtk.manpages@xxxxxxxxx>
> >Cc: Arun Sharma <asharma@xxxxxx>
> >Cc: sanjay@xxxxxxxxxx
> >Cc: Paul Turner <pjt@xxxxxxxxxx>
> >CC: David Rientjes <rientjes@xxxxxxxxxx>
> >Cc: John Stultz <john.stultz@xxxxxxxxxx>
> >Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> >Cc: Christoph Lameter <cl@xxxxxxxxx>
> >Cc: Android Kernel Team <kernel-team@xxxxxxxxxxx>
> >Cc: Robert Love <rlove@xxxxxxxxxx>
> >Cc: Mel Gorman <mel@xxxxxxxxx>
> >Cc: Hugh Dickins <hughd@xxxxxxxxxx>
> >Cc: Dave Hansen <dave@xxxxxxxxxxxxxxxxxx>
> >Cc: Rik van Riel <riel@xxxxxxxxxx>
> >Cc: Dave Chinner <david@xxxxxxxxxxxxx>
> >Cc: Neil Brown <neilb@xxxxxxx>
> >Cc: Mike Hommey <mh@xxxxxxxxxxxx>
> >Cc: Taras Glek <tglek@xxxxxxxxxxx>
> >Cc: KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxx>
> >Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
> >Signed-off-by: Minchan Kim <minchan@xxxxxxxxxx>
> >---
> > arch/x86/syscalls/syscall_64.tbl |   2 +
> > fs/exec.c                        |   4 +-
> > include/linux/mm.h               |   6 +-
> > include/linux/mm_types.h         |   4 +
> > include/linux/mvolatile.h        |  30 ++++
> > include/linux/syscalls.h         |   2 +
> > mm/Kconfig                       |  11 ++
> > mm/Makefile                      |   2 +-
> > mm/madvise.c                     |   2 +-
> > mm/mempolicy.c                   |   2 +-
> > mm/mlock.c                       |   7 +-
> > mm/mmap.c                        |  62 ++++++--
> > mm/mprotect.c                    |   3 +-
> > mm/mremap.c                      |   2 +-
> > mm/mvolatile.c                   | 312 ++++++++++++++++++++++++++++++++++++++
> > mm/rmap.c                        |   2 +
> > 16 files changed, 427 insertions(+), 26 deletions(-)
> > create mode 100644 include/linux/mvolatile.h
> > create mode 100644 mm/mvolatile.c
> >
> >diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> >index a582bfe..568d488 100644
> >--- a/arch/x86/syscalls/syscall_64.tbl
> >+++ b/arch/x86/syscalls/syscall_64.tbl
> >@@ -319,6 +319,8 @@
> > 310	64	process_vm_readv	sys_process_vm_readv
> > 311	64	process_vm_writev	sys_process_vm_writev
> > 312	common	kcmp			sys_kcmp
> >+313	common	mvolatile		sys_mvolatile
> >+314	common	mnovolatile		sys_mnovolatile
> > #
> > # x32-specific system call numbers start at 512 to avoid cache impact
> >diff --git a/fs/exec.c b/fs/exec.c
> >index 0039055..da677d1 100644
> >--- a/fs/exec.c
> >+++ b/fs/exec.c
> >@@ -594,7 +594,7 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
> > 	/*
> > 	 * cover the whole range: [new_start, old_end)
> > 	 */
> >-	if (vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL))
> >+	if (vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL, NULL))
> > 		return -ENOMEM;
> > 	/*
> >@@ -628,7 +628,7 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
> > 	/*
> > 	 * Shrink the vma to just the new range.  Always succeeds.
> > 	 */
> >-	vma_adjust(vma, new_start, new_end, vma->vm_pgoff, NULL);
> >+	vma_adjust(vma, new_start, new_end, vma->vm_pgoff, NULL, NULL);
> > 	return 0;
> > }
> >diff --git a/include/linux/mm.h b/include/linux/mm.h
> >index bcaab4e..4bb59f3 100644
> >--- a/include/linux/mm.h
> >+++ b/include/linux/mm.h
> >@@ -87,6 +87,7 @@ extern unsigned int kobjsize(const void *objp);
> > #define VM_PFNMAP	0x00000400	/* Page-ranges managed without "struct page", just pure PFN */
> > #define VM_DENYWRITE	0x00000800	/* ETXTBSY on write attempts.. */
> >+#define VM_VOLATILE	0x00001000	/* Pages could be discarded without swapout */
> > #define VM_LOCKED	0x00002000
> > #define VM_IO		0x00004000	/* Memory mapped I/O or similar */
> >@@ -1411,11 +1412,12 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node);
> > /* mmap.c */
> > extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin);
> > extern int vma_adjust(struct vm_area_struct *vma, unsigned long start,
> >-	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert);
> >+	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
> >+	bool *purged);
> > extern struct vm_area_struct *vma_merge(struct mm_struct *,
> > 	struct vm_area_struct *prev, unsigned long addr, unsigned long end,
> > 	unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
> >-	struct mempolicy *);
> >+	struct mempolicy *, bool *purged);
> > extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
> > extern int split_vma(struct mm_struct *,
> > 	struct vm_area_struct *, unsigned long addr, int new_below);
> >diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> >index 31f8a3a..1eaf458 100644
> >--- a/include/linux/mm_types.h
> >+++ b/include/linux/mm_types.h
> >@@ -275,6 +275,10 @@ struct vm_area_struct {
> > #ifdef CONFIG_NUMA
> > 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
> > #endif
> >+#ifdef CONFIG_VOLATILE_PAGE
> >+	/* True if one or more pages of this vma have been reclaimed.
> >+	 */
> >+	bool purged;	/* Serialized by mmap_sem and anon_vma's mutex */
> >+#endif
> > };
> > struct core_thread {
> >diff --git a/include/linux/mvolatile.h b/include/linux/mvolatile.h
> >new file mode 100644
> >index 0000000..cfb12b4
> >--- /dev/null
> >+++ b/include/linux/mvolatile.h
> >@@ -0,0 +1,30 @@
> >+#ifndef __LINUX_MVOLATILE_H
> >+#define __LINUX_MVOLATILE_H
> >+
> >+#include <linux/syscalls.h>
> >+
> >+#ifdef CONFIG_VOLATILE_PAGE
> >+static inline bool vma_purged(struct vm_area_struct *vma)
> >+{
> >+	return vma->purged;
> >+}
> >+
> >+static inline void vma_purge_copy(struct vm_area_struct *dst,
> >+				  struct vm_area_struct *src)
> >+{
> >+	dst->purged = src->purged;
> >+}
> >+#else
> >+static inline bool vma_purged(struct vm_area_struct *vma)
> >+{
> >+	return false;
> >+}
> >+
> >+static inline void vma_purge_copy(struct vm_area_struct *dst,
> >+				  struct vm_area_struct *src)
> >+{
> >+}
> >+#endif
> >+#endif
> >+
> >diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> >index 727f0cd..a8ded1c 100644
> >--- a/include/linux/syscalls.h
> >+++ b/include/linux/syscalls.h
> >@@ -470,6 +470,8 @@ asmlinkage long sys_munlock(unsigned long start, size_t len);
> > asmlinkage long sys_mlockall(int flags);
> > asmlinkage long sys_munlockall(void);
> > asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
> >+asmlinkage long sys_mvolatile(unsigned long start, size_t len);
> >+asmlinkage long sys_mnovolatile(unsigned long start, size_t len);
> > asmlinkage long sys_mincore(unsigned long start, size_t len,
> > 				unsigned char __user * vec);
> >diff --git a/mm/Kconfig b/mm/Kconfig
> >index a3f8ddd..30b24ba 100644
> >--- a/mm/Kconfig
> >+++ b/mm/Kconfig
> >@@ -355,6 +355,17 @@ choice
> > 	  benefit.
> > endchoice
> >+config VOLATILE_PAGE
> >+	bool "Volatile Page Support"
> >+	depends on MMU
> >+	help
> >+	  Enabling this option adds the system calls mvolatile and
> >+	  mnovolatile, which let a process hand ranges of its address
> >+	  space to the kernel so the VM can discard their pages at any
> >+	  time instead of swapping them out. This can improve the
> >+	  performance of certain applications (e.g. memory allocators,
> >+	  a web browser's tmpfs pages) by reducing the number of minor
> >+	  faults and swap-outs.
> >+
> > config CROSS_MEMORY_ATTACH
> > 	bool "Cross Memory Support"
> > 	depends on MMU
> >diff --git a/mm/Makefile b/mm/Makefile
> >index 6b025f8..1efb735 100644
> >--- a/mm/Makefile
> >+++ b/mm/Makefile
> >@@ -5,7 +5,7 @@
> > mmu-y			:= nommu.o
> > mmu-$(CONFIG_MMU)	:= fremap.o highmem.o madvise.o memory.o mincore.o \
> > 			   mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
> >-			   vmalloc.o pagewalk.o pgtable-generic.o
> >+			   mvolatile.o vmalloc.o pagewalk.o pgtable-generic.o
> > ifdef CONFIG_CROSS_MEMORY_ATTACH
> > mmu-$(CONFIG_MMU)	+= process_vm_access.o
> >diff --git a/mm/madvise.c b/mm/madvise.c
> >index 03dfa5c..6ffad21 100644
> >--- a/mm/madvise.c
> >+++ b/mm/madvise.c
> >@@ -99,7 +99,7 @@ static long madvise_behavior(struct vm_area_struct * vma,
> > 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> > 	*prev = vma_merge(mm, *prev, start, end, new_flags, vma->anon_vma,
> >-				vma->vm_file, pgoff, vma_policy(vma));
> >+				vma->vm_file, pgoff, vma_policy(vma), NULL);
> > 	if (*prev) {
> > 		vma = *prev;
> > 		goto success;
> >diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> >index 4ea600d..9b1aa2d 100644
> >--- a/mm/mempolicy.c
> >+++ b/mm/mempolicy.c
> >@@ -675,7 +675,7 @@ static int mbind_range(struct mm_struct *mm, unsigned long start,
> > 			((vmstart - vma->vm_start) >> PAGE_SHIFT);
> > 		prev = vma_merge(mm, prev, vmstart, vmend, vma->vm_flags,
> > 				  vma->anon_vma, vma->vm_file, pgoff,
> >-				  new_pol);
> >+				  new_pol, NULL);
> > 		if (prev) {
> > 			vma = prev;
> > 			next = vma->vm_next;
> >diff --git a/mm/mlock.c b/mm/mlock.c
> >index f0b9ce5..e03523a 100644
> >--- a/mm/mlock.c
> >+++ b/mm/mlock.c
> >@@ -316,13 +316,14 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > 	int ret = 0;
> > 	int lock = !!(newflags & VM_LOCKED);
> >-	if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
> >-	    is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm))
> >+	if (newflags == vma->vm_flags || (vma->vm_flags &
> >+	    (VM_SPECIAL|VM_VOLATILE)) || is_vm_hugetlb_page(vma) ||
> >+	    vma == get_gate_vma(current->mm))
> > 		goto out;	/* don't set VM_LOCKED,  don't count */
> > 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> > 	*prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
> >-			  vma->vm_file, pgoff, vma_policy(vma));
> >+			  vma->vm_file, pgoff, vma_policy(vma), NULL);
> > 	if (*prev) {
> > 		vma = *prev;
> > 		goto success;
> >diff --git a/mm/mmap.c b/mm/mmap.c
> >index 9a796c4..ba636c3 100644
> >--- a/mm/mmap.c
> >+++ b/mm/mmap.c
> >@@ -31,6 +31,7 @@
> > #include <linux/audit.h>
> > #include <linux/khugepaged.h>
> > #include <linux/uprobes.h>
> >+#include <linux/mvolatile.h>
> > #include <asm/uaccess.h>
> > #include <asm/cacheflush.h>
> >@@ -516,7 +517,8 @@ __vma_unlink(struct mm_struct *mm, struct vm_area_struct *vma,
> >  * before we drop the necessary locks.
> >  */
> > int vma_adjust(struct vm_area_struct *vma, unsigned long start,
> >-	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert)
> >+	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
> >+	bool *purged)
> > {
> > 	struct mm_struct *mm = vma->vm_mm;
> > 	struct vm_area_struct *next = vma->vm_next;
> >@@ -527,10 +529,9 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start,
> > 	struct file *file = vma->vm_file;
> > 	long adjust_next = 0;
> > 	int remove_next = 0;
> >+	struct vm_area_struct *exporter = NULL;
> > 	if (next && !insert) {
> >-		struct vm_area_struct *exporter = NULL;
> >-
> > 		if (end >= next->vm_end) {
> > 			/*
> > 			 * vma expands, overlapping all the next, and
> >@@ -621,6 +622,15 @@ again:			remove_next = 1 + (end > next->vm_end);
> > 	if (adjust_next) {
> > 		next->vm_start += adjust_next << PAGE_SHIFT;
> > 		next->vm_pgoff += adjust_next;
> >+		/*
> >+		 * See mm/mvolatile.c for the terminology.
> >+		 * V4. NNPPVV -> NNNPVV
> >+		 */
> >+		if (purged) {
> >+			*purged = vma_purged(next);
> >+			if (exporter == vma) /* V5. VVPPNN -> VVPNNN */
> >+				*purged = vma_purged(vma);
> >+		}
> > 	}
> > 	if (root) {
> >@@ -651,6 +661,13 @@ again:			remove_next = 1 + (end > next->vm_end);
> > 		anon_vma_interval_tree_post_update_vma(vma);
> > 		if (adjust_next)
> > 			anon_vma_interval_tree_post_update_vma(next);
> >+		/*
> >+		 * See mm/mvolatile.c for the terminology.
> >+		 * V7. VVPPVV -> VVNPVV
> >+		 * V8. VVPPVV -> VVPNVV
> >+		 */
> >+		if (insert)
> >+			vma_purge_copy(insert, vma);
> > 		anon_vma_unlock(anon_vma);
> > 	}
> > 	if (mapping)
> >@@ -670,6 +687,20 @@ again:			remove_next = 1 + (end > next->vm_end);
> > 	}
> > 	if (next->anon_vma)
> > 		anon_vma_merge(vma, next);
> >+
> >+	/*
> >+	 * next has been detached from the anon vma chain, so its purged
> >+	 * state can't be raced with any more.
> >+	 * See mm/mvolatile.c for the terminology.
> >+	 *
> >+	 * V1. NNPPVV -> NNNNVV
> >+	 * V2. VVPPNN -> VVNNNN
> >+	 * V3. NNPPNN -> NNNNNN
> >+	 */
> >+	if (purged) {
> >+		*purged |= vma_purged(vma);	/* case V2 */
> >+		*purged |= vma_purged(next);	/* case V1,V3 */
> >+	}
> > 	mm->map_count--;
> > 	mpol_put(vma_policy(next));
> > 	kmem_cache_free(vm_area_cachep, next);
> >@@ -798,7 +829,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
> > 			struct vm_area_struct *prev, unsigned long addr,
> > 			unsigned long end, unsigned long vm_flags,
> > 			struct anon_vma *anon_vma, struct file *file,
> >-			pgoff_t pgoff, struct mempolicy *policy)
> >+			pgoff_t pgoff, struct mempolicy *policy, bool *purged)
> > {
> > 	pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
> > 	struct vm_area_struct *area, *next;
> >@@ -808,7 +839,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
> > 	 * We later require that vma->vm_flags == vm_flags,
> > 	 * so this tests vma->vm_flags & VM_SPECIAL, too.
> > 	 */
> >-	if (vm_flags & VM_SPECIAL)
> >+	if (vm_flags & (VM_SPECIAL|VM_VOLATILE))
> > 		return NULL;
> > 	if (prev)
> >@@ -837,10 +868,10 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
> > 				next->anon_vma, NULL)) {
> > 			/* cases 1, 6 */
> > 			err = vma_adjust(prev, prev->vm_start,
> >-				next->vm_end, prev->vm_pgoff, NULL);
> >+				next->vm_end, prev->vm_pgoff, NULL, purged);
> > 		} else			/* cases 2, 5, 7 */
> > 			err = vma_adjust(prev, prev->vm_start,
> >-				end, prev->vm_pgoff, NULL);
> >+				end, prev->vm_pgoff, NULL, purged);
> > 		if (err)
> > 			return NULL;
> > 		khugepaged_enter_vma_merge(prev);
> >@@ -856,10 +887,10 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
> > 					anon_vma, file, pgoff+pglen)) {
> > 		if (prev && addr < prev->vm_end)	/* case 4 */
> > 			err = vma_adjust(prev, prev->vm_start,
> >-				addr, prev->vm_pgoff, NULL);
> >+				addr, prev->vm_pgoff, NULL, purged);
> > 		else					/* cases 3, 8 */
> > 			err = vma_adjust(area, addr, next->vm_end,
> >-				next->vm_pgoff - pglen, NULL);
> >+				next->vm_pgoff - pglen, NULL, purged);
> > 		if (err)
> > 			return NULL;
> > 		khugepaged_enter_vma_merge(area);
> >@@ -1292,7 +1323,8 @@ munmap_back:
> > 	/*
> > 	 * Can we just expand an old mapping?
> > 	 */
> >-	vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff, NULL);
> >+	vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file,
> >+			pgoff, NULL, NULL);
> > 	if (vma)
> > 		goto out;
> >@@ -2025,9 +2057,10 @@ static int __split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
> > 	if (new_below)
> > 		err = vma_adjust(vma, addr, vma->vm_end, vma->vm_pgoff +
> >-			((addr - new->vm_start) >> PAGE_SHIFT), new);
> >+			((addr - new->vm_start) >> PAGE_SHIFT), new, NULL);
> > 	else
> >-		err = vma_adjust(vma, vma->vm_start, addr, vma->vm_pgoff, new);
> >+		err = vma_adjust(vma, vma->vm_start, addr, vma->vm_pgoff,
> >+				 new, NULL);
> > 	/* Success. */
> > 	if (!err)
> >@@ -2240,7 +2273,7 @@ static unsigned long do_brk(unsigned long addr, unsigned long len)
> > 	/* Can we just expand an old private anonymous mapping?
> > 	 */
> > 	vma = vma_merge(mm, prev, addr, addr + len, flags,
> >-					NULL, NULL, pgoff, NULL);
> >+					NULL, NULL, pgoff, NULL, NULL);
> > 	if (vma)
> > 		goto out;
> >@@ -2396,7 +2429,8 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
> > 	if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent))
> > 		return NULL;	/* should never get here */
> > 	new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
> >-			vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma));
> >+			vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
> >+			NULL);
> > 	if (new_vma) {
> > 		/*
> > 		 * Source vma may have been merged into new_vma
> >diff --git a/mm/mprotect.c b/mm/mprotect.c
> >index a409926..f461177 100644
> >--- a/mm/mprotect.c
> >+++ b/mm/mprotect.c
> >@@ -179,7 +179,8 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
> > 	 */
> > 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> > 	*pprev = vma_merge(mm, *pprev, start, end, newflags,
> >-			vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma));
> >+			vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
> >+			NULL);
> > 	if (*pprev) {
> > 		vma = *pprev;
> > 		goto success;
> >diff --git a/mm/mremap.c b/mm/mremap.c
> >index 1b61c2d..8586c52 100644
> >--- a/mm/mremap.c
> >+++ b/mm/mremap.c
> >@@ -512,7 +512,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
> > 			int pages = (new_len - old_len) >> PAGE_SHIFT;
> > 			if (vma_adjust(vma, vma->vm_start, addr + new_len,
> >-				       vma->vm_pgoff, NULL)) {
> >+				       vma->vm_pgoff, NULL, NULL)) {
> > 				ret = -ENOMEM;
> > 				goto out;
> > 			}
> >diff --git a/mm/mvolatile.c b/mm/mvolatile.c
> >new file mode 100644
> >index 0000000..8b812d2
> >--- /dev/null
> >+++ b/mm/mvolatile.c
> >@@ -0,0 +1,312 @@
> >+/*
> >+ *	linux/mm/mvolatile.c
> >+ *
> >+ * Copyright 2012 Minchan Kim
> >+ *
> >+ * This work is licensed under the terms of the GNU GPL, version 2. See
> >+ * the COPYING file in the top-level directory.
> >+ */
> >+
> >+#include <linux/mvolatile.h>
> >+#include <linux/mm_types.h>
> >+#include <linux/mm.h>
> >+#include <linux/rmap.h>
> >+#include <linux/mempolicy.h>
> >+
> >+#ifndef CONFIG_VOLATILE_PAGE
> >+SYSCALL_DEFINE2(mnovolatile, unsigned long, start, size_t, len)
> >+{
> >+	return -EINVAL;
> >+}
> >+
> >+SYSCALL_DEFINE2(mvolatile, unsigned long, start, size_t, len)
> >+{
> >+	return -EINVAL;
> >+}
> >+#else
> >+
> >+#define NO_PURGED	0
> >+#define PURGED		1
> >+
> >+/*
> >+ * N: Normal VMA
> >+ * V: Volatile VMA
> >+ * P: Purged volatile VMA
> >+ *
> >+ * Assume that each VMA has two blocks, so cases 1-8 consist of three
> >+ * VMAs. For example, NNPPVV means VMA1 is a normal VMA, VMA2 is a purged
> >+ * volatile VMA, and VMA3 is a volatile VMA. As another example, NNPVVV
> >+ * means VMA1 is a normal VMA, VMA2-1 is a purged volatile VMA, and
> >+ * VMA2-2 is a volatile VMA.
> >+ *
> >+ * Cases 7 and 8 create a new VMA, which we call VMA4; it can be located
> >+ * before or after VMA2.
> >+ *
> >+ * Notice: a merge between volatile VMAs shouldn't happen.
> >+ * If we call mnovolatile(VMA2),
> >+ *
> >+ * Case 1 NNPPVV -> NNNNVV
> >+ * Case 2 VVPPNN -> VVNNNN
> >+ * Case 3 NNPPNN -> NNNNNN
> >+ * Case 4 NNPPVV -> NNNPVV
> >+ * Case 5 VVPPNN -> VVPNNN
> >+ * Case 6 VVPPVV -> VVNNVV
> >+ * Case 7 VVPPVV -> VVNPVV
> >+ * Case 8 VVPPVV -> VVPNVV
> >+ */
> >+static int do_mnovolatile(struct vm_area_struct *vma,
> >+		struct vm_area_struct **prev, unsigned long start,
> >+		unsigned long end, bool *is_purged)
> >+{
> >+	unsigned long new_flags;
> >+	int error = 0;
> >+	struct mm_struct *mm = vma->vm_mm;
> >+	pgoff_t pgoff;
> >+	bool purged = false;
> >+
> >+	new_flags = vma->vm_flags & ~VM_VOLATILE;
> >+	if (new_flags == vma->vm_flags) {
> >+		*prev = vma;
> >+		goto success;
> >+	}
> >+
> >+	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> >+	*prev = vma_merge(mm, *prev, start, end, new_flags, vma->anon_vma,
> >+			vma->vm_file, pgoff, vma_policy(vma), &purged);
> >+	if (*prev) {
> >+		vma = *prev;
> >+		goto success;
> >+	}
> >+
> >+	*prev = vma;
> >+
> >+	if (start != vma->vm_start) {
> >+		error = split_vma(mm, vma, start, 1);
> >+		if (error)
> >+			goto out;
> >+	}
> >+
> >+	if (end != vma->vm_end) {
> >+		error = split_vma(mm, vma, end, 0);
> >+		if (error)
> >+			goto out;
> >+	}
> >+
> >+success:
> >+	/* V6. VVPPVV -> VVNNVV */
> >+	vma_lock_anon_vma(vma);
> >+	*is_purged |= (vma->purged|purged);
> >+	vma_unlock_anon_vma(vma);
> >+
> >+	vma->vm_flags = new_flags;
> >+	vma->purged = false;
> >+	return 0;
> >+out:
> >+	return error;
> >+}
> >+
> >+/* I didn't look into KSM/hugepage, so they are disabled. */
> >+#define VM_NO_VOLATILE	(VM_SPECIAL|VM_MIXEDMAP|VM_HUGETLB|\
> >+			VM_MERGEABLE|VM_HUGEPAGE|VM_LOCKED)
> >+
> >+static int do_mvolatile(struct vm_area_struct *vma,
> >+	struct vm_area_struct **prev, unsigned long start, unsigned long end)
> >+{
> >+	int error = -EINVAL;
> >+	vm_flags_t new_flags = vma->vm_flags;
> >+	struct mm_struct *mm = vma->vm_mm;
> >+
> >+	new_flags |= VM_VOLATILE;
> >+
> >+	/* Note: the current version doesn't support file vma volatile. */
> >+	if (vma->vm_file) {
> >+		*prev = vma;
> >+		goto out;
> >+	}
> >+
> >+	if (vma->vm_flags & VM_NO_VOLATILE ||
> >+			(vma == get_gate_vma(current->mm))) {
> >+		*prev = vma;
> >+		goto out;
> >+	}
> >+	/*
> >+	 * In case of calling mvolatile again,
> >+	 * we just reset the purged state.
> >+	 */
> >+	if (new_flags == vma->vm_flags) {
> >+		*prev = vma;
> >+		vma_lock_anon_vma(vma);
> >+		vma->purged = false;
> >+		vma_unlock_anon_vma(vma);
> >+		error = 0;
> >+		goto out;
> >+	}
> >+
> >+	*prev = vma;
> >+
> >+	if (start != vma->vm_start) {
> >+		error = split_vma(mm, vma, start, 1);
> >+		if (error)
> >+			goto out;
> >+	}
> >+
> >+	if (end != vma->vm_end) {
> >+		error = split_vma(mm, vma, end, 0);
> >+		if (error)
> >+			goto out;
> >+	}
> >+
> >+	error = 0;
> >+
> >+	vma_lock_anon_vma(vma);
> >+	vma->vm_flags = new_flags;
> >+	vma_unlock_anon_vma(vma);
> >+out:
> >+	return error;
> >+}
> >+
> >+/*
> >+ * Returns -EINVAL if the range doesn't include a valid vma at all.
> >+ * Returns -ENOMEM, interrupting the range operation, if memory is not
> >+ * enough to merge/split vmas.
> >+ * Returns 0 if the range consists of only proper vmas.
> >+ * Returns 1 if part of the range includes an invalid area (e.g. a
> >+ * hole/huge/ksm/mlock/special area).
> >+ */
> >+SYSCALL_DEFINE2(mvolatile, unsigned long, start, size_t, len)
> >+{
> >+	unsigned long end, tmp;
> >+	struct vm_area_struct *vma, *prev;
> >+	bool invalid = false;
> >+	int error = -EINVAL;
> >+
> >+	down_write(&current->mm->mmap_sem);
> >+	if (start & ~PAGE_MASK)
> >+		goto out;
> >+
> >+	len &= PAGE_MASK;
> >+	if (!len)
> >+		goto out;
> >+
> >+	end = start + len;
> >+	if (end < start)
> >+		goto out;
> >+
> >+	vma = find_vma_prev(current->mm, start, &prev);
> >+	if (!vma)
> >+		goto out;
> >+
> >+	if (start > vma->vm_start)
> >+		prev = vma;
> >+
> >+	for (;;) {
> >+		/* Here start < (end|vma->vm_end). */
> >+		if (start < vma->vm_start) {
> >+			start = vma->vm_start;
> >+			if (start >= end)
> >+				goto out;
> >+			invalid = true;
> >+		}
> >+
> >+		/* Here vma->vm_start <= start < (end|vma->vm_end) */
> >+		tmp = vma->vm_end;
> >+		if (end < tmp)
> >+			tmp = end;
> >+
> >+		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
> >+		error = do_mvolatile(vma, &prev, start, tmp);
> >+		if (error == -ENOMEM) {
> >+			up_write(&current->mm->mmap_sem);
> >+			return error;
> >+		}
> >+		if (error == -EINVAL)
> >+			invalid = true;
> >+		else
> >+			error = 0;
> >+		start = tmp;
> >+		if (prev && start < prev->vm_end)
> >+			start = prev->vm_end;
> >+		if (start >= end)
> >+			break;
> >+
> >+		vma = prev->vm_next;
> >+		if (!vma)
> >+			break;
> >+	}
> >+out:
> >+	up_write(&current->mm->mmap_sem);
> >+	/*
> >+	 * Report validation failures (e.g. a misaligned start or empty
> >+	 * range) instead of silently returning success.
> >+	 */
> >+	if (error && !invalid)
> >+		return error;
> >+	return invalid ? 1 : 0;
> >+}
> >+
> >+/*
> >+ * Returns -ENOMEM, interrupting the range operation, if memory is not
> >+ * enough to merge/split vmas.
> >+ * Returns 1 if part of the range was purged; otherwise returns 0.
> >+ */
> >+SYSCALL_DEFINE2(mnovolatile, unsigned long, start, size_t, len)
> >+{
> >+	unsigned long end, tmp;
> >+	struct vm_area_struct *vma, *prev;
> >+	int ret, error = -EINVAL;
> >+	bool is_purged = false;
> >+
> >+	down_write(&current->mm->mmap_sem);
> >+	if (start & ~PAGE_MASK)
> >+		goto out;
> >+
> >+	len &= PAGE_MASK;
> >+	if (!len)
> >+		goto out;
> >+
> >+	end = start + len;
> >+	if (end < start)
> >+		goto out;
> >+
> >+	vma = find_vma_prev(current->mm, start, &prev);
> >+	if (!vma)
> >+		goto out;
> >+
> >+	if (start > vma->vm_start)
> >+		prev = vma;
> >+
> >+	for (;;) {
> >+		/* Here start < (end|vma->vm_end). */
> >+		if (start < vma->vm_start) {
> >+			start = vma->vm_start;
> >+			if (start >= end)
> >+				goto out;
> >+		}
> >+
> >+		/* Here vma->vm_start <= start < (end|vma->vm_end) */
> >+		tmp = vma->vm_end;
> >+		if (end < tmp)
> >+			tmp = end;
> >+
> >+		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
> >+		error = do_mnovolatile(vma, &prev, start, tmp, &is_purged);
> >+		if (error) {
> >+			WARN_ON(error != -ENOMEM);
> >+			goto out;
> >+		}
> >+		start = tmp;
> >+		if (prev && start < prev->vm_end)
> >+			start = prev->vm_end;
> >+		if (start >= end)
> >+			break;
> >+
> >+		vma = prev->vm_next;
> >+		if (!vma)
> >+			break;
> >+	}
> >+out:
> >+	up_write(&current->mm->mmap_sem);
> >+
> >+	if (error)
> >+		ret = error;
> >+	else if (is_purged)
> >+		ret = PURGED;
> >+	else
> >+		ret = NO_PURGED;
> >+
> >+	return ret;
> >+}
> >+#endif
> >diff --git a/mm/rmap.c b/mm/rmap.c
> >index 2ee1ef0..402d9da 100644
> >--- a/mm/rmap.c
> >+++ b/mm/rmap.c
> >@@ -57,6 +57,7 @@
> > #include <linux/migrate.h>
> > #include <linux/hugetlb.h>
> > #include <linux/backing-dev.h>
> >+#include <linux/mvolatile.h>
> > #include <asm/tlbflush.h>
> >@@ -308,6 +309,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
> > 	vma->anon_vma = anon_vma;
> > 	anon_vma_lock(anon_vma);
> > 	anon_vma_chain_link(vma, avc, anon_vma);
> >+	vma_purge_copy(vma, pvma);
> > 	anon_vma_unlock(anon_vma);
> > 	return 0;

--
Kind regards,
Minchan Kim