On Wed, Dec 12, 2012 at 02:43:49PM +0800, Wanpeng Li wrote:
>On Tue, Dec 11, 2012 at 11:41:04AM +0900, Minchan Kim wrote:
>>Sorry, resending with the compile error fixed. :(
>>
>>From 0cfd3b65e4e90ab59abe8a337334414f92423cad Mon Sep 17 00:00:00 2001
>>From: Minchan Kim <minchan@xxxxxxxxxx>
>>Date: Tue, 11 Dec 2012 11:38:30 +0900
>>Subject: [RFC v3] Support volatile range for anon vma
>>
>>This is still [RFC v3] because it has only passed my simple test
>>with TCMalloc tweaking.
>>
>>I hope for more input from user-space allocator people, and for them
>>to test the patch with their allocators, because it might need a
>>redesign of arena management to get real value.
>>
>>Changelog from v2
>>
>> * Remove madvise(addr, length, MADV_NOVOLATILE).
>> * Add a vmstat counter for the number of discarded volatile pages.
>> * Discard volatile pages without promotion in the reclaim path.
>>
>>This is based on v3.6.
>>
>>- What's madvise(addr, length, MADV_VOLATILE)?
>>
>>  It's a hint the user delivers to the kernel so the kernel can
>>  *discard* pages in the range at any time.
>>
>>- What happens if the user accesses a page (i.e., virtual address)
>>  that the kernel has discarded?
>>
>>  The user sees zero-fill-on-demand pages, as with madvise(DONTNEED).
>>
>>- What happens if the user accesses a page (i.e., virtual address)
>>  that the kernel has not discarded?
>>
>>  The user sees the old data, without a page fault.
>>
>>- What's different from madvise(DONTNEED)?
>>
>>  System call semantics
>>
>>  DONTNEED guarantees that the user always sees zero-fill pages after
>>  the madvise call, while with VOLATILE the user may see either
>>  zero-fill pages or the old data.
>>
>>  Internal implementation
>>
>>  madvise(DONTNEED) has to zap all mapped pages in the range, so its
>>  overhead increases linearly with the number of mapped pages. Worse,
>>  if the user later writes to a zapped page, a page fault, a page
>>  allocation and a memset all happen.
>>
>>  madvise(VOLATILE) just marks a flag on the range (i.e., the VMA).
>>  It doesn't touch the pages at all, so the overhead of the system
>>  call is very small. When memory pressure happens, the VM can
>>  discard pages in VMAs marked VOLATILE. If the user writes to an
>>  address whose page the VM has discarded, he sees a zero-fill page,
>>  so the cost is the same as DONTNEED; but if memory pressure wasn't
>>  severe, the user sees the old data without (page fault + page
>>  allocation + memset).
>>
>>  The VOLATILE mark is removed in the page fault handler when the
>>  first page fault occurs in a marked vma, so subsequent page faults
>>  follow the normal page fault path. That's why the user doesn't
>>  need a madvise(MADV_NOVOLATILE) interface.
>>
>>- What's the benefit compared to DONTNEED?
>>
>>  1. The system call overhead is smaller, because VOLATILE just
>>     marks a flag on the VMA instead of zapping all the pages in
>>     the range.
>>
>>  2. It has a chance to eliminate overheads (ex, page fault +
>>     page allocation + memset(PAGE_SIZE)).
>>
>>- Isn't there any drawback?
>>
>>  DONTNEED doesn't need exclusive mmap_sem locking, so concurrent
>>  page faults from other threads are allowed. But VOLATILE needs
>>  exclusive mmap_sem, so other threads are blocked if they try to
>>  access not-yet-mapped pages. That's why I designed
>>  madvise(VOLATILE)'s overhead to be as small as possible.
>>
>>  The other concern with exclusive mmap_sem is when a page fault
>>  occurs in a VOLATILE-marked vma. We have to remove the flag from
>>  the vma and merge adjacent vmas, which needs exclusive mmap_sem.
>>  That can slow down page fault handling and prevent concurrent page
>>  faults. But we need such handling only once, at the first page
>>  fault after a VMA is marked VOLATILE, and typically only if memory
>>  pressure actually discarded a page. So it isn't the common case,
>>  and the benefit we get from this feature should outweigh the loss.
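
To make the intended flow concrete, here is a minimal user-space sketch
(illustrative only, not part of the patch; MADV_VOLATILE's value is
taken from the mman-common.h hunk below, and the call simply fails with
EINVAL on unpatched kernels):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_VOLATILE
#define MADV_VOLATILE	5	/* from this patch's mman-common.h */
#endif

int main(void)
{
	size_t len = 64 * 4096;	/* a chunk an allocator just freed */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	memset(p, 0xaa, len);	/* chunk was in use, then freed */

	/* Cheap hint: the kernel may now discard these pages. */
	madvise(p, len, MADV_VOLATILE);

	/*
	 * On reuse, just touch the memory. The first fault clears
	 * VM_VOLATILE for the vma. A read may return the old 0xaa
	 * (nothing was discarded) or 0 (the page was discarded).
	 */
	p[0] = 0x55;
	printf("%#x\n", (unsigned char)p[4096]);
	return 0;
}
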
>>- What's the target?
>>
>>  Firstly, user-space allocators like ptmalloc and tcmalloc, and the
>>  heap management of virtual machines like Dalvik. It also comes in
>>  handy for embedded systems that have no swap device and therefore
>>  can't reclaim anonymous pages. By discarding instead of swapping,
>>  it can be used on non-swap systems. For that, we have to age the
>>  anon lru list even without swap, because I don't want to discard
>>  volatile pages with top priority when memory pressure happens:
>>  volatile in this patch means "we don't need to swap this out,
>>  because the user can handle data disappearing suddenly", NOT "this
>>  is useless, so hurry up and reclaim it". So I want to apply the
>>  same aging rule as for normal pages.
>>
>>  Background aging of anonymous pages on a non-swap system is the
>>  trade-off for getting this feature. We did exactly that until [1]
>>  was merged two years ago, and I believe the gain from this patch
>>  will beat the cost of anon lru aging once allocators start to use
>>  the madvise hint. (This patch doesn't include background aging for
>>  the non-swap case, but it's trivial to add if we decide to.)
>>
>>[1] 74e3f3c3, vmscan: prevent background aging of anon page in no swap system
>>
>>Cc: Michael Kerrisk <mtk.manpages@xxxxxxxxx>
>>Cc: Arun Sharma <asharma@xxxxxx>
>>Cc: sanjay@xxxxxxxxxx
>>Cc: Paul Turner <pjt@xxxxxxxxxx>
>>CC: David Rientjes <rientjes@xxxxxxxxxx>
>>Cc: John Stultz <john.stultz@xxxxxxxxxx>
>>Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
>>Cc: Christoph Lameter <cl@xxxxxxxxx>
>>Cc: Android Kernel Team <kernel-team@xxxxxxxxxxx>
>>Cc: Robert Love <rlove@xxxxxxxxxx>
>>Cc: Mel Gorman <mel@xxxxxxxxx>
>>Cc: Hugh Dickins <hughd@xxxxxxxxxx>
>>Cc: Dave Hansen <dave@xxxxxxxxxxxxxxxxxx>
>>Cc: Rik van Riel <riel@xxxxxxxxxx>
>>Cc: Dave Chinner <david@xxxxxxxxxxxxx>
>>Cc: Neil Brown <neilb@xxxxxxx>
>>Cc: Mike Hommey <mh@xxxxxxxxxxxx>
>>Cc: Taras Glek <tglek@xxxxxxxxxxx>
>>Cc: KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxx>
>>Cc: Christoph Lameter <cl@xxxxxxxxx>
>>Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
>>Signed-off-by: Minchan Kim <minchan@xxxxxxxxxx>
>>---
>> arch/x86/mm/fault.c               |   2 +
>> include/asm-generic/mman-common.h |   6 ++
>> include/linux/mm.h                |   7 ++-
>> include/linux/rmap.h              |  20 ++++++
>> include/linux/vm_event_item.h     |   2 +-
>> mm/madvise.c                      |  19 +++++-
>> mm/memory.c                       |  32 ++++++++++
>> mm/migrate.c                      |   6 +-
>> mm/rmap.c                         | 125 ++++++++++++++++++++++++++++++++++++-
>> mm/vmscan.c                       |   7 +++
>> mm/vmstat.c                       |   1 +
>> 11 files changed, 218 insertions(+), 9 deletions(-)
>>
>>diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
>>index 76dcd9d..17c1c20 100644
>>--- a/arch/x86/mm/fault.c
>>+++ b/arch/x86/mm/fault.c
>>@@ -879,6 +879,8 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
>> 		}
>>
>> 		out_of_memory(regs, error_code, address);
>>+	} else if (fault & VM_FAULT_SIGSEG) {
>>+		bad_area(regs, error_code, address);
>> 	} else {
>> 		if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|
>> 			     VM_FAULT_HWPOISON_LARGE))
>>diff --git a/include/asm-generic/mman-common.h b/include/asm-generic/mman-common.h
>>index d030d2c..f07781e 100644
>>--- a/include/asm-generic/mman-common.h
>>+++ b/include/asm-generic/mman-common.h
>>@@ -34,6 +34,12 @@
>> #define MADV_SEQUENTIAL	2	/* expect sequential page references */
>> #define MADV_WILLNEED	3	/* will need these pages */
>> #define MADV_DONTNEED	4	/* don't need these pages */
>>+/*
>>+ * Unlike other flags, we need two locks to protect MADV_VOLATILE.
>>+ * For changing the flag, we need mmap_sem's write lock and volatile_lock,
>>+ * while we just need volatile_lock in case of reading the flag.
>>+ */
>>+#define MADV_VOLATILE	5	/* pages will disappear suddenly */
>>
>> /* common parameters: try to keep these consistent across architectures */
>> #define MADV_REMOVE	9	/* remove these pages & resources */
>>diff --git a/include/linux/mm.h b/include/linux/mm.h
>>index 311be90..89027b5 100644
>>--- a/include/linux/mm.h
>>+++ b/include/linux/mm.h
>>@@ -119,6 +119,7 @@ extern unsigned int kobjsize(const void *objp);
>> #define VM_SAO		0x20000000	/* Strong Access Ordering (powerpc) */
>> #define VM_PFN_AT_MMAP	0x40000000	/* PFNMAP vma that is fully mapped at mmap time */
>> #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
>>+#define VM_VOLATILE	0x100000000	/* Pages in the vma could be discardable without swap */
>>
>> /* Bits set in the VMA until the stack is in its final location */
>> #define VM_STACK_INCOMPLETE_SETUP	(VM_RAND_READ | VM_SEQ_READ)
>>@@ -143,7 +144,7 @@ extern unsigned int kobjsize(const void *objp);
>>  * Special vmas that are non-mergable, non-mlock()able.
>>  * Note: mm/huge_memory.c VM_NO_THP depends on this definition.
>>  */
>>-#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP)
>>+#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP | VM_VOLATILE)
>>
>> /*
>>  * mapping from the currently active vm_flags protection bits (the
>>@@ -872,11 +873,11 @@ static inline int page_mapped(struct page *page)
>> #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
>> #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
>> #define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
>>-
>>+#define VM_FAULT_SIGSEG	0x0800	/* -> There is no vma */
>> #define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large hwpoison */
>>
>> #define VM_FAULT_ERROR	(VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON | \
>>-			 VM_FAULT_HWPOISON_LARGE)
>>+			 VM_FAULT_HWPOISON_LARGE | VM_FAULT_SIGSEG)
>>
>> /* Encode hstate index for a hwpoisoned large page */
>> #define VM_FAULT_SET_HINDEX(x) ((x) << 12)
>>diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>>index 3fce545..735d7a3 100644
>>--- a/include/linux/rmap.h
>>+++ b/include/linux/rmap.h
>>@@ -67,6 +67,9 @@ struct anon_vma_chain {
>> 	struct list_head same_anon_vma;	/* locked by anon_vma->mutex */
>> };
>>
>>+void volatile_lock(struct vm_area_struct *vma);
>>+void volatile_unlock(struct vm_area_struct *vma);
>>+
>> #ifdef CONFIG_MMU
>> static inline void get_anon_vma(struct anon_vma *anon_vma)
>> {
>>@@ -170,6 +173,7 @@ enum ttu_flags {
>> 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
>> 	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
>> 	TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
>>+	TTU_IGNORE_VOLATILE = (1 << 11),/* ignore volatile */
>> };
>> #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
>>
>>@@ -194,6 +198,21 @@ static inline pte_t *page_check_address(struct page *page, struct mm_struct *mm,
>> 	return ptep;
>> }
>>
>>+pte_t *__page_check_volatile_address(struct page *, struct mm_struct *,
>>+				unsigned long, spinlock_t **);
>>+
>>+static inline pte_t *page_check_volatile_address(struct page *page,
>>+					struct mm_struct *mm,
>>+					unsigned long address,
>>+					spinlock_t **ptlp)
>>+{
>>+	pte_t *ptep;
>>+
>>+	__cond_lock(*ptlp, ptep = __page_check_volatile_address(page,
>>+						mm, address, ptlp));
>>+	return ptep;
>>+}
>>+
>> /*
>>  * Used by swapoff to help locate where page is expected in vma.
>>  */
>>@@ -257,5 +276,6 @@ static inline int page_mkclean(struct page *page)
>> #define SWAP_AGAIN	1
>> #define SWAP_FAIL	2
>> #define SWAP_MLOCK	3
>>+#define SWAP_DISCARD	4
>>
>> #endif /* _LINUX_RMAP_H */
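
The two-lock rule from the MADV_VOLATILE comment in mman-common.h
above, combined with the volatile_lock()/volatile_unlock() helpers just
declared, distills to the following kernel-context pseudocode
(illustrative only, not part of the patch, and not compilable on its
own):

	/*
	 * Writer: changing VM_VOLATILE needs mmap_sem held for write
	 * plus volatile_lock (the anon_vma lock, per mm/rmap.c below).
	 */
	down_write(&mm->mmap_sem);
	volatile_lock(vma);
	vma->vm_flags |= VM_VOLATILE;	/* or &= ~VM_VOLATILE */
	volatile_unlock(vma);
	up_write(&mm->mmap_sem);

	/* Reader: testing the flag needs only volatile_lock. */
	volatile_lock(vma);
	if (vma->vm_flags & VM_VOLATILE)
		/* page may be discarded instead of swapped */;
	volatile_unlock(vma);
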
>>diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
>>index 57f7b10..3f9a40b 100644
>>--- a/include/linux/vm_event_item.h
>>+++ b/include/linux/vm_event_item.h
>>@@ -23,7 +23,7 @@
>>
>> enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>> 		FOR_ALL_ZONES(PGALLOC),
>>-		PGFREE, PGACTIVATE, PGDEACTIVATE,
>>+		PGFREE, PGVOLATILE, PGACTIVATE, PGDEACTIVATE,
>> 		PGFAULT, PGMAJFAULT,
>> 		FOR_ALL_ZONES(PGREFILL),
>> 		FOR_ALL_ZONES(PGSTEAL_KSWAPD),
>>diff --git a/mm/madvise.c b/mm/madvise.c
>>index 14d260f..53a19d8 100644
>>--- a/mm/madvise.c
>>+++ b/mm/madvise.c
>>@@ -86,6 +86,13 @@ static long madvise_behavior(struct vm_area_struct * vma,
>> 		if (error)
>> 			goto out;
>> 		break;
>>+	case MADV_VOLATILE:
>>+		if (vma->vm_flags & VM_LOCKED) {
>>+			error = -EINVAL;
>>+			goto out;
>>+		}
>>+		new_flags |= VM_VOLATILE;
>>+		break;
>> 	}
>>
>> 	if (new_flags == vma->vm_flags) {
>>@@ -118,9 +125,13 @@ static long madvise_behavior(struct vm_area_struct * vma,
>> success:
>> 	/*
>> 	 * vm_flags is protected by the mmap_sem held in write mode.
>>+	 * In case of MADV_VOLATILE, we need anon_vma_lock additionally.
>> 	 */
>>+	if (behavior == MADV_VOLATILE)
>>+		volatile_lock(vma);
>> 	vma->vm_flags = new_flags;
>>-
>>+	if (behavior == MADV_VOLATILE)
>>+		volatile_unlock(vma);
>> out:
>> 	if (error == -ENOMEM)
>> 		error = -EAGAIN;
>>@@ -310,6 +321,7 @@ madvise_behavior_valid(int behavior)
>> #endif
>> 	case MADV_DONTDUMP:
>> 	case MADV_DODUMP:
>>+	case MADV_VOLATILE:
>> 		return 1;
>>
>> 	default:
>>@@ -385,6 +397,11 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>> 		goto out;
>> 	len = (len_in + ~PAGE_MASK) & PAGE_MASK;
>>
>>+	if (behavior != MADV_VOLATILE)
>>+		len = (len_in + ~PAGE_MASK) & PAGE_MASK;
>>+	else
>>+		len = len_in & PAGE_MASK;
>>+
>> 	/* Check to see whether len was rounded up from small -ve to zero */
>> 	if (len_in && !len)
>> 		goto out;
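
One behavioral detail worth calling out in the madvise() hunk above:
for MADV_VOLATILE the length is truncated down to page granularity
rather than rounded up, so a partially covered page is never marked
discardable. A hypothetical illustration (buffer name and sizes are
made up):

#include <sys/mman.h>

#ifndef MADV_VOLATILE
#define MADV_VOLATILE	5	/* from this patch */
#endif

void hint_example(char *buf)	/* buf: page-aligned, >= 3 pages */
{
	/* Other hints round len UP: 4096 + 100 bytes -> 2 pages. */
	madvise(buf, 4096 + 100, MADV_DONTNEED);

	/* MADV_VOLATILE truncates len DOWN: 4096 + 100 bytes -> only
	 * the first page may be discarded; the partial page is safe. */
	madvise(buf, 4096 + 100, MADV_VOLATILE);
}
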
>>diff --git a/mm/memory.c b/mm/memory.c
>>index 5736170..b5e4996 100644
>>--- a/mm/memory.c
>>+++ b/mm/memory.c
>>@@ -57,6 +57,7 @@
>> #include <linux/swapops.h>
>> #include <linux/elf.h>
>> #include <linux/gfp.h>
>>+#include <linux/mempolicy.h>
>>
>> #include <asm/io.h>
>> #include <asm/pgalloc.h>
>>@@ -3446,6 +3447,37 @@ int handle_pte_fault(struct mm_struct *mm,
>> 				return do_linear_fault(mm, vma, address,
>> 						pte, pmd, flags, entry);
>> 			}
>>+			if (vma->vm_flags & VM_VOLATILE) {
>>+				struct vm_area_struct *prev;
>>+
>>+				up_read(&mm->mmap_sem);
>>+				down_write(&mm->mmap_sem);
>>+				vma = find_vma_prev(mm, address, &prev);
>>+
>>+				/* Someone unmapped the vma */
>>+				if (unlikely(!vma) || vma->vm_start > address) {
>>+					downgrade_write(&mm->mmap_sem);
>>+					return VM_FAULT_SIGSEG;
>>+				}
>>+				/* If the flag is gone, someone else already handled it */
>>+				if (vma->vm_flags & VM_VOLATILE) {
>>+					/*
>>+					 * From now on, we hold mmap_sem as
>>+					 * exclusive.
>>+					 */
>>+					volatile_lock(vma);
>>+					vma->vm_flags &= ~VM_VOLATILE;
>>+					volatile_unlock(vma);
>>+
>>+					vma_merge(mm, prev, vma->vm_start,
>>+						vma->vm_end, vma->vm_flags,
>>+						vma->anon_vma, vma->vm_file,
>>+						vma->vm_pgoff, vma_policy(vma));
>>+
>>+				}
>>+
>>+				downgrade_write(&mm->mmap_sem);
>>+			}
>> 			return do_anonymous_page(mm, vma, address,
>> 						pte, pmd, flags);
>> 		}
>>diff --git a/mm/migrate.c b/mm/migrate.c
>>index 77ed2d7..08b009c 100644
>>--- a/mm/migrate.c
>>+++ b/mm/migrate.c
>>@@ -800,7 +800,8 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>> 	}
>>
>> 	/* Establish migration ptes or remove ptes */
>>-	try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
>>+	try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|
>>+			TTU_IGNORE_ACCESS|TTU_IGNORE_VOLATILE);
>>
>> skip_unmap:
>> 	if (!page_mapped(page))
>>@@ -915,7 +916,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
>> 	if (PageAnon(hpage))
>> 		anon_vma = page_get_anon_vma(hpage);
>>
>>-	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
>>+	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|
>>+			TTU_IGNORE_ACCESS|TTU_IGNORE_VOLATILE);
>>
>> 	if (!page_mapped(hpage))
>> 		rc = move_to_new_page(new_hpage, hpage, 1, mode);
>>diff --git a/mm/rmap.c b/mm/rmap.c
>>index 0f3b7cd..1a0ab2b 100644
>>--- a/mm/rmap.c
>>+++ b/mm/rmap.c
>>@@ -603,6 +603,57 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
>> 	return vma_address(page, vma);
>> }
>>
>>+pte_t *__page_check_volatile_address(struct page *page, struct mm_struct *mm,
>>+			unsigned long address, spinlock_t **ptlp)
>>+{
>>+	pgd_t *pgd;
>>+	pud_t *pud;
>>+	pmd_t *pmd;
>>+	pte_t *pte;
>>+	spinlock_t *ptl;
>>+
>>+	swp_entry_t entry = { .val = page_private(page) };
>>+
>>+	if (unlikely(PageHuge(page))) {
>>+		pte = huge_pte_offset(mm, address);
>>+		ptl = &mm->page_table_lock;
>>+		goto check;
>>+	}
>>+
>>+	pgd = pgd_offset(mm, address);
>>+	if (!pgd_present(*pgd))
>>+		return NULL;
>>+
>>+	pud = pud_offset(pgd, address);
>>+	if (!pud_present(*pud))
>>+		return NULL;
>>+
>>+	pmd = pmd_offset(pud, address);
>>+	if (!pmd_present(*pmd))
>>+		return NULL;
>>+	if (pmd_trans_huge(*pmd))
>>+		return NULL;
>>+
>>+	pte = pte_offset_map(pmd, address);
>>+	ptl = pte_lockptr(mm, pmd);
>>+check:
>>+	spin_lock(ptl);
>>+	if (PageAnon(page)) {
>>+		if (!pte_present(*pte) && entry.val ==
>>+				pte_to_swp_entry(*pte).val) {
>>+			*ptlp = ptl;
>>+			return pte;
>>+		}
>>+	} else {
>>+		if (pte_none(*pte)) {
>>+			*ptlp = ptl;
>>+			return pte;
>>+		}
>>+	}
>>+	pte_unmap_unlock(pte, ptl);
>>+	return NULL;
>>+}
>>+
>> /*
>>  * Check that @page is mapped at @address into @mm.
>>  *
>>@@ -1218,6 +1269,35 @@ out:
>> 	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>> }
>>
>>+int try_to_zap_one(struct page *page, struct vm_area_struct *vma,
>>+		unsigned long address)
>>+{
>>+	struct mm_struct *mm = vma->vm_mm;
>>+	pte_t *pte;
>>+	pte_t pteval;
>>+	spinlock_t *ptl;
>>+
>>+	pte = page_check_volatile_address(page, mm, address, &ptl);
>>+	if (!pte)
>>+		return 0;
>>+
>>+	/* Nuke the page table entry. */
>>+	flush_cache_page(vma, address, page_to_pfn(page));
>>+	pteval = ptep_clear_flush(vma, address, pte);
>>+
>>+	if (PageAnon(page)) {
>>+		swp_entry_t entry = { .val = page_private(page) };
>>+		if (PageSwapCache(page)) {
>>+			dec_mm_counter(mm, MM_SWAPENTS);
>>+			swap_free(entry);
>>+		}
>>+	}
>>+
>>+	pte_unmap_unlock(pte, ptl);
>>+	mmu_notifier_invalidate_page(mm, address);
>>+	return 1;
>>+}
>>+
>> /*
>>  * Subfunctions of try_to_unmap: try_to_unmap_one called
>>  * repeatedly from try_to_unmap_ksm, try_to_unmap_anon or try_to_unmap_file.
>>@@ -1494,6 +1574,10 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
>> 	struct anon_vma *anon_vma;
>> 	struct anon_vma_chain *avc;
>> 	int ret = SWAP_AGAIN;
>>+	bool is_volatile = true;
>>+
>>+	if (flags & TTU_IGNORE_VOLATILE)
>>+		is_volatile = false;
>>
>> 	anon_vma = page_lock_anon_vma(page);
>> 	if (!anon_vma)
>>@@ -1512,17 +1596,40 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
>> 		 * temporary VMAs until after exec() completes.
>> 		 */
>> 		if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
>>-				is_vma_temporary_stack(vma))
>>+				is_vma_temporary_stack(vma)) {
>>+			is_volatile = false;
>> 			continue;
>>+		}
>>
>> 		address = vma_address(page, vma);
>> 		if (address == -EFAULT)
>> 			continue;
>>+		/*
>>+		 * A volatile page will only be purged if ALL vmas
>>+		 * pointing to it are VM_VOLATILE.
>>+		 */
>>+		if (!(vma->vm_flags & VM_VOLATILE))
>>+			is_volatile = false;
>>+
>> 		ret = try_to_unmap_one(page, vma, address, flags);
>> 		if (ret != SWAP_AGAIN || !page_mapped(page))
>> 			break;
>> 	}
>>
>>+	if (page_mapped(page) || is_volatile == false)
>>+		goto out;
>>+
>>+	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
>>+		struct vm_area_struct *vma = avc->vma;
>>+		unsigned long address;
>>+
>>+		address = vma_address(page, vma);
>>+		try_to_zap_one(page, vma, address);
>>+	}
>>+	/* We're throwing this page out, so mark it clean */
>>+	ClearPageDirty(page);
>>+	ret = SWAP_DISCARD;
>>+out:
>> 	page_unlock_anon_vma(anon_vma);
>> 	return ret;
>> }
>>@@ -1651,6 +1758,7 @@ out:
>>  * SWAP_AGAIN	- we missed a mapping, try again later
>>  * SWAP_FAIL	- the page is unswappable
>>  * SWAP_MLOCK	- page is mlocked.
>>+ * SWAP_DISCARD	- page is volatile.
>>  */
>> int try_to_unmap(struct page *page, enum ttu_flags flags)
>> {
>>@@ -1665,7 +1773,8 @@ int try_to_unmap(struct page *page, enum ttu_flags flags)
>> 		ret = try_to_unmap_anon(page, flags);
>> 	else
>> 		ret = try_to_unmap_file(page, flags);
>>-	if (ret != SWAP_MLOCK && !page_mapped(page))
>>+	if (ret != SWAP_MLOCK && !page_mapped(page) &&
>>+			ret != SWAP_DISCARD)
>> 		ret = SWAP_SUCCESS;
>> 	return ret;
>> }
>>@@ -1707,6 +1816,18 @@ void __put_anon_vma(struct anon_vma *anon_vma)
>> 	anon_vma_free(anon_vma);
>> }
>>
>>+void volatile_lock(struct vm_area_struct *vma)
>>+{
>>+	if (vma->anon_vma)
>>+		anon_vma_lock(vma->anon_vma);
>>+}
>>+
>>+void volatile_unlock(struct vm_area_struct *vma)
>>+{
>>+	if (vma->anon_vma)
>>+		anon_vma_unlock(vma->anon_vma);
>>+}
>>+
>> #ifdef CONFIG_MIGRATION
>> /*
>>  * rmap_walk() and its helpers rmap_walk_anon() and rmap_walk_file():
>>diff --git a/mm/vmscan.c b/mm/vmscan.c
>>index 99b434b..4e463a4 100644
>>--- a/mm/vmscan.c
>>+++ b/mm/vmscan.c
>>@@ -630,6 +630,9 @@ static enum page_references page_check_references(struct page *page,
>> 	if (vm_flags & VM_LOCKED)
>> 		return PAGEREF_RECLAIM;
>>
>>+	if (vm_flags & VM_VOLATILE)
>>+		return PAGEREF_RECLAIM;
>>+
>> 	if (referenced_ptes) {
>> 		if (PageSwapBacked(page))
>> 			return PAGEREF_ACTIVATE;
>>@@ -789,6 +792,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>> 	 */
>
>Hi Minchan,
>
>IIUC, an anonymous page has already been added to the swap cache
>through add_to_swap(), called from shrink_page_list(), but I can't
>figure out where you remove it from the swap cache.
>

Yeah, that is all done in shrink_page_list(). I mean, you could avoid
the process of adding the page to the swap cache and removing it from
the swap cache, since your idea doesn't need swapout.

>Regards,
>Wanpeng Li
>
>> 	if (page_mapped(page) && mapping) {
>> 		switch (try_to_unmap(page, TTU_UNMAP)) {
>>+		case SWAP_DISCARD:
>>+			count_vm_event(PGVOLATILE);
>>+			goto discard_page;
>> 		case SWAP_FAIL:
>> 			goto activate_locked;
>> 		case SWAP_AGAIN:
>>@@ -857,6 +863,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>> 		}
>> 	}
>>
>>+discard_page:
>> 	/*
>> 	 * If the page has buffers, try to free the buffer mappings
>> 	 * associated with this page. If we succeed we try to free
>>diff --git a/mm/vmstat.c b/mm/vmstat.c
>>index df7a674..410caf5 100644
>>--- a/mm/vmstat.c
>>+++ b/mm/vmstat.c
>>@@ -734,6 +734,7 @@ const char * const vmstat_text[] = {
>> 	TEXTS_FOR_ZONES("pgalloc")
>>
>> 	"pgfree",
>>+	"pgvolatile",
>> 	"pgactivate",
>> 	"pgdeactivate",
>>
>>--
>>1.7.9.5
>>
>>--
>>Kind regards,
>>Minchan Kim
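
With this patch applied, discards should be observable through the new
"pgvolatile" counter exported via /proc/vmstat (added by the vmstat.c
hunk above). A throwaway user-space checker, as an illustration:

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[128];
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 1;
	/* Print the counter of pages discarded via MADV_VOLATILE. */
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "pgvolatile", 10))
			fputs(line, stdout);	/* e.g. "pgvolatile 1234" */
	fclose(f);
	return 0;
}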