This is still [RFC v3] because it has only passed my simple test with a
tweaked TCMalloc. I hope to get more input from user-space allocator people,
and for them to test the patch with their allocators, because getting real
value from it might require changing the design of arena management.

Changelog from v2

 * Remove madvise(addr, length, MADV_NOVOLATILE).
 * Add a vmstat counter for the number of discarded volatile pages.
 * Discard volatile pages without promotion in the reclaim path.

This is based on v3.6.

- What is madvise(addr, length, MADV_VOLATILE)?

  It's a hint that the user delivers to the kernel so that the kernel can
  *discard* pages in the range at any time.

- What happens if the user accesses a page (i.e., a virtual address) that the
  kernel has discarded?

  The user sees zero-fill-on-demand pages, as with madvise(DONTNEED).

- What happens if the user accesses a page (i.e., a virtual address) that the
  kernel has not discarded?

  The user sees the old data without a page fault.

- How does it differ from madvise(DONTNEED)?

  System call semantics

  DONTNEED guarantees that the user always sees zero-fill pages after calling
  madvise, while with VOLATILE the user can see either zero-fill pages or the
  old data.

  Internal implementation

  madvise(DONTNEED) has to zap all mapped pages in the range, so its overhead
  increases linearly with the number of mapped pages. On top of that, if the
  user writes to a zapped page, page fault + page allocation + memset have to
  happen.

  madvise(VOLATILE) only marks a flag on the range (i.e., the VMA). It doesn't
  touch the pages at all, so the overhead of the system call should be very
  small. If memory pressure happens, the VM can discard pages in VMAs marked
  VOLATILE. If the user writes to an address whose page the VM has discarded,
  he sees a zero-fill page, so the cost is the same as DONTNEED; but if memory
  pressure isn't severe, the user sees the old data without (page fault + page
  allocation + memset).

  The VOLATILE mark is removed in the page fault handler when the first page
  fault occurs in a marked VMA, so subsequent page faults follow the normal
  page fault path.
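  For allocator people who want to try this, the user-visible semantics above
  could be exercised roughly as follows. This is only a sketch under
  assumptions: MADV_VOLATILE is not in any released header, so the value 5 is
  taken from this RFC, and inner_pages() and release_chunk() are hypothetical
  helper names that belong neither to the patch nor to any allocator. Because
  this version of the patch rounds the length down to whole pages for
  MADV_VOLATILE, the helper shrinks the range to the pages it fully covers,
  and it falls back to MADV_DONTNEED when the kernel rejects the hint with
  EINVAL.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* ASSUMPTION: value proposed by this RFC; it is not in mainline headers. */
#ifndef MADV_VOLATILE
#define MADV_VOLATILE 5
#endif

/*
 * Shrink [addr, addr + len) to the whole pages it fully covers.
 * Returns the size in bytes of the shrunken range (0 if it covers no
 * full page) and stores its start in *start_out.
 */
static size_t inner_pages(void *addr, size_t len, size_t page, void **start_out)
{
	uintptr_t begin = (uintptr_t)addr;
	uintptr_t end = begin + len;

	assert(page != 0 && (page & (page - 1)) == 0);	/* power of two */
	begin = (begin + page - 1) & ~(uintptr_t)(page - 1);	/* round up */
	end &= ~(uintptr_t)(page - 1);				/* round down */
	*start_out = (void *)begin;
	return end > begin ? (size_t)(end - begin) : 0;
}

/*
 * Hypothetical allocator hook: hand a freed chunk back to the kernel.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int release_chunk(void *addr, size_t len)
{
	void *start;
	size_t n = inner_pages(addr, len, (size_t)sysconf(_SC_PAGESIZE), &start);

	if (n == 0)
		return 0;	/* nothing page-sized to give back */
	if (madvise(start, n, MADV_VOLATILE) == 0)
		return 0;	/* lazy path: pages go away only under pressure */
	if (errno != EINVAL)
		return -1;
	/* Kernel without this patch: fall back to the eager variant. */
	return madvise(start, n, MADV_DONTNEED);
}
```

  After release_chunk() returns 0, the allocator can reuse the chunk by simply
  writing to it. On either path each byte then reads as the old data or as
  zero, so the allocator must not rely on the old contents.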
  Because the mark is cleared on that first fault, the user doesn't need a
  madvise(MADV_NOVOLATILE) interface.

- What's the benefit compared to DONTNEED?

  1. The system call overhead is smaller because VOLATILE just marks a flag
     on the VMA instead of zapping all the pages in the range.

  2. It has a chance to eliminate overheads (e.g., page fault + page
     allocation + memset(PAGE_SIZE)).

- Is there any drawback?

  DONTNEED doesn't need exclusive mmap_sem locking, so concurrent page faults
  from other threads are allowed. VOLATILE needs mmap_sem exclusively, so
  other threads are blocked if they try to access not-yet-mapped pages. That's
  why I designed madvise(VOLATILE) so that its overhead is as small as
  possible.

  The other concern with exclusive mmap_sem is a page fault in a
  VOLATILE-marked VMA. We have to remove the flag from the VMA and merge
  adjacent VMAs, which needs mmap_sem exclusively. That can slow down page
  fault handling and prevent concurrent page faults. But such handling is
  needed just once, on the first page fault after we mark the VMA VOLATILE,
  and only if memory pressure has happened so that the page was discarded. So
  it shouldn't be common, and the benefit we get from this feature should be
  bigger than the loss.

- What is it targeting?

  Firstly, user-space allocators like ptmalloc and tcmalloc, or the heap
  management of virtual machines like Dalvik. It also comes in handy for
  embedded systems that have no swap device and so can't reclaim anonymous
  pages. By discarding instead of swapping, it can be used on non-swap
  systems. For that, we have to age the anon LRU list even though we have no
  swap, because I don't want volatile pages to be discarded with top priority
  when memory pressure happens: volatile in this patch means "We don't need to
  swap out because the user can handle the situation where data disappears
  suddenly", NOT "They are useless, so hurry up and reclaim them". So I want
  to apply the same aging rule as normal pages to them.

  Background aging of anonymous pages on non-swap systems would be a
  trade-off for getting a good feature.
  In fact, we did that until [1] was merged two years ago, and I believe the
  gain from this patch will beat the loss from anon LRU aging overhead once
  all allocators start to use madvise. (This patch doesn't include background
  aging for non-swap systems, but it's trivial to add if we decide to.)

[1] 74e3f3c3, vmscan: prevent background aging of anon page in no swap system

Cc: Michael Kerrisk <mtk.manpages@xxxxxxxxx>
Cc: Arun Sharma <asharma@xxxxxx>
Cc: sanjay@xxxxxxxxxx
Cc: Paul Turner <pjt@xxxxxxxxxx>
Cc: David Rientjes <rientjes@xxxxxxxxxx>
Cc: John Stultz <john.stultz@xxxxxxxxxx>
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: Christoph Lameter <cl@xxxxxxxxx>
Cc: Android Kernel Team <kernel-team@xxxxxxxxxxx>
Cc: Robert Love <rlove@xxxxxxxxxx>
Cc: Mel Gorman <mel@xxxxxxxxx>
Cc: Hugh Dickins <hughd@xxxxxxxxxx>
Cc: Dave Hansen <dave@xxxxxxxxxxxxxxxxxx>
Cc: Rik van Riel <riel@xxxxxxxxxx>
Cc: Dave Chinner <david@xxxxxxxxxxxxx>
Cc: Neil Brown <neilb@xxxxxxx>
Cc: Mike Hommey <mh@xxxxxxxxxxxx>
Cc: Taras Glek <tglek@xxxxxxxxxxx>
Cc: KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxx>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
Signed-off-by: Minchan Kim <minchan@xxxxxxxxxx>
---
 arch/x86/mm/fault.c               |    2 +
 include/asm-generic/mman-common.h |    6 ++
 include/linux/mm.h                |    7 ++-
 include/linux/rmap.h              |   20 ++++++
 include/linux/vm_event_item.h     |    2 +-
 mm/madvise.c                      |   19 +++++-
 mm/memory.c                       |   32 ++++++++++
 mm/migrate.c                      |    6 +-
 mm/rmap.c                         |  125 ++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c                       |    7 +++
 mm/vmstat.c                       |    1 +
 11 files changed, 218 insertions(+), 9 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 76dcd9d..a734166 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -879,6 +879,8 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
 		}
 
 		out_of_memory(regs, error_code, address);
+	} else if (fault & VM_FAULT_BAD_AREA) {
+		bad_area(regs, error_code, address);
 	} else {
 		if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|
 			     VM_FAULT_HWPOISON_LARGE))
diff --git a/include/asm-generic/mman-common.h b/include/asm-generic/mman-common.h
index d030d2c..f07781e 100644
--- a/include/asm-generic/mman-common.h
+++ b/include/asm-generic/mman-common.h
@@ -34,6 +34,12 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+/*
+ * Unlike other flags, we need two locks to protect MADV_VOLATILE.
+ * For changing the flag, we need mmap_sem's write lock and volatile_lock
+ * while we just need volatile_lock in case of reading the flag.
+ */
+#define MADV_VOLATILE	5		/* pages will disappear suddenly */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 311be90..89027b5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -119,6 +119,7 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_SAO		0x20000000	/* Strong Access Ordering (powerpc) */
 #define VM_PFN_AT_MMAP	0x40000000	/* PFNMAP vma that is fully mapped at mmap time */
 #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
+#define VM_VOLATILE	0x100000000	/* Pages in the vma could be discarable without swap */
 
 /* Bits set in the VMA until the stack is in its final location */
 #define VM_STACK_INCOMPLETE_SETUP	(VM_RAND_READ | VM_SEQ_READ)
@@ -143,7 +144,7 @@ extern unsigned int kobjsize(const void *objp);
  * Special vmas that are non-mergable, non-mlock()able.
  * Note: mm/huge_memory.c VM_NO_THP depends on this definition.
  */
-#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP)
+#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP | VM_VOLATILE)
 
 /*
  * mapping from the currently active vm_flags protection bits (the
@@ -872,11 +873,11 @@ static inline int page_mapped(struct page *page)
 #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
 #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
 #define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
-
+#define VM_FAULT_SIGSEG	0x0800	/* -> There is no vma */
 #define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large hwpoison */
 
 #define VM_FAULT_ERROR	(VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON | \
-			 VM_FAULT_HWPOISON_LARGE)
+			 VM_FAULT_HWPOISON_LARGE | VM_FAULT_SIGSEG)
 
 /* Encode hstate index for a hwpoisoned large page */
 #define VM_FAULT_SET_HINDEX(x) ((x) << 12)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 3fce545..735d7a3 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -67,6 +67,9 @@ struct anon_vma_chain {
 	struct list_head same_anon_vma;	/* locked by anon_vma->mutex */
 };
 
+void volatile_lock(struct vm_area_struct *vma);
+void volatile_unlock(struct vm_area_struct *vma);
+
 #ifdef CONFIG_MMU
 static inline void get_anon_vma(struct anon_vma *anon_vma)
 {
@@ -170,6 +173,7 @@ enum ttu_flags {
 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
 	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
 	TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
+	TTU_IGNORE_VOLATILE = (1 << 11),/* ignore volatile */
 };
 #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
 
@@ -194,6 +198,21 @@ static inline pte_t *page_check_address(struct page *page, struct mm_struct *mm,
 	return ptep;
 }
 
+pte_t *__page_check_volatile_address(struct page *, struct mm_struct *,
+					unsigned long, spinlock_t **);
+
+static inline pte_t *page_check_volatile_address(struct page *page,
+						struct mm_struct *mm,
+						unsigned long address,
+						spinlock_t **ptlp)
+{
+	pte_t *ptep;
+
+	__cond_lock(*ptlp, ptep = __page_check_volatile_address(page,
+					mm, address, ptlp));
+	return ptep;
+}
+
 /*
  * Used by swapoff to help locate where page is expected in vma.
  */
@@ -257,5 +276,6 @@ static inline int page_mkclean(struct page *page)
 #define SWAP_AGAIN	1
 #define SWAP_FAIL	2
 #define SWAP_MLOCK	3
+#define SWAP_DISCARD	4
 
 #endif	/* _LINUX_RMAP_H */
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 57f7b10..3f9a40b 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -23,7 +23,7 @@
 
 enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		FOR_ALL_ZONES(PGALLOC),
-		PGFREE, PGACTIVATE, PGDEACTIVATE,
+		PGFREE, PGVOLATILE, PGACTIVATE, PGDEACTIVATE,
 		PGFAULT, PGMAJFAULT,
 		FOR_ALL_ZONES(PGREFILL),
 		FOR_ALL_ZONES(PGSTEAL_KSWAPD),
diff --git a/mm/madvise.c b/mm/madvise.c
index 14d260f..53a19d8 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -86,6 +86,13 @@ static long madvise_behavior(struct vm_area_struct * vma,
 		if (error)
 			goto out;
 		break;
+	case MADV_VOLATILE:
+		if (vma->vm_flags & VM_LOCKED) {
+			error = -EINVAL;
+			goto out;
+		}
+		new_flags |= VM_VOLATILE;
+		break;
 	}
 
 	if (new_flags == vma->vm_flags) {
@@ -118,9 +125,13 @@ static long madvise_behavior(struct vm_area_struct * vma,
 success:
 	/*
 	 * vm_flags is protected by the mmap_sem held in write mode.
+	 * In caes of MADV_VOLATILE, we need anon_vma_lock additionally.
 	 */
+	if (behavior == MADV_VOLATILE)
+		volatile_lock(vma);
 	vma->vm_flags = new_flags;
-
+	if (behavior == MADV_VOLATILE)
+		volatile_unlock(vma);
 out:
 	if (error == -ENOMEM)
 		error = -EAGAIN;
@@ -310,6 +321,7 @@ madvise_behavior_valid(int behavior)
 #endif
 	case MADV_DONTDUMP:
 	case MADV_DODUMP:
+	case MADV_VOLATILE:
 		return 1;
 
 	default:
@@ -385,6 +397,11 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 		goto out;
 	len = (len_in + ~PAGE_MASK) & PAGE_MASK;
 
+	if (behavior != MADV_VOLATILE)
+		len = (len_in + ~PAGE_MASK) & PAGE_MASK;
+	else
+		len = len_in & PAGE_MASK;
+
 	/* Check to see whether len was rounded up from small -ve to zero */
 	if (len_in && !len)
 		goto out;
diff --git a/mm/memory.c b/mm/memory.c
index 5736170..b5e4996 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -57,6 +57,7 @@
 #include <linux/swapops.h>
 #include <linux/elf.h>
 #include <linux/gfp.h>
+#include <linux/mempolicy.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -3446,6 +3447,37 @@ int handle_pte_fault(struct mm_struct *mm,
 					return do_linear_fault(mm, vma, address,
 						pte, pmd, flags, entry);
 			}
+			if (vma->vm_flags & VM_VOLATILE) {
+				struct vm_area_struct *prev;
+
+				up_read(&mm->mmap_sem);
+				down_write(&mm->mmap_sem);
+				vma = find_vma_prev(mm, address, &prev);
+
+				/* Someone unmap the vma */
+				if (unlikely(!vma) || vma->vm_start > address) {
+					downgrade_write(&mm->mmap_sem);
+					return VM_FAULT_SIGSEG;
+				}
+				/* Someone else already hanlded */
+				if (vma->vm_flags & VM_VOLATILE) {
+					/*
+					 * From now on, we hold mmap_sem as
+					 * exclusive.
+					 */
+					volatile_lock(vma);
+					vma->vm_flags &= ~VM_VOLATILE;
+					volatile_unlock(vma);
+
+					vma_merge(mm, prev, vma->vm_start,
+						vma->vm_end, vma->vm_flags,
+						vma->anon_vma, vma->vm_file,
+						vma->vm_pgoff, vma_policy(vma));
+
+				}
+
+				downgrade_write(&mm->mmap_sem);
+			}
 			return do_anonymous_page(mm, vma, address,
 						 pte, pmd, flags);
 		}
diff --git a/mm/migrate.c b/mm/migrate.c
index 77ed2d7..08b009c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -800,7 +800,8 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	}
 
 	/* Establish migration ptes or remove ptes */
-	try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+	try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|
+			TTU_IGNORE_ACCESS|TTU_IGNORE_VOLATILE);
 
 skip_unmap:
 	if (!page_mapped(page))
@@ -915,7 +916,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 	if (PageAnon(hpage))
 		anon_vma = page_get_anon_vma(hpage);
 
-	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|
+			TTU_IGNORE_ACCESS|TTU_IGNORE_VOLATILE);
 
 	if (!page_mapped(hpage))
 		rc = move_to_new_page(new_hpage, hpage, 1, mode);
diff --git a/mm/rmap.c b/mm/rmap.c
index 0f3b7cd..1a0ab2b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -603,6 +603,57 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
 	return vma_address(page, vma);
 }
 
+pte_t *__page_check_volatile_address(struct page *page, struct mm_struct *mm,
+		unsigned long address, spinlock_t **ptlp)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+	spinlock_t *ptl;
+
+	swp_entry_t entry = { .val = page_private(page) };
+
+	if (unlikely(PageHuge(page))) {
+		pte = huge_pte_offset(mm, address);
+		ptl = &mm->page_table_lock;
+		goto check;
+	}
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		return NULL;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		return NULL;
+
+	pmd = pmd_offset(pud, address);
+	if (!pmd_present(*pmd))
+		return NULL;
+	if (pmd_trans_huge(*pmd))
+		return NULL;
+
+	pte = pte_offset_map(pmd, address);
+	ptl = pte_lockptr(mm, pmd);
+check:
+	spin_lock(ptl);
+	if (PageAnon(page)) {
+		if (!pte_present(*pte) && entry.val ==
+				pte_to_swp_entry(*pte).val) {
+			*ptlp = ptl;
+			return pte;
+		}
+	} else {
+		if (pte_none(*pte)) {
+			*ptlp = ptl;
+			return pte;
+		}
+	}
+	pte_unmap_unlock(pte, ptl);
+	return NULL;
+}
+
 /*
  * Check that @page is mapped at @address into @mm.
  *
@@ -1218,6 +1269,35 @@ out:
 	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 }
 
+int try_to_zap_one(struct page *page, struct vm_area_struct *vma,
+		unsigned long address)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pte_t *pte;
+	pte_t pteval;
+	spinlock_t *ptl;
+
+	pte = page_check_volatile_address(page, mm, address, &ptl);
+	if (!pte)
+		return 0;
+
+	/* Nuke the page table entry. */
+	flush_cache_page(vma, address, page_to_pfn(page));
+	pteval = ptep_clear_flush(vma, address, pte);
+
+	if (PageAnon(page)) {
+		swp_entry_t entry = { .val = page_private(page) };
+		if (PageSwapCache(page)) {
+			dec_mm_counter(mm, MM_SWAPENTS);
+			swap_free(entry);
+		}
+	}
+
+	pte_unmap_unlock(pte, ptl);
+	mmu_notifier_invalidate_page(mm, address);
+	return 1;
+}
+
 /*
  * Subfunctions of try_to_unmap: try_to_unmap_one called
  * repeatedly from try_to_unmap_ksm, try_to_unmap_anon or try_to_unmap_file.
@@ -1494,6 +1574,10 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
 	struct anon_vma *anon_vma;
 	struct anon_vma_chain *avc;
 	int ret = SWAP_AGAIN;
+	bool is_volatile = true;
+
+	if (flags & TTU_IGNORE_VOLATILE)
+		is_volatile = false;
 
 	anon_vma = page_lock_anon_vma(page);
 	if (!anon_vma)
@@ -1512,17 +1596,40 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
 		 * temporary VMAs until after exec() completes.
 		 */
 		if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
-				is_vma_temporary_stack(vma))
+				is_vma_temporary_stack(vma)) {
+			is_volatile = false;
 			continue;
+		}
 
 		address = vma_address(page, vma);
 		if (address == -EFAULT)
 			continue;
+		/*
+		 * A volatile page will only be purged if ALL vmas
+		 * pointing to it are VM_VOLATILE.
+		 */
+		if (!(vma->vm_flags & VM_VOLATILE))
+			is_volatile = false;
+
 		ret = try_to_unmap_one(page, vma, address, flags);
 		if (ret != SWAP_AGAIN || !page_mapped(page))
 			break;
 	}
 
+	if (page_mapped(page) || is_volatile == false)
+		goto out;
+
+	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
+		struct vm_area_struct *vma = avc->vma;
+		unsigned long address;
+
+		address = vma_address(page, vma);
+		try_to_zap_one(page, vma, address);
+	}
+	/* We're throwing this page out, so mark it clean */
+	ClearPageDirty(page);
+	ret = SWAP_DISCARD;
+out:
 	page_unlock_anon_vma(anon_vma);
 	return ret;
 }
@@ -1651,6 +1758,7 @@ out:
  * SWAP_AGAIN	- we missed a mapping, try again later
  * SWAP_FAIL	- the page is unswappable
  * SWAP_MLOCK	- page is mlocked.
+ * SWAP_DISCARD - page is volatile.
  */
 int try_to_unmap(struct page *page, enum ttu_flags flags)
 {
@@ -1665,7 +1773,8 @@ int try_to_unmap(struct page *page, enum ttu_flags flags)
 		ret = try_to_unmap_anon(page, flags);
 	else
 		ret = try_to_unmap_file(page, flags);
-	if (ret != SWAP_MLOCK && !page_mapped(page))
+	if (ret != SWAP_MLOCK && !page_mapped(page) &&
+					ret != SWAP_DISCARD)
 		ret = SWAP_SUCCESS;
 	return ret;
 }
@@ -1707,6 +1816,18 @@ void __put_anon_vma(struct anon_vma *anon_vma)
 	anon_vma_free(anon_vma);
 }
 
+void volatile_lock(struct vm_area_struct *vma)
+{
+	if (vma->anon_vma)
+		anon_vma_lock(vma->anon_vma);
+}
+
+void volatile_unlock(struct vm_area_struct *vma)
+{
+	if (vma->anon_vma)
+		anon_vma_unlock(vma->anon_vma);
+}
+
 #ifdef CONFIG_MIGRATION
 /*
  * rmap_walk() and its helpers rmap_walk_anon() and rmap_walk_file():
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 99b434b..4e463a4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -630,6 +630,9 @@ static enum page_references page_check_references(struct page *page,
 	if (vm_flags & VM_LOCKED)
 		return PAGEREF_RECLAIM;
 
+	if (vm_flags & VM_VOLATILE)
+		return PAGEREF_RECLAIM;
+
 	if (referenced_ptes) {
 		if (PageSwapBacked(page))
 			return PAGEREF_ACTIVATE;
@@ -789,6 +792,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 */
 		if (page_mapped(page) && mapping) {
 			switch (try_to_unmap(page, TTU_UNMAP)) {
+			case SWAP_DISCARD:
+				count_vm_event(PGVOLATILE);
+				goto discard_page;
 			case SWAP_FAIL:
 				goto activate_locked;
 			case SWAP_AGAIN:
@@ -857,6 +863,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			}
 		}
 
+discard_page:
 		/*
 		 * If the page has buffers, try to free the buffer mappings
 		 * associated with this page. If we succeed we try to free
diff --git a/mm/vmstat.c b/mm/vmstat.c
index df7a674..410caf5 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -734,6 +734,7 @@ const char * const vmstat_text[] = {
 	TEXTS_FOR_ZONES("pgalloc")
 
 	"pgfree",
+	"pgvolatile",
 	"pgactivate",
 	"pgdeactivate",
 
-- 
1.7.9.5