On Tue, Jul 08, 2014 at 03:03:38PM +0900, Minchan Kim wrote:
> Linux doesn't have an ability to free pages lazy while other OS
> already have been supported that named by madvise(MADV_FREE).
>
> The gain is clear that kernel can discard freed pages rather than
> swapping out or OOM if memory pressure happens.
>
> Without memory pressure, freed pages would be reused by userspace
> without another additional overhead(ex, page fault + allocation
> + zeroing).
>
> How to work is following as.
>
> When madvise syscall is called, VM clears dirty bit of ptes of
> the range. If memory pressure happens, VM checks dirty bit of
> page table and if it found still "clean", it means it's a
> "lazyfree pages" so VM could discard the page instead of swapping out.
> Once there was store operation for the page before VM peek a page
> to reclaim, dirty bit is set so VM can swap out the page instead of
> discarding.
>
> Firstly, heavy users would be general allocators(ex, jemalloc,
> tcmalloc and hope glibc supports it) and jemalloc/tcmalloc already
> have supported the feature for other OS(ex, FreeBSD)
>
> barrios@blaptop:~/benchmark/ebizzy$ lscpu
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                4
> On-line CPU(s) list:   0-3
> Thread(s) per core:    2
> Core(s) per socket:    2
> Socket(s):             1
> NUMA node(s):          1
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 42
> Stepping:              7
> CPU MHz:               2801.000
> BogoMIPS:              5581.64
> Virtualization:        VT-x
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              4096K
> NUMA node0 CPU(s):     0-3
>
> ebizzy benchmark(./ebizzy -S 10 -n 512)
>
>  vanilla-jemalloc               MADV_free-jemalloc
>
> 1 thread
> records:  10                    records:  10
> avg:      7682.10               avg:      15306.10
> std:      62.35(0.81%)          std:      347.99(2.27%)
> max:      7770.00               max:      15622.00
> min:      7598.00               min:      14772.00
>
> 2 thread
> records:  10                    records:  10
> avg:      12747.50              avg:      24171.00
> std:      792.06(6.21%)         std:      895.18(3.70%)
> max:      13337.00              max:      26023.00
> min:      10535.00              min:      23152.00
>
> 4 thread
> records:  10                    records:  10
> avg:      16474.60              avg:      33717.90
> std:      1496.45(9.08%)        std:      2008.97(5.96%)
> max:      17877.00              max:      35958.00
> min:      12224.00              min:      29565.00
>
> 8 thread
> records:  10                    records:  10
> avg:      16778.50              avg:      33308.10
> std:      825.53(4.92%)         std:      1668.30(5.01%)
> max:      17543.00              max:      36010.00
> min:      14576.00              min:      29577.00
>
> 16 thread
> records:  10                    records:  10
> avg:      20614.40              avg:      35516.30
> std:      602.95(2.92%)         std:      1283.65(3.61%)
> max:      21753.00              max:      37178.00
> min:      19605.00              min:      33217.00
>
> 32 thread
> records:  10                    records:  10
> avg:      22771.70              avg:      36018.50
> std:      598.94(2.63%)         std:      1046.76(2.91%)
> max:      24035.00              max:      37266.00
> min:      22108.00              min:      34149.00
>
> In summary, MADV_FREE is about 2 time faster than MADV_DONTNEED.
>
> Cc: Michael Kerrisk <mtk.manpages@xxxxxxxxx>
> Cc: Linux API <linux-api@xxxxxxxxxxxxxxx>
> Cc: Hugh Dickins <hughd@xxxxxxxxxx>
> Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
> Cc: KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxxxxxxx>
> Cc: Mel Gorman <mgorman@xxxxxxx>
> Cc: Jason Evans <je@xxxxxx>
> Cc: Zhang Yanfei <zhangyanfei@xxxxxxxxxxxxxx>
> Acked-by: Rik van Riel <riel@xxxxxxxxxx>
> Signed-off-by: Minchan Kim <minchan@xxxxxxxxxx>
> ---
>  include/linux/rmap.h                   |   9 ++-
>  include/linux/vm_event_item.h          |   1 +
>  include/uapi/asm-generic/mman-common.h |   1 +
>  mm/madvise.c                           | 135 +++++++++++++++++++++++++++++++++
>  mm/rmap.c                              |  42 +++++++++-
>  mm/vmscan.c                            |  40 ++++++++--
>  mm/vmstat.c                            |   1 +
>  7 files changed, 217 insertions(+), 12 deletions(-)
>

...

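For reference, a minimal userspace sketch of the interface the changelog describes, as an allocator might call it. This is only an illustration under stated assumptions: release_range() and demo_release() are hypothetical names, the fallback to MADV_DONTNEED on EINVAL is an assumed way of coping with kernels that do not know the flag, and the numeric value in the #ifndef guard is the one found in mainline's asm-generic header, not necessarily the one this patch assigns.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for madvise() and MAP_ANONYMOUS */
#endif
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumption: older headers may lack the flag; 8 is the asm-generic value. */
#ifndef MADV_FREE
#define MADV_FREE 8
#endif

/*
 * Hypothetical helper: tell the kernel that [addr, addr + len) is unused.
 * Try the lazy variant first; if the kernel rejects the advice with
 * EINVAL, fall back to MADV_DONTNEED. Returns 0 on success, -1 on error.
 */
static int release_range(void *addr, size_t len)
{
	if (madvise(addr, len, MADV_FREE) == 0)
		return 0;
	if (errno != EINVAL)
		return -1;
	return madvise(addr, len, MADV_DONTNEED);
}

/* Hypothetical demo: dirty a mapping, release it, then reuse it. */
int demo_release(void)
{
	size_t len = 4 * (size_t)sysconf(_SC_PAGESIZE);
	unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int ok;

	if (p == MAP_FAILED)
		return -1;

	memset(p, 0xaa, len);		/* make the pages dirty */
	if (release_range(p, len) != 0) {
		munmap(p, len);
		return -1;
	}

	/* Contents are now unspecified: either the old data or zeroes. */
	p[0] = 1;			/* a store makes the page live again */
	ok = (p[0] == 1);

	munmap(p, len);
	return ok ? 0 : -1;
}
```

The only property the demo checks is that a store after the hint is preserved; whether the old 0xaa bytes survive in between depends on memory pressure and must not be relied upon.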
> @@ -251,6 +260,124 @@ static long madvise_willneed(struct vm_area_struct *vma,
>  	return 0;
>  }
>  
> +static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> +	unsigned long end, struct mm_walk *walk)
> +
> +{
> +	struct madvise_free_private *fp = walk->private;
> +	struct mmu_gather *tlb = fp->tlb;
> +	struct mm_struct *mm = tlb->mm;
> +	struct vm_area_struct *vma = fp->vma;
> +	spinlock_t *ptl;
> +	pte_t *pte, ptent;
> +	struct page *page;
> +
> +	split_huge_page_pmd(vma, addr, pmd);
> +	if (pmd_trans_unstable(pmd))
> +		return 0;
> +
> +	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> +	arch_enter_lazy_mmu_mode();
> +	for (; addr != end; pte++, addr += PAGE_SIZE) {
> +		ptent = *pte;
> +
> +		if (pte_none(ptent))
> +			continue;

The check is redundant: all pte_none() entries are also !pte_present().

> +
> +		if (!pte_present(ptent))
> +			continue;
> +
> +		page = vm_normal_page(vma, addr, ptent);
> +		if (page && PageSwapCache(page)) {
> +			if (trylock_page(page)) {
> +				if (try_to_free_swap(page))
> +					ClearPageDirty(page);
> +				unlock_page(page);
> +			} else
> +				continue;
> +		}

Is it safe to touch non-vm_normal entries? I would suggest putting

	if (!page)
		continue;

here instead.

> +		/*
> +		 * Some of architecture(ex, PPC) don't update TLB
> +		 * with set_pte_at and tlb_remove_tlb_entry so for
> +		 * the portability, remap the pte with old|clean
> +		 * after pte clearing.
> +		 */
> +		ptent = ptep_get_and_clear_full(mm, addr, pte,
> +						tlb->fullmm);
> +		ptent = pte_mkold(ptent);
> +		ptent = pte_mkclean(ptent);
> +		set_pte_at(mm, addr, pte, ptent);
> +		tlb_remove_tlb_entry(tlb, pte, addr);
> +	}
> +	arch_leave_lazy_mmu_mode();
> +	pte_unmap_unlock(pte - 1, ptl);
> +	cond_resched();
> +	return 0;
> +}
> +
> +static void madvise_free_page_range(struct mmu_gather *tlb,
> +			     struct vm_area_struct *vma,
> +			     unsigned long addr, unsigned long end)
> +{
> +	struct madvise_free_private fp = {
> +		.vma = vma,
> +		.tlb = tlb,
> +	};
> +
> +	struct mm_walk free_walk = {
> +		.pmd_entry = madvise_free_pte_range,
> +		.mm = vma->vm_mm,
> +		.private = &fp,
> +	};
> +
> +	BUG_ON(addr >= end);
> +	tlb_start_vma(tlb, vma);
> +	walk_page_range(addr, end, &free_walk);
> +	tlb_end_vma(tlb, vma);
> +}
> +
> +static int madvise_free_single_vma(struct vm_area_struct *vma,
> +			unsigned long start_addr, unsigned long end_addr)
> +{
> +	unsigned long start, end;
> +	struct mm_struct *mm = vma->vm_mm;
> +	struct mmu_gather tlb;
> +
> +	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
> +		return -EINVAL;

VM_MIXEDMAP? VM_IO? Should it be a whitelist instead?

> +
> +	/* MADV_FREE works for only anon vma at the moment */
> +	if (vma->vm_file)
> +		return -EINVAL;
> +
> +	start = max(vma->vm_start, start_addr);
> +	if (start >= vma->vm_end)
> +		return -EINVAL;
> +	end = min(vma->vm_end, end_addr);
> +	if (end <= vma->vm_start)
> +		return -EINVAL;
> +
> +	lru_add_drain();
> +	tlb_gather_mmu(&tlb, mm, start, end);
> +	update_hiwater_rss(mm);
> +
> +	mmu_notifier_invalidate_range_start(mm, start, end);
> +	madvise_free_page_range(&tlb, vma, start, end);
> +	mmu_notifier_invalidate_range_end(mm, start, end);
> +	tlb_finish_mmu(&tlb, start, end);
> +
> +	return 0;
> +}
> +
> +static long madvise_free(struct vm_area_struct *vma,
> +			     struct vm_area_struct **prev,
> +			     unsigned long start, unsigned long end)
> +{
> +	*prev = vma;
> +	return madvise_free_single_vma(vma, start, end);
> +}
> +
>  /*
>   * Application no longer needs these pages. If the pages are dirty,
>   * it's OK to just throw them away. The app will be more careful about
> @@ -381,6 +508,13 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
>  		return madvise_remove(vma, prev, start, end);
>  	case MADV_WILLNEED:
>  		return madvise_willneed(vma, prev, start, end);
> +	case MADV_FREE:
> +		/*
> +		 * XXX: In this implementation, MADV_FREE works like
> +		 * MADV_DONTNEED on swapless system or full swap.
> +		 */
> +		if (get_nr_swap_pages() > 0)
> +			return madvise_free(vma, prev, start, end);

		/* passthrough */

>  	case MADV_DONTNEED:
>  		return madvise_dontneed(vma, prev, start, end);
>  	default:

...

> @@ -1186,6 +1210,19 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  		swp_entry_t entry = { .val = page_private(page) };
>  		pte_t swp_pte;
>  
> +		if (flags & TTU_FREE) {
> +			VM_BUG_ON_PAGE(PageSwapCache(page), page);
> +			if (dirty || PageDirty(page)) {
> +				set_pte_at(mm, address, pte, pteval);
> +				ret = SWAP_FAIL;
> +				goto out_unmap;

Hm. Again: do we really want to stop here if the caller asks for
TTU_FREE|TTU_UNMAP, or should we proceed?

> +			} else {
> +				/* It's a freeable page by MADV_FREE */
> +				dec_mm_counter(mm, MM_ANONPAGES);
> +				goto discard;
> +			}
> +		}
> +
>  		if (PageSwapCache(page)) {
>  			/*
>  			 * Store the swap location in the pte.

-- 
 Kirill A. Shutemov