On Thu, Nov 12, 2015 at 11:44:53AM -0800, Shaohua Li wrote:
> On Thu, Nov 12, 2015 at 01:33:13PM +0900, Minchan Kim wrote:
> > The hotness of a MADV_FREEed page is very arguable. Some think it's
> > hot while others say it's cold.
> > 
> > Quote from Shaohua
> > "
> > My main concern is the policy for how we should treat the FREE pages.
> > Moving them to the inactive LRU is definitely a good start; I'm
> > wondering if it's enough. MADV_FREE increases memory pressure and
> > causes unnecessary reclaim because of the lazy memory free. While
> > MADV_FREE is intended to be a better replacement for MADV_DONTNEED,
> > MADV_DONTNEED doesn't have the memory pressure issue since it frees
> > memory immediately. So I hope MADV_FREE doesn't have an impact on
> > memory pressure either. I'm thinking of adding an extra LRU list and
> > watermark for this to make sure FREE pages can be freed before
> > system-wide page reclaim. As you said, this is arguable, but I hope
> > we can discuss this issue more.
> > "
> > 
> > Quote from me
> > "
> > It seems the divergence comes from treating MADV_FREE as a
> > *replacement* for MADV_DONTNEED. But I don't think it is. If we could
> > discard MADV_FREEed pages *anytime* I would agree, but that's not
> > true, because a page may be in dirty state by the time the VM wants
> > to reclaim it.
> > 
> > I'm also against your suggestion to discard FREEed pages before
> > system-wide page reclaim, because the system may have lots of clean,
> > cold page cache or anonymous pages; in that case, reclaiming those
> > would be better. Yes, it's really workload-dependent, so we might
> > need some heuristic, which is normally what we want to avoid.
> > 
> > Having said that, I agree with you that we could do better than
> > deactivation, and frankly speaking, I'm thinking of another LRU list
> > (tentatively named the "ezreclaim LRU list"). What I have in mind is
> > to age (anon|file|ez) fairly. IOW, I want to percolate ez-LRU
> > reclaiming into get_scan_count. When MADV_FREE is called, we could
> > move hinted pages from the anon LRU to the ez LRU, and then, if the
> > VM finds it cannot discard a page on the ez LRU, it could promote it
> > to the active anon LRU, which would be a very natural aging concept
> > because it means someone touched the page recently. With that, I
> > don't want to bias either side or add some knob for tuning the
> > heuristic; let's rely on the VM's common, fair aging scheme.
> > "
> > 
> > Quote from Johannes
> > "
> > thread 1:
> > Even if we're wrong about the aging of those MADV_FREE pages, their
> > contents are invalidated; they can be discarded freely, and restoring
> > them is a mere GFP_ZERO allocation. All other anonymous pages have to
> > be written to disk, and potentially be read back.
> > 
> > [ Arguably, MADV_FREE pages should even be reclaimed before inactive
> > page cache. It's the same cost to discard both types of pages, but
> > restoring page cache involves IO. ]
> > 
> > It probably makes sense to stop thinking about them as anonymous
> > pages entirely at this point when it comes to aging. They're really
> > not. The LRU lists are split to differentiate access patterns and the
> > cost of page stealing (and restoring). From that angle, MADV_FREE
> > pages really have nothing in common with in-use anonymous pages, and
> > so they shouldn't be on the same LRU list.
> > 
> > thread 2:
> > What about them is hot? They contain garbage; you have to write to
> > them before you can use them.
> > Granted, you might have to refetch
> > cachelines if you don't do cacheline-aligned populating writes, but
> > you can do a lot of them before it's more expensive than doing IO.
> > "
> > 
> > Quote from Daniel
> > "
> > thread 1:
> > Keep in mind that this is memory the kernel wouldn't be getting back
> > at all if the allocator wasn't going out of its way to purge it, and
> > allocators aren't going to go out of their way to purge it if it
> > means the kernel is going to steal the pages when there isn't
> > actually memory pressure.
> > 
> > An allocator would be using MADV_DONTNEED if it didn't expect that
> > the pages were going to be used again shortly. MADV_FREE indicates
> > that it has time to inform the kernel that they're unused, but they
> > could still be very hot.
> > 
> > thread 2:
> > It's hot because applications churn through memory via the allocator.
> > 
> > Drop the pages and the application is now churning through page
> > faults and zeroing rather than simply reusing memory. It's not
> > something that may happen; it *will* happen. A page in the page cache
> > *may* be reused, but often won't be, especially when the I/O patterns
> > don't line up well with the way it works.
> > 
> > The whole point of the feature is not requiring the allocator to have
> > elaborate mechanisms for aging pages and throttling purging. That
> > ends up resulting in lots of memory held by userspace where the
> > kernel can't reclaim it under memory pressure. If it's dropped before
> > page cache, it isn't going to be able to replace any of that logic in
> > allocators.
> > 
> > The page cache is speculative. Page caching by allocators is not
> > really speculative. Using MADV_FREE on the pages at all is
> > speculative. The memory is probably going to be reused fairly soon
> > (unless the process exits, and then it doesn't matter), but purging
> > will end up reducing memory usage for the portions that aren't.
> > 
> > It would be a different story for a full unpinning/pinning feature,
> > since that would have other use cases (speculative caches), but this
> > is really only useful in allocators.
> > "
> > 
> > You can read the whole thread at https://lkml.org/lkml/2015/11/4/51
> > 
> > Given that the issue is arguable and there is no single right answer,
> > I think we should provide a knob, "lazyfreeness" (I hope someone
> > suggests a better name).
> > 
> > It works like swappiness: higher values discard MADV_FREE pages more
> > aggressively. If memory pressure happens while the system is still
> > scanning at DEF_PRIORITY (e.g., there are clean, cold caches to
> > drop), the VM doesn't discard any hinted pages until the scanning
> > priority is increased.
> > 
> > Once memory pressure is higher (i.e., the priority is no longer
> > DEF_PRIORITY), it scans
> > 
> >   nr_to_reclaim * (DEF_PRIORITY - priority) * lazyfreeness(default: 20) / 50
> > 
> > If the system is low on free memory and file cache, it starts to
> > discard MADV_FREEed pages unconditionally, even if the user set
> > lazyfreeness to 0.
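
To make the arithmetic above concrete, here is a standalone userspace
sketch of the scan-target calculation get_scan_count() performs in the
patch below. It is illustrative only: lzfree_scan_target() is a made-up
helper, DEF_PRIORITY is 12 in the kernel, and 32 stands in for
SWAP_CLUSTER_MAX.

#include <stdio.h>

#define DEF_PRIORITY 12

/* Mirrors the scan_lzfree math in get_scan_count() (sketch only). */
static unsigned long lzfree_scan_target(unsigned long nr_to_reclaim,
                                        int priority, int lazyfreeness)
{
        /* Zero at DEF_PRIORITY; grows as reclaim gets more desperate. */
        unsigned long scan = nr_to_reclaim * (DEF_PRIORITY - priority);

        return scan * lazyfreeness / 50;
}

int main(void)
{
        printf("prio 12: %lu\n", lzfree_scan_target(32, 12, 20)); /* 0  */
        printf("prio 10: %lu\n", lzfree_scan_target(32, 10, 20)); /* 25 */
        printf("prio  6: %lu\n", lzfree_scan_target(32, 6, 20));  /* 76 */
        return 0;
}

With the patch applied, the knob is exposed globally as
/proc/sys/vm/lazyfreeness (see the sysctl table below) and per memcg as
the memory.lazyfreeness control file.
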
> > 
> > Signed-off-by: Minchan Kim <minchan@xxxxxxxxxx>
> > ---
> >  Documentation/sysctl/vm.txt | 13 +++++++++
> >  drivers/base/node.c         |  4 +--
> >  fs/proc/meminfo.c           |  4 +--
> >  include/linux/memcontrol.h  |  1 +
> >  include/linux/mmzone.h      |  9 +++---
> >  include/linux/swap.h        | 15 ++++++++++
> >  kernel/sysctl.c             |  9 ++++++
> >  mm/memcontrol.c             | 32 +++++++++++++++++++++-
> >  mm/vmscan.c                 | 67 ++++++++++++++++++++++++++++-----------------
> >  mm/vmstat.c                 |  2 +-
> >  10 files changed, 121 insertions(+), 35 deletions(-)
> > 
> > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> > index a4482fceacec..c1dc63381f2c 100644
> > --- a/Documentation/sysctl/vm.txt
> > +++ b/Documentation/sysctl/vm.txt
> > @@ -56,6 +56,7 @@ files can be found in mm/swap.c.
> >  - percpu_pagelist_fraction
> >  - stat_interval
> >  - swappiness
> > +- lazyfreeness
> >  - user_reserve_kbytes
> >  - vfs_cache_pressure
> >  - zone_reclaim_mode
> > @@ -737,6 +738,18 @@ The default value is 60.
> >  
> >  ==============================================================
> >  
> > +lazyfreeness
> > +
> > +This control is used to define how aggressively the kernel will discard
> > +MADV_FREE hinted pages. Higher values will increase aggressiveness,
> > +lower values decrease the amount of discarding. A value of 0 instructs
> > +the kernel not to initiate discarding until the amount of free and
> > +file-backed pages is less than the high watermark in a zone.
> > +
> > +The default value is 20.
> > +
> > +==============================================================
> > +
> >  - user_reserve_kbytes
> >  
> >  When overcommit_memory is set to 2, "never overcommit" mode, reserve
> > diff --git a/drivers/base/node.c b/drivers/base/node.c
> > index f7a1f2107b43..3b0bf1b78b2e 100644
> > --- a/drivers/base/node.c
> > +++ b/drivers/base/node.c
> > @@ -69,8 +69,8 @@ static ssize_t node_read_meminfo(struct device *dev,
> >                 "Node %d Inactive(anon): %8lu kB\n"
> >                 "Node %d Active(file):   %8lu kB\n"
> >                 "Node %d Inactive(file): %8lu kB\n"
> > -               "Node %d Unevictable:    %8lu kB\n"
> >                 "Node %d LazyFree:       %8lu kB\n"
> > +               "Node %d Unevictable:    %8lu kB\n"
> >                 "Node %d Mlocked:        %8lu kB\n",
> >                 nid, K(i.totalram),
> >                 nid, K(i.freeram),
> > @@ -83,8 +83,8 @@ static ssize_t node_read_meminfo(struct device *dev,
> >                 nid, K(node_page_state(nid, NR_INACTIVE_ANON)),
> >                 nid, K(node_page_state(nid, NR_ACTIVE_FILE)),
> >                 nid, K(node_page_state(nid, NR_INACTIVE_FILE)),
> > -               nid, K(node_page_state(nid, NR_UNEVICTABLE)),
> >                 nid, K(node_page_state(nid, NR_LZFREE)),
> > +               nid, K(node_page_state(nid, NR_UNEVICTABLE)),
> >                 nid, K(node_page_state(nid, NR_MLOCK)));
> >  
> >  #ifdef CONFIG_HIGHMEM
> > diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> > index 3444f7c4e0b6..f47e6a5aa2e5 100644
> > --- a/fs/proc/meminfo.c
> > +++ b/fs/proc/meminfo.c
> > @@ -101,8 +101,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> >          "Inactive(anon): %8lu kB\n"
> >          "Active(file):   %8lu kB\n"
> >          "Inactive(file): %8lu kB\n"
> > -        "Unevictable:    %8lu kB\n"
> >          "LazyFree:       %8lu kB\n"
> > +        "Unevictable:    %8lu kB\n"
> >          "Mlocked:        %8lu kB\n"
> >  #ifdef CONFIG_HIGHMEM
> >          "HighTotal:      %8lu kB\n"
> > @@ -159,8 +159,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> >          K(pages[LRU_INACTIVE_ANON]),
> >          K(pages[LRU_ACTIVE_FILE]),
> >          K(pages[LRU_INACTIVE_FILE]),
> > -        K(pages[LRU_UNEVICTABLE]),
> >          K(pages[LRU_LZFREE]),
> > +        K(pages[LRU_UNEVICTABLE]),
> >          K(global_page_state(NR_MLOCK)),
> >  #ifdef CONFIG_HIGHMEM
> >          K(i.totalhigh),
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 3e3318ddfc0e..5522ff733506 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -210,6 +210,7 @@ struct mem_cgroup {
> >          int             under_oom;
> >  
> >          int     swappiness;
> > +        int     lzfreeness;
> >          /* OOM-Killer disable */
> >          int     oom_kill_disable;
> >  
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 1aaa436da0d5..cca514a9701d 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -120,8 +120,8 @@ enum zone_stat_item {
> >          NR_ACTIVE_ANON,         /*  "     "     "   "       "  */
> >          NR_INACTIVE_FILE,       /*  "     "     "   "       "  */
> >          NR_ACTIVE_FILE,         /*  "     "     "   "       "  */
> > -        NR_UNEVICTABLE,         /*  "     "     "   "       "  */
> >          NR_LZFREE,              /*  "     "     "   "       "  */
> > +        NR_UNEVICTABLE,         /*  "     "     "   "       "  */
> >          NR_MLOCK,               /* mlock()ed pages found and moved off LRU */
> >          NR_ANON_PAGES,          /* Mapped anonymous pages */
> >          NR_FILE_MAPPED,         /* pagecache pages mapped into pagetables.
> > @@ -179,14 +179,15 @@ enum lru_list {
> >          LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
> >          LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
> >          LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
> > -        LRU_UNEVICTABLE,
> >          LRU_LZFREE,
> > +        LRU_UNEVICTABLE,
> >          NR_LRU_LISTS
> >  };
> >  
> >  #define for_each_lru(lru) for (lru = 0; lru < NR_LRU_LISTS; lru++)
> > -
> > -#define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++)
> > +#define for_each_anon_file_lru(lru) \
> > +        for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++)
> > +#define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_LZFREE; lru++)
> >  
> >  static inline int is_file_lru(enum lru_list lru)
> >  {
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index f0310eeab3ec..73bcdc9d0e88 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -330,6 +330,7 @@ extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> >                                                  unsigned long *nr_scanned);
> >  extern unsigned long shrink_all_memory(unsigned long nr_pages);
> >  extern int vm_swappiness;
> > +extern int vm_lazyfreeness;
> >  extern int remove_mapping(struct address_space *mapping, struct page *page);
> >  extern unsigned long vm_total_pages;
> >  
> > @@ -361,11 +362,25 @@ static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
> >          return memcg->swappiness;
> >  }
> >  
> > +static inline int mem_cgroup_lzfreeness(struct mem_cgroup *memcg)
> > +{
> > +        /* root ? */
> > +        if (mem_cgroup_disabled() || !memcg->css.parent)
> > +                return vm_lazyfreeness;
> > +
> > +        return memcg->lzfreeness;
> > +}
> > +
> >  #else
> >  static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
> >  {
> >          return vm_swappiness;
> >  }
> > +
> > +static inline int mem_cgroup_lzfreeness(struct mem_cgroup *mem)
> > +{
> > +        return vm_lazyfreeness;
> > +}
> >  #endif
> >  #ifdef CONFIG_MEMCG_SWAP
> >  extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry);
> > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > index e69201d8094e..2496b10c08e9 100644
> > --- a/kernel/sysctl.c
> > +++ b/kernel/sysctl.c
> > @@ -1268,6 +1268,15 @@ static struct ctl_table vm_table[] = {
> >                  .extra1         = &zero,
> >                  .extra2         = &one_hundred,
> >          },
> > +        {
> > +                .procname       = "lazyfreeness",
> > +                .data           = &vm_lazyfreeness,
> > +                .maxlen         = sizeof(vm_lazyfreeness),
> > +                .mode           = 0644,
> > +                .proc_handler   = proc_dointvec_minmax,
> > +                .extra1         = &zero,
> > +                .extra2         = &one_hundred,
> > +        },
> >  #ifdef CONFIG_HUGETLB_PAGE
> >          {
> >                  .procname       = "nr_hugepages",
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 1dc599ce1bcb..5bdbe2a20dc0 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -108,8 +108,8 @@ static const char * const mem_cgroup_lru_names[] = {
> >          "active_anon",
> >          "inactive_file",
> >          "active_file",
> > -        "unevictable",
> >          "lazyfree",
> > +        "unevictable",
> >  };
> >  
> >  #define THRESHOLDS_EVENTS_TARGET 128
> > @@ -3288,6 +3288,30 @@ static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
> >          return 0;
> >  }
> >  
> > +static u64 mem_cgroup_lzfreeness_read(struct cgroup_subsys_state *css,
> > +                                      struct cftype *cft)
> > +{
> > +        struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> > +
> > +        return mem_cgroup_lzfreeness(memcg);
> > +}
> > +
> > +static int mem_cgroup_lzfreeness_write(struct cgroup_subsys_state *css,
> > +                                       struct cftype *cft, u64 val)
> > +{
> > +        struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> > +
> > +        if (val > 100)
> > +                return -EINVAL;
> > +
> > +        if (css->parent)
> > +                memcg->lzfreeness = val;
> > +        else
> > +                vm_lazyfreeness = val;
> > +
> > +        return 0;
> > +}
> > +
> >  static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
> >  {
> >          struct mem_cgroup_threshold_ary *t;
> > @@ -4085,6 +4109,11 @@ static struct cftype mem_cgroup_legacy_files[] = {
> >                  .write_u64 = mem_cgroup_swappiness_write,
> >          },
> >          {
> > +                .name = "lazyfreeness",
> > +                .read_u64 = mem_cgroup_lzfreeness_read,
> > +                .write_u64 = mem_cgroup_lzfreeness_write,
> > +        },
> > +        {
> >                  .name = "move_charge_at_immigrate",
> >                  .read_u64 = mem_cgroup_move_charge_read,
> >                  .write_u64 = mem_cgroup_move_charge_write,
> > @@ -4305,6 +4334,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
> >                  memcg->use_hierarchy = parent->use_hierarchy;
> >                  memcg->oom_kill_disable = parent->oom_kill_disable;
> >                  memcg->swappiness = mem_cgroup_swappiness(parent);
> > +                memcg->lzfreeness = mem_cgroup_lzfreeness(parent);
> >  
> >                  if (parent->use_hierarchy) {
> >                          page_counter_init(&memcg->memory, &parent->memory);
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index cd65db9d3004..f1abc8a6ca31 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -141,6 +141,10 @@ struct scan_control {
> >   */
> >  int vm_swappiness = 60;
> >  /*
> > + * From 0 .. 100. Higher means more lazy freeing.
> > + */
> > +int vm_lazyfreeness = 20;
> > +/*
> >   * The total number of pages which are beyond the high watermark within all
> >   * zones.
> >   */
> > @@ -2012,10 +2016,11 @@ enum scan_balance {
> >   *
> >   * nr[0] = anon inactive pages to scan; nr[1] = anon active pages to scan
> >   * nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
> > + * nr[4] = lazy free pages to scan;
> >   */
> >  static void get_scan_count(struct lruvec *lruvec, int swappiness,
> > -                           struct scan_control *sc, unsigned long *nr,
> > -                           unsigned long *lru_pages)
> > +                           int lzfreeness, struct scan_control *sc,
> > +                           unsigned long *nr, unsigned long *lru_pages)
> >  {
> >          struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> >          u64 fraction[2];
> > @@ -2023,12 +2028,13 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
> >          struct zone *zone = lruvec_zone(lruvec);
> >          unsigned long anon_prio, file_prio;
> >          enum scan_balance scan_balance;
> > -        unsigned long anon, file;
> > +        unsigned long anon, file, lzfree;
> >          bool force_scan = false;
> >          unsigned long ap, fp;
> >          enum lru_list lru;
> >          bool some_scanned;
> >          int pass;
> > +        unsigned long scan_lzfree = 0;
> >  
> >          /*
> >           * If the zone or memcg is small, nr[l] can be 0. This
> > @@ -2166,7 +2172,7 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
> >          /* Only use force_scan on second pass. */
> >          for (pass = 0; !some_scanned && pass < 2; pass++) {
> >                  *lru_pages = 0;
> > -                for_each_evictable_lru(lru) {
> > +                for_each_anon_file_lru(lru) {
> >                          int file = is_file_lru(lru);
> >                          unsigned long size;
> >                          unsigned long scan;
> > @@ -2212,6 +2218,28 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
> >                          some_scanned |= !!scan;
> >                  }
> >          }
> > +
> > +        lzfree = get_lru_size(lruvec, LRU_LZFREE);
> > +        if (lzfree) {
> > +                scan_lzfree = sc->nr_to_reclaim *
> > +                                (DEF_PRIORITY - sc->priority);
> 
> scan_lzfree == 0 if sc->priority == DEF_PRIORITY, is this intended?

Yes, it's intended: as described above, the VM shouldn't discard any
hinted pages while reclaim still runs at DEF_PRIORITY; discarding starts
once the priority rises (or via the watermark fallback just below).

> > +                scan_lzfree = div64_u64(scan_lzfree *
> > +                                lzfreeness, 50);
> > +                if (!scan_lzfree) {
> > +                        unsigned long zonefile, zonefree;
> > +
> > +                        zonefree = zone_page_state(zone, NR_FREE_PAGES);
> > +                        zonefile = zone_page_state(zone, NR_ACTIVE_FILE) +
> > +                                zone_page_state(zone, NR_INACTIVE_FILE);
> > +                        if (unlikely(zonefile + zonefree <=
> > +                                        high_wmark_pages(zone))) {
> > +                                scan_lzfree = get_lru_size(lruvec,
> > +                                                LRU_LZFREE) >> sc->priority;
> > +                        }
> > +                }
> > +        }
> > +
> > +        nr[LRU_LZFREE] = min(scan_lzfree, lzfree);
> >  }

> Looks there is no setting to only reclaim lazyfree pages. Could we
> have an option for this? It's legit we don't want to trash page cache
> because of lazyfree memory.

Once we introduce the knob, it should be doable. I will do it in the
next spin. Thanks for the review!
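
P.S. For anyone joining the thread here, the allocator pattern Daniel
describes above looks roughly like the sketch below. It is an
illustrative userspace example, not part of this patch, and it assumes
a kernel and libc that already expose MADV_FREE.

#include <string.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 1 << 20;
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED)
                return 1;

        memset(buf, 0xaa, len);         /* allocation in active use */

        /*
         * free() path: the contents are disposable from now on, but
         * the mapping is kept. The kernel may discard the pages lazily
         * under memory pressure; until then, reuse costs nothing.
         */
        madvise(buf, len, MADV_FREE);

        /*
         * Later malloc() path: writing again cancels the hint for the
         * touched pages. If they were never reclaimed, there is no
         * fault and no zeroing.
         */
        memset(buf, 0xbb, len);

        munmap(buf, len);
        return 0;
}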