On Thu, May 25, 2023 at 9:54 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Wed, May 24, 2023 at 05:12:54PM +0800, zhaoyang.huang wrote:
> > From: Zhaoyang Huang <zhaoyang.huang@xxxxxxxxxx>
> >
> > The pages reclaimed by madvise_pageout are made of inactive and dropped from LRU
> > forcefully, which lead to the coming up refault pages possess a large refault
> > distance than it should be. These could affect the accuracy of thrashing when
> > madvise_pageout is used as a common way of memory reclaiming as ANDROID does now.
>
> This alludes to, but doesn't explain, a real world usecase.
We observe more block I/O (wait_on_page_bit_common) during app start in
the latest Android version, where user space memory reclaim changed from
in-kernel PPR to madvise_pageout. We believe this could be related to
inaccurate workingset detection.
>
> Yes, madvise_pageout() will record non-resident entries today. This
> means refault and thrash detection is on for user-driven reclaim.
>
> So why is that undesirable?
Consider an extreme scenario: the tail page of the LRU could accumulate a
given refault distance without any in-kernel reclaim, be wrongly deemed
inactive, and get less protection.
>
> Today we measure and report the cost of reclaim and memory pressure
> for physical memory shortages, cgroup limits, and user-driven cgroup
> reclaim. Why should we not do the same for madv_pageout()? If the
> userspace code that drives pageout has a bug and the result is extreme
> thrashing, wouldn't you want to know that?
Actually, pages evicted by madv_cold/pageout from the active LRU are not
marked as WORKINGSET, so they bypass thrashing accounting when they fault
back and block on I/O. I think they should be treated the same way in
terms of SetPageWorkingset and lruvec->non-resident. Please refer to my
previous patch "mm: mark folio as workingset in lru_deactivate_fn index
70e2063..4d1c14f 100644"
>
> Please explain the idea here better.
>
> > Signed-off-by: Zhaoyang Huang <zhaoyang.huang@xxxxxxxxxx>
> > ---
> >  include/linux/swap.h | 2 +-
> >  mm/madvise.c         | 4 ++--
> >  mm/vmscan.c          | 8 +++++++-
> >  3 files changed, 10 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 2787b84..0312142 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -428,7 +428,7 @@ extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
> >  extern int vm_swappiness;
> >  long remove_mapping(struct address_space *mapping, struct folio *folio);
> >
> > -extern unsigned long reclaim_pages(struct list_head *page_list);
> > +extern unsigned long reclaim_pages(struct mm_struct *mm, struct list_head *page_list);
> >  #ifdef CONFIG_NUMA
> >  extern int node_reclaim_mode;
> >  extern int sysctl_min_unmapped_ratio;
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index b6ea204..61c8d7b 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -420,7 +420,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >  huge_unlock:
> >  		spin_unlock(ptl);
> >  		if (pageout)
> > -			reclaim_pages(&page_list);
> > +			reclaim_pages(mm, &page_list);
> >  		return 0;
> >  	}
> >
> > @@ -516,7 +516,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >  	arch_leave_lazy_mmu_mode();
> >  	pte_unmap_unlock(orig_pte, ptl);
> >  	if (pageout)
> > -		reclaim_pages(&page_list);
> > +		reclaim_pages(mm, &page_list);
> >  	cond_resched();
> >
> >  	return 0;
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 20facec..048c10b 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2741,12 +2741,14 @@ static unsigned int reclaim_folio_list(struct list_head *folio_list,
> >  	return nr_reclaimed;
> >  }
> >
> > -unsigned long reclaim_pages(struct list_head *folio_list)
> > +unsigned long reclaim_pages(struct mm_struct *mm, struct list_head *folio_list)
> >  {
> >  	int nid;
> >  	unsigned int nr_reclaimed = 0;
> >  	LIST_HEAD(node_folio_list);
> >  	unsigned int noreclaim_flag;
> > +	struct lruvec *lruvec;
> > +	struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
> >
> >  	if (list_empty(folio_list))
> >  		return nr_reclaimed;
> > @@ -2764,10 +2766,14 @@ unsigned long reclaim_pages(struct list_head *folio_list)
> >  		}
> >
> >  		nr_reclaimed += reclaim_folio_list(&node_folio_list, NODE_DATA(nid));
> > +		lruvec = &memcg->nodeinfo[nid]->lruvec;
> > +		workingset_age_nonresident(lruvec, -nr_reclaimed);
> >  		nid = folio_nid(lru_to_folio(folio_list));
> >  	} while (!list_empty(folio_list));
> >
> >  	nr_reclaimed += reclaim_folio_list(&node_folio_list, NODE_DATA(nid));
> > +	lruvec = &memcg->nodeinfo[nid]->lruvec;
> > +	workingset_age_nonresident(lruvec, -nr_reclaimed);
>
> The task might have moved cgroups in between, who knows what kind of
> artifacts it will introduce if you wind back the wrong clock.
>
> If there are reclaim passes that shouldn't participate in non-resident
> tracking, that should be plumbed through the stack to __remove_mapping
> (which already has that bool reclaimed param to not record entries).