On Tue, Nov 19, 2019 at 08:23:17PM +0800, Alex Shi wrote:
> This patchset moves lru_lock into the lruvec, giving each lruvec its
> own lru_lock and thus one lru_lock per memcg per node.
>
> This is the main patch: it replaces the per-node lru_lock with the
> per-memcg lruvec lock.
>
> We introduce the function lock_page_lruvec; it is the same as the
> vanilla pgdat lock when memory cgroups are unset (w/o memcg).
> Otherwise the function keeps repinning the lruvec's lock to guard
> against page->mem_cgroup changes during page migration between
> memcgs. (Thanks to Hugh Dickins and Konstantin Khlebnikov for the
> reminder on this; the core logic is the same as in their previous
> patches.)
>
> Following Daniel Jordan's suggestion, I ran 64 'dd' tasks in 32
> containers on my 2-socket, 8-core, HT box with the modified case:
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
>
> With this and the later patches, dd performance is 144MB/s versus
> 123MB/s on the vanilla kernel, a 17% improvement.
>
> Signed-off-by: Alex Shi <alex.shi@xxxxxxxxxxxxxxxxx>
> Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
> Cc: Michal Hocko <mhocko@xxxxxxxxxx>
> Cc: Vladimir Davydov <vdavydov.dev@xxxxxxxxx>
> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Cc: Roman Gushchin <guro@xxxxxx>
> Cc: Shakeel Butt <shakeelb@xxxxxxxxxx>
> Cc: Chris Down <chris@xxxxxxxxxxxxxx>
> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Cc: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
> Cc: Vlastimil Babka <vbabka@xxxxxxx>
> Cc: Qian Cai <cai@xxxxxx>
> Cc: Andrey Ryabinin <aryabinin@xxxxxxxxxxxxx>
> Cc: "Kirill A. Shutemov" <kirill.shutemov@xxxxxxxxxxxxxxx>
> Cc: "Jérôme Glisse" <jglisse@xxxxxxxxxx>
> Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> Cc: Yang Shi <yang.shi@xxxxxxxxxxxxxxxxx>
> Cc: David Rientjes <rientjes@xxxxxxxxxx>
> Cc: "Aneesh Kumar K.V" <aneesh.kumar@xxxxxxxxxxxxx>
> Cc: swkhack <swkhack@xxxxxxxxx>
> Cc: "Potyra, Stefan" <Stefan.Potyra@xxxxxxxxxxxxxx>
> Cc: Mike Rapoport <rppt@xxxxxxxxxxxxxxxxxx>
> Cc: Stephen Rothwell <sfr@xxxxxxxxxxxxxxxx>
> Cc: Colin Ian King <colin.king@xxxxxxxxxxxxx>
> Cc: Jason Gunthorpe <jgg@xxxxxxxx>
> Cc: Mauro Carvalho Chehab <mchehab+samsung@xxxxxxxxxx>
> Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx>
> Cc: Peng Fan <peng.fan@xxxxxxx>
> Cc: Nikolay Borisov <nborisov@xxxxxxxx>
> Cc: Ira Weiny <ira.weiny@xxxxxxxxx>
> Cc: Kirill Tkhai <ktkhai@xxxxxxxxxxxxx>
> Cc: Yafang Shao <laoar.shao@xxxxxxxxx>
> Cc: Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx>
> Cc: Hugh Dickins <hughd@xxxxxxxxxx>
> Cc: Tejun Heo <tj@xxxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx
> Cc: linux-mm@xxxxxxxxx
> Cc: cgroups@xxxxxxxxxxxxxxx
> ---
>  include/linux/memcontrol.h | 24 +++++++++++++++
>  include/linux/mmzone.h     |  2 ++
>  mm/compaction.c            | 67 ++++++++++++++++++++++++++++-------------
>  mm/huge_memory.c           | 15 ++++------
>  mm/memcontrol.c            | 75 +++++++++++++++++++++++++++++++++++-----------
>  mm/mlock.c                 | 31 ++++++++++---------
>  mm/mmzone.c                |  1 +
>  mm/page_idle.c             |  5 ++--
>  mm/swap.c                  | 74 +++++++++++++++++++--------------------
>  mm/vmscan.c                | 58 +++++++++++++++++------------------
>  10 files changed, 214 insertions(+), 138 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 5b86287fa069..9538253998a6 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -418,6 +418,10 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
>
>  struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);
>
> +struct lruvec *lock_page_lruvec_irq(struct page *, struct pglist_data *);
> +struct lruvec *lock_page_lruvec_irqsave(struct page *, struct pglist_data *,
> +						unsigned long*);
> +
>  struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
>
>  struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
> @@ -901,6 +905,26 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
>  	return &pgdat->__lruvec;
>  }
>
> +static inline struct lruvec *lock_page_lruvec_irq(struct page *page,
> +			struct pglist_data *pgdat)
> +{
> +	struct lruvec *lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +
> +	spin_lock_irq(&lruvec->lru_lock);
> +
> +	return lruvec;

While this works in practice, it looks wrong because it doesn't follow
the mem_cgroup_page_lruvec() rules. Please open-code
spin_lock_irq(&pgdat->__lruvec.lru_lock) instead.
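For the CONFIG_MEMCG=n case that would be something like the following
(untested sketch, assuming __lruvec stays a direct member of struct
pglist_data, as the stub above implies):

static inline struct lruvec *lock_page_lruvec_irq(struct page *page,
					struct pglist_data *pgdat)
{
	/*
	 * Without memcg every page on this node lives on
	 * pgdat->__lruvec, so take that lock directly instead of
	 * pretending to follow the mem_cgroup_page_lruvec() rules.
	 */
	spin_lock_irq(&pgdat->__lruvec.lru_lock);
	return &pgdat->__lruvec;
}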
> @@ -1246,6 +1245,46 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
>  	return lruvec;
>  }
>
> +struct lruvec *lock_page_lruvec_irq(struct page *page,
> +					struct pglist_data *pgdat)
> +{
> +	struct lruvec *lruvec;
> +
> +again:
> +	rcu_read_lock();
> +	lruvec = mem_cgroup_page_lruvec(page, pgdat);
> +	spin_lock_irq(&lruvec->lru_lock);
> +	rcu_read_unlock();

The spinlock doesn't prevent the lruvec from being freed. You deleted
the rules from the mem_cgroup_page_lruvec() documentation, but they
still apply: if the page is already !PageLRU() by the time you get
here, it could get reclaimed or migrated to another cgroup, and that
can free the memcg/lruvec. Merely having the lru_lock held does not
prevent this.

Either the page needs to be locked, or the page needs to be PageLRU
with the lru_lock held to prevent somebody else from isolating it.
Otherwise, the lruvec is not safe to use.
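As an illustration only, here is a rough, hypothetical sketch of how
the changelog's "keep repinning" idea could respect those rules,
assuming callers guarantee the page is locked or PageLRU before calling
in:

struct lruvec *lock_page_lruvec_irq(struct page *page,
					struct pglist_data *pgdat)
{
	struct lruvec *lruvec;

again:
	rcu_read_lock();
	lruvec = mem_cgroup_page_lruvec(page, pgdat);
	spin_lock_irq(&lruvec->lru_lock);

	/*
	 * A PageLRU page cannot be isolated while we hold its lru_lock,
	 * which pins its memcg and therefore this lruvec.  But if the
	 * page was moved to another memcg before we took the lock, the
	 * lruvec we locked is no longer the page's lruvec: drop it and
	 * retry while still inside the RCU read section, which keeps
	 * the stale lruvec's memory valid.
	 */
	if (unlikely(lruvec != mem_cgroup_page_lruvec(page, pgdat))) {
		spin_unlock_irq(&lruvec->lru_lock);
		rcu_read_unlock();
		goto again;
	}
	rcu_read_unlock();

	return lruvec;
}

Even then, the whole thing still rests on the PageLRU/page-lock
precondition; without it, no amount of rechecking makes the lruvec safe
to use.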