On Mon, Apr 20, 2020 at 3:11 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> The move_lock is a per-memcg lock, but the VM accounting code that
> needs to acquire it comes from the page and follows page->mem_cgroup
> under RCU protection. That means that the page becomes unlocked not
> when we drop the move_lock, but when we update page->mem_cgroup. And
> that assignment doesn't imply any memory ordering. If that pointer
> write gets reordered against the reads of the page state -
> page_mapped, PageDirty etc. the state may change while we rely on it
> being stable and we can end up corrupting the counters.
>
> Place an SMP memory barrier to make sure we're done with all page
> state by the time the new page->mem_cgroup becomes visible.
>
> Also replace the open-coded move_lock with a lock_page_memcg() to make
> it more obvious what we're serializing against.
>
> Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> ---
>  mm/memcontrol.c | 26 ++++++++++++++------------
>  1 file changed, 14 insertions(+), 12 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 5beea03dd58a..41f5ed79272e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5372,7 +5372,6 @@ static int mem_cgroup_move_account(struct page *page,
>  {
>         struct lruvec *from_vec, *to_vec;
>         struct pglist_data *pgdat;
> -       unsigned long flags;
>         unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
>         int ret;
>         bool anon;
> @@ -5399,18 +5398,13 @@ static int mem_cgroup_move_account(struct page *page,
>         from_vec = mem_cgroup_lruvec(from, pgdat);
>         to_vec = mem_cgroup_lruvec(to, pgdat);
>
> -       spin_lock_irqsave(&from->move_lock, flags);
> +       lock_page_memcg(page);
>
>         if (!anon && page_mapped(page)) {
>                 __mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages);
>                 __mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages);
>         }
>
> -       /*
> -        * move_lock grabbed above and caller set from->moving_account, so
> -        * mod_memcg_page_state will serialize updates to PageDirty.
> -        * So mapping should be stable for dirty pages.
> -        */
>         if (!anon && PageDirty(page)) {
>                 struct address_space *mapping = page_mapping(page);
>
> @@ -5426,15 +5420,23 @@ static int mem_cgroup_move_account(struct page *page,
>         }
>
>         /*
> +        * All state has been migrated, let's switch to the new memcg.
> +        *
>          * It is safe to change page->mem_cgroup here because the page
> -        * is referenced, charged, and isolated - we can't race with
> -        * uncharging, charging, migration, or LRU putback.
> +        * is referenced, charged, isolated, and locked: we can't race
> +        * with (un)charging, migration, LRU putback, or anything else
> +        * that would rely on a stable page->mem_cgroup.
> +        *
> +        * Note that lock_page_memcg is a memcg lock, not a page lock,
> +        * to save space. As soon as we switch page->mem_cgroup to a
> +        * new memcg that isn't locked, the above state can change
> +        * concurrently again. Make sure we're truly done with it.
>          */
> +       smp_mb();

You said theoretical race in the subject but the above comment convinced
me that smp_mb() is required. So, why is the race still theoretical?

>
> -       /* caller should have done css_get */
> -       page->mem_cgroup = to;
> +       page->mem_cgroup = to;  /* caller should have done css_get */
>
> -       spin_unlock_irqrestore(&from->move_lock, flags);
> +       __unlock_page_memcg(from);
>
>         ret = 0;
>
> --
> 2.26.0
>
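
To make the ordering concrete for myself, here is a minimal user-space
sketch of the pattern the changelog describes. It is not the kernel code:
C11 atomics and pthreads stand in for smp_mb() and the per-memcg
move_lock, and every name below (fake_memcg, fake_page, move_account,
stat_update) is invented for illustration only.

        #include <pthread.h>
        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stdio.h>

        struct fake_memcg {
                pthread_mutex_t move_lock;      /* per-memcg, like memcg->move_lock */
                long nr_file_mapped;            /* per-memcg counter */
        };

        struct fake_page {
                bool mapped;                            /* page state, e.g. page_mapped() */
                _Atomic(struct fake_memcg *) memcg;     /* like page->mem_cgroup */
        };

        static struct fake_memcg from_cg = { PTHREAD_MUTEX_INITIALIZER, 1 };
        static struct fake_memcg to_cg   = { PTHREAD_MUTEX_INITIALIZER, 0 };
        static struct fake_page page     = { true, &from_cg };

        /* Writer: move the page's charge, then publish the new owner. */
        static void move_account(void)
        {
                pthread_mutex_lock(&from_cg.move_lock);

                if (page.mapped) {              /* read of page state */
                        from_cg.nr_file_mapped--;
                        to_cg.nr_file_mapped++;
                }

                /*
                 * Analogue of the smp_mb() in the patch: keep the reads of
                 * page state above from being reordered past the pointer
                 * store below.  Once the new owner is visible, an updater
                 * locking to_cg may change page.mapped concurrently.
                 */
                atomic_thread_fence(memory_order_seq_cst);
                atomic_store_explicit(&page.memcg, &to_cg, memory_order_relaxed);

                pthread_mutex_unlock(&from_cg.move_lock);
        }

        /* Updater: lock the page's *current* memcg before touching its stats. */
        static void stat_update(bool mapped)
        {
                struct fake_memcg *cg;

                for (;;) {
                        cg = atomic_load_explicit(&page.memcg, memory_order_acquire);
                        pthread_mutex_lock(&cg->move_lock);
                        if (cg == atomic_load_explicit(&page.memcg,
                                                       memory_order_relaxed))
                                break;          /* still the owner, lock is valid */
                        pthread_mutex_unlock(&cg->move_lock);   /* moved, retry */
                }

                if (page.mapped != mapped) {
                        page.mapped = mapped;
                        cg->nr_file_mapped += mapped ? 1 : -1;
                }
                pthread_mutex_unlock(&cg->move_lock);
        }

        int main(void)
        {
                move_account();
                stat_update(false);
                printf("from=%ld to=%ld\n",
                       from_cg.nr_file_mapped, to_cg.nr_file_mapped);
                return 0;
        }

The fence before the pointer store pairs with the updater's load-lock-recheck
loop: the updater is only guaranteed a stable page once it holds the lock of
the memcg the page currently points to, which is exactly why the page is
"unlocked" the instant page->mem_cgroup is republished rather than when the
move_lock is dropped.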