Re: [PATCH] mm: account lazily freed anon pages in NR_FILE_PAGES

Yafang Shao <laoar.shao@xxxxxxxxx> · Fri, 6 Nov 2020 10:09:16 +0800

On Fri, Nov 6, 2020 at 12:24 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Thu, Nov 05, 2020 at 09:10:12PM +0800, Yafang Shao wrote:
> > We use the memory utilization (Used / Total) to monitor memory
> > pressure. If it is too high, the system is likely to OOM sooner or
> > later when swap is off, so we make adjustments to the system.
> >
> > However, this method has been broken since MADV_FREE was introduced,
> > because lazily freed anonymous pages can be reclaimed under memory
> > pressure while they are still accounted in NR_ANON_MAPPED.
> >
> > Furthermore, since commit f7ad2a6cb9f7 ("mm: move MADV_FREE pages into
> > LRU_INACTIVE_FILE list"), lazily freed anonymous pages are moved from
> > the anon LRU list to the file LRU list. As a result,
> > (Inactive(file) + Active(file)) may be much larger than Cached in
> > /proc/meminfo, which confuses our users.
> >
> > So we should account lazily freed anonymous pages in NR_FILE_PAGES
> > as well.
>
> What about the share of pages that have been reused? After all, the
> idea behind deferred reclaim is cheap reuse of already allocated and
> faulted in pages.
>

I missed the reuse case. Thanks for the explanation.

> Anywhere between 0% and 100% of MADV_FREEd pages may be dirty and need
> swap-out to reclaim. That means even after this patch, your formula
> would still have an error margin of 100%.
>
> The tradeoff with saving the reuse fault and relying on the MMU is
> that the kernel simply *cannot do* lazy free accounting. Userspace
> needs to do it. E.g. if a malloc implementation or similar uses
> MADV_FREE, it has to keep track of what is and isn't used and make
> those stats available.
>
> If that's not practical,

That is not practical. The process that uses MADV_FREE can keep track
of it, but other processes, such as monitoring tools, have no easy way
to do so. We shouldn't push that burden onto userspace.

> I don't see an alternative to trapping minor
> faults upon page reuse, eating the additional TLB flush, and doing the
> accounting properly inside the kernel.
>

I will try to analyze the details and find out whether there is a way
to track it in the kernel.

> > @@ -1312,8 +1312,13 @@ static void page_remove_anon_compound_rmap(struct page *page)
> >       if (unlikely(PageMlocked(page)))
> >               clear_page_mlock(page);
> >
> > -     if (nr)
> > -             __mod_lruvec_page_state(page, NR_ANON_MAPPED, -nr);
> > +     if (nr) {
> > +             if (PageLRU(page) && PageAnon(page) && !PageSwapBacked(page) &&
> > +                 !PageSwapCache(page) && !PageUnevictable(page))
> > +                     __mod_lruvec_page_state(page, NR_FILE_PAGES, -nr);
> > +             else
> > +                     __mod_lruvec_page_state(page, NR_ANON_MAPPED, -nr);
>
> I don't think this would work. The page can be temporarily off-LRU for
> compaction, migration, reclaim etc. and then you'd misaccount it here.

Right, thanks for the clarification.

-- 
Thanks
Yafang