On Thu, Aug 8, 2024 at 8:17 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 08.08.24 10:08, David Hildenbrand wrote:
> > On 08.08.24 10:03, David Hildenbrand wrote:
> >> On 08.08.24 09:08, Barry Song wrote:
> >>> On Thu, Aug 8, 2024 at 1:05 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
> >>>>
> >>>> From: Barry Song <v-songbaohua@xxxxxxxx>
> >>>>
> >>>> When a new anonymous mTHP is added to the rmap, we increase the count.
> >>>> We reduce the count whenever an mTHP is completely unmapped.
> >>>>
> >>>> Signed-off-by: Barry Song <v-songbaohua@xxxxxxxx>
> >>>> ---
> >>>>  Documentation/admin-guide/mm/transhuge.rst |  5 +++++
> >>>>  include/linux/huge_mm.h                    | 15 +++++++++++++--
> >>>>  mm/huge_memory.c                           |  2 ++
> >>>>  mm/rmap.c                                  |  3 +++
> >>>>  4 files changed, 23 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> >>>> index 058485daf186..715f181543f6 100644
> >>>> --- a/Documentation/admin-guide/mm/transhuge.rst
> >>>> +++ b/Documentation/admin-guide/mm/transhuge.rst
> >>>> @@ -527,6 +527,11 @@ split_deferred
> >>>>         it would free up some memory. Pages on split queue are going to
> >>>>         be split under memory pressure, if splitting is possible.
> >>>>
> >>>> +anon_num
> >>>> +       the number of anon huge pages we have in the whole system.
> >>>> +       These huge pages could be still entirely mapped and have partially
> >>>> +       unmapped and unused subpages.
> >>>> +
> >>>>  As the system ages, allocating huge pages may be expensive as the
> >>>>  system uses memory compaction to copy data around memory to free a
> >>>>  huge page for use. There are some counters in ``/proc/vmstat`` to help
> >>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >>>> index e25d9ebfdf89..294c348fe3cc 100644
> >>>> --- a/include/linux/huge_mm.h
> >>>> +++ b/include/linux/huge_mm.h
> >>>> @@ -281,6 +281,7 @@ enum mthp_stat_item {
> >>>>         MTHP_STAT_SPLIT,
> >>>>         MTHP_STAT_SPLIT_FAILED,
> >>>>         MTHP_STAT_SPLIT_DEFERRED,
> >>>> +       MTHP_STAT_NR_ANON,
> >>>>         __MTHP_STAT_COUNT
> >>>>  };
> >>>>
> >>>> @@ -291,14 +292,24 @@ struct mthp_stat {
> >>>>  #ifdef CONFIG_SYSFS
> >>>>  DECLARE_PER_CPU(struct mthp_stat, mthp_stats);
> >>>>
> >>>> -static inline void count_mthp_stat(int order, enum mthp_stat_item item)
> >>>> +static inline void mod_mthp_stat(int order, enum mthp_stat_item item, int delta)
> >>>>  {
> >>>>         if (order <= 0 || order > PMD_ORDER)
> >>>>                 return;
> >>>>
> >>>> -       this_cpu_inc(mthp_stats.stats[order][item]);
> >>>> +       this_cpu_add(mthp_stats.stats[order][item], delta);
> >>>> +}
> >>>> +
> >>>> +static inline void count_mthp_stat(int order, enum mthp_stat_item item)
> >>>> +{
> >>>> +       mod_mthp_stat(order, item, 1);
> >>>>  }
> >>>> +
> >>>>  #else
> >>>> +static inline void mod_mthp_stat(int order, enum mthp_stat_item item, int delta)
> >>>> +{
> >>>> +}
> >>>> +
> >>>>  static inline void count_mthp_stat(int order, enum mthp_stat_item item)
> >>>>  {
> >>>>  }
> >>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >>>> index 697fcf89f975..b6bc2a3791e3 100644
> >>>> --- a/mm/huge_memory.c
> >>>> +++ b/mm/huge_memory.c
> >>>> @@ -578,6 +578,7 @@ DEFINE_MTHP_STAT_ATTR(shmem_fallback_charge, MTHP_STAT_SHMEM_FALLBACK_CHARGE);
> >>>>  DEFINE_MTHP_STAT_ATTR(split, MTHP_STAT_SPLIT);
> >>>>  DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED);
> >>>>  DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED);
> >>>> +DEFINE_MTHP_STAT_ATTR(anon_num, MTHP_STAT_NR_ANON);
> >>>>
> >>>>  static struct attribute *stats_attrs[] = {
> >>>>         &anon_fault_alloc_attr.attr,
> >>>> @@ -591,6 +592,7 @@ static struct attribute *stats_attrs[] = {
> >>>>         &split_attr.attr,
> >>>>         &split_failed_attr.attr,
> >>>>         &split_deferred_attr.attr,
> >>>> +       &anon_num_attr.attr,
> >>>>         NULL,
> >>>>  };
> >>>>
> >>>> diff --git a/mm/rmap.c b/mm/rmap.c
> >>>> index 901950200957..2b722f26224c 100644
> >>>> --- a/mm/rmap.c
> >>>> +++ b/mm/rmap.c
> >>>> @@ -1467,6 +1467,7 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
> >>>>         }
> >>>>
> >>>>         __folio_mod_stat(folio, nr, nr_pmdmapped);
> >>>> +       mod_mthp_stat(folio_order(folio), MTHP_STAT_NR_ANON, 1);
> >>>>  }
> >>>>
> >>>>  static __always_inline void __folio_add_file_rmap(struct folio *folio,
> >>>> @@ -1582,6 +1583,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>>>             list_empty(&folio->_deferred_list))
> >>>>                 deferred_split_folio(folio);
> >>>>         __folio_mod_stat(folio, -nr, -nr_pmdmapped);
> >>>> +       if (folio_test_anon(folio) && !atomic_read(mapped))
> >>>
> >>> could have a risk here two processes unmap at the same time, so
> >>> they both get zero on atomic_read(mapped)? should read the value
> >>> of atomic_dec_return() instead to confirm we are the last one
> >>> doing unmap?
> >>
> >> I would appreciate if we leave the rmap out here.
> >>
> >> Can't we handle that when actually freeing the folio? folio_test_anon()
> >> is sticky until freed.
> >
> > To be clearer: we increment the counter when we set a folio anon, which
> > should indeed only happen in folio_add_new_anon_rmap(). We'll have to
> > ignore hugetlb here where we do it in hugetlb_add_new_anon_rmap().
> >
> > Then, when we free an anon folio we decrement the counter. (hugetlb
> > should clear the anon flag when an anon folio gets freed back to its
> > allocator -- likely that is already done).
> >
>
> Sorry that I am talking to myself: I'm wondering if we also have to
> adjust the counter when splitting a large folio to multiple
> smaller-but-still-large folios.

Yes, if we don't do the decrement in remove_rmap, because we could
allocate them as mTHP but free them as nr_pages small folios.

>
> --
> Cheers,
>
> David / dhildenb
>
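
As a rough sketch of the free-path accounting discussed above (this is not
part of the patch; the helper name and call site are hypothetical, while
mod_mthp_stat(), MTHP_STAT_NR_ANON, folio_test_anon() and folio_order() are
the ones used in the patch and the existing folio API):

/*
 * Sketch only: decrement the per-order counter once, when the anon folio
 * is finally freed, instead of in __folio_remove_rmap().  folio_test_anon()
 * stays set until the folio is freed, so two tasks unmapping the last pages
 * concurrently cannot both decrement, unlike the !atomic_read(mapped) check.
 * Assumes <linux/huge_mm.h> and <linux/mm.h>; mod_mthp_stat() already
 * ignores order 0, so small folios are filtered out automatically.
 */
static inline void folio_account_anon_freed(struct folio *folio)
{
	if (folio_test_anon(folio))
		mod_mthp_stat(folio_order(folio), MTHP_STAT_NR_ANON, -1);
}

Splitting an order-N anon folio into smaller-but-still-large folios would
then also need something like mod_mthp_stat(old_order, MTHP_STAT_NR_ANON, -1)
plus mod_mthp_stat(new_order, MTHP_STAT_NR_ANON, nr_new_folios), per the
point David raises above.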