Re: [PATCH v2] mm: add per-order mTHP alloc_success and alloc_fail counters


 



On Tue, Apr 2, 2024 at 9:58 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>
> On 28/03/2024 09:51, Barry Song wrote:
> > From: Barry Song <v-songbaohua@xxxxxxxx>
> >
> > Profiling a system blindly with mTHP has become challenging due
> > to the lack of visibility into its operations. Presenting the
> > success rate of mTHP allocations appears to be a pressing need.
> >
> > Recently, I've been experiencing significant difficulty debugging
> > performance improvements and regressions without these figures.
> > It's crucial for us to understand the true effectiveness of
> > mTHP in real-world scenarios, especially in systems with
> > fragmented memory.
> >
> > This patch sets up the framework for per-order mTHP counters,
> > starting with the introduction of alloc_success and alloc_fail
> > counters.  Incorporating additional counters should now be
> > straightforward as well.
> >
> > The initial two unsigned longs for each event are unused, given
> > that order-0 and order-1 are not mTHP. Nonetheless, this refinement
> > improves code clarity.
>
> Overall, this approach looks good to me - thanks for doing this!

Thank you for reviewing!

>
> >
> > Signed-off-by: Barry Song <v-songbaohua@xxxxxxxx>
> > ---
> >  -v2:
> >  * move to sysfs and provide per-order counters; David, Ryan, Willy
> >  -v1:
> >  https://lore.kernel.org/linux-mm/20240326030103.50678-1-21cnbao@xxxxxxxxx/
> >
> >  include/linux/huge_mm.h | 17 +++++++++++++
> >  mm/huge_memory.c        | 54 +++++++++++++++++++++++++++++++++++++++++
> >  mm/memory.c             |  3 +++
> >  3 files changed, 74 insertions(+)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index e896ca4760f6..27fa26a22a8f 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -264,6 +264,23 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
> >                                         enforce_sysfs, orders);
> >  }
> >
> > +enum thp_event_item {
> > +     THP_ALLOC_SUCCESS,
> > +     THP_ALLOC_FAIL,
> > +     NR_THP_EVENT_ITEMS
> > +};
> > +
> > +struct thp_event_state {
> > +     unsigned long event[PMD_ORDER + 1][NR_THP_EVENT_ITEMS];
> > +};
> > +
> > +DECLARE_PER_CPU(struct thp_event_state, thp_event_states);
> > +
> > +static inline void count_thp_event(int order, enum thp_event_item item)
>
> We previously concluded that we will likely want a "currently allocated THPs"
> counter, which will need both increment and decrement. So perhaps

I intend to handle this via incremental patches, but it's unlikely to
happen this month due to my packed schedule. At present, my top priority
for counting is the allocation fallback ratio. After that, it would
likely be the SWAP allocation and fallback ratio, to determine whether
the swapfile is highly fragmented.

> "thp_event_inc()" (and eventual "thp_event_dec()" are more appropriate? And
> perhaps "event" isn't the right description either since it implies always
> counting upwards (to me at least). Although I guess you have borrowed from the
> vmstat pattern? That file has some instantaneous counters, so how does it decrement?
>

Counting represents just one aspect of vmstat. In this scenario, the
values are consistently increasing. For metrics that can fluctuate in
either direction, an entirely different mechanism is employed; these are
typically managed through global atomic operations.

/*
 * Zone and node-based page accounting with per cpu differentials.
 */
extern atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS];
extern atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS];
extern atomic_long_t vm_numa_event[NR_VM_NUMA_EVENT_ITEMS];

void __mod_zone_page_state(struct zone *, enum zone_stat_item item, long);
void __inc_zone_page_state(struct page *, enum zone_stat_item);
void __dec_zone_page_state(struct page *, enum zone_stat_item);

void __mod_node_page_state(struct pglist_data *, enum node_stat_item item, long);
void __inc_node_page_state(struct page *, enum node_stat_item);
void __dec_node_page_state(struct page *, enum node_stat_item);

static inline void zone_page_state_add(long x, struct zone *zone,
                                 enum zone_stat_item item)
{
        atomic_long_add(x, &zone->vm_stat[item]);
        atomic_long_add(x, &vm_zone_stat[item]);
}

Counting is likely most efficiently handled with per-CPU copies and a summation
function to aggregate values from all CPUs.

> > +{
> > +     this_cpu_inc(thp_event_states.event[order][item]);
>
> Could we declare as "PMD_ORDER - 1" then do [order - 2] here? That would save 16
> bytes per CPU. (8K in a system with 512 CPUs). Possibly with a order_index()
> helper since the *_show() functions will need this too.

I actually noted this in the commit message:

"The initial two unsigned longs for each event are unused, given
 that order-0 and order-1 are not mTHP. Nonetheless, this refinement
 improves code clarity."

However, if conserving memory is the higher priority, I can use
PMD_ORDER - 1 in version 3.

>
> > +}
> > +
> >  #define transparent_hugepage_use_zero_page()                         \
> >       (transparent_hugepage_flags &                                   \
> >        (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG))
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 1683de78c313..addd093d8410 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -526,6 +526,52 @@ static const struct kobj_type thpsize_ktype = {
> >       .sysfs_ops = &kobj_sysfs_ops,
> >  };
> >
> > +DEFINE_PER_CPU(struct thp_event_state, thp_event_states) = {{{0}}};
> > +
> > +static unsigned long sum_thp_events(int order, enum thp_event_item item)
> > +{
> > +     unsigned long sum = 0;
> > +     int cpu;
> > +
> > +     for_each_online_cpu(cpu) {
> > +             struct thp_event_state *this = &per_cpu(thp_event_states, cpu);
> > +
> > +             sum += this->event[order][item];
> > +     }
> > +
> > +     return sum;
> > +}
> > +
> > +static ssize_t alloc_success_show(struct kobject *kobj,
> > +                                struct kobj_attribute *attr, char *buf)
> > +{
> > +     int order = to_thpsize(kobj)->order;
> > +
> > +     return sysfs_emit(buf, "%lu\n", sum_thp_events(order, THP_ALLOC_SUCCESS));
> > +}
> > +
> > +static ssize_t alloc_fail_show(struct kobject *kobj,
> > +                                struct kobj_attribute *attr, char *buf)
> > +{
> > +     int order = to_thpsize(kobj)->order;
> > +
> > +     return sysfs_emit(buf, "%lu\n", sum_thp_events(order, THP_ALLOC_FAIL));
> > +}
> > +
> > +static struct kobj_attribute alloc_success_attr = __ATTR_RO(alloc_success);
> > +static struct kobj_attribute alloc_fail_attr = __ATTR_RO(alloc_fail);
> > +
> > +static struct attribute *stats_attrs[] = {
> > +     &alloc_success_attr.attr,
> > +     &alloc_fail_attr.attr,
> > +     NULL,
> > +};
> > +
> > +static struct attribute_group stats_attr_group = {
> > +     .name = "stats",
> > +     .attrs = stats_attrs,
> > +};
> > +
> >  static struct thpsize *thpsize_create(int order, struct kobject *parent)
> >  {
> >       unsigned long size = (PAGE_SIZE << order) / SZ_1K;
> > @@ -549,6 +595,12 @@ static struct thpsize *thpsize_create(int order, struct kobject *parent)
> >               return ERR_PTR(ret);
> >       }
> >
> > +     ret = sysfs_create_group(&thpsize->kobj, &stats_attr_group);
> > +     if (ret) {
> > +             kobject_put(&thpsize->kobj);
> > +             return ERR_PTR(ret);
> > +     }
> > +
> >       thpsize->order = order;
> >       return thpsize;
> >  }
> > @@ -1050,8 +1102,10 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> >       folio = vma_alloc_folio(gfp, HPAGE_PMD_ORDER, vma, haddr, true);
> >       if (unlikely(!folio)) {
> >               count_vm_event(THP_FAULT_FALLBACK);
> > +             count_thp_event(HPAGE_PMD_ORDER, THP_ALLOC_FAIL);
> >               return VM_FAULT_FALLBACK;
> >       }
> > +     count_thp_event(HPAGE_PMD_ORDER, THP_ALLOC_SUCCESS);
> >       return __do_huge_pmd_anonymous_page(vmf, &folio->page, gfp);
> >  }
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 984b138f85b4..c9c1031c2ecb 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4365,7 +4365,10 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> >                       }
> >                       folio_throttle_swaprate(folio, gfp);
> >                       clear_huge_page(&folio->page, vmf->address, 1 << order);
> > +                     count_thp_event(order, THP_ALLOC_SUCCESS);
> >                       return folio;
> > +             } else {
>
> Do we need this else, given the last statement in the "if" clause is return?

I included this "else" statement because we encounter a scenario where
we "goto next" even if allocation succeeds.

                folio = vma_alloc_folio(gfp, order, vma, addr, true);
                if (folio) {
                        if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
                                folio_put(folio);
                                goto next;
                        }

After revisiting the code a second time, I noticed that the SUCCESS
counter is incremented after the "goto" point, so I realize that I won't
need this "else" clause.

                folio = vma_alloc_folio(gfp, order, vma, addr, true);
                if (folio) {
                        if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
                                folio_put(folio);
                                goto next;
                        }
                        folio_throttle_swaprate(folio, gfp);
                        clear_huge_page(&folio->page, vmf->address, 1 << order);
                        count_thp_event(order, THP_ALLOC_SUCCESS);
                        return folio;


So, I have two options:

1. Move the count_thp_event(order, THP_ALLOC_SUCCESS); statement ahead of
mem_cgroup_charge(folio, vma->vm_mm, gfp) and retain the else clause.
2. Keep THP_ALLOC_SUCCESS unchanged and remove the else clause.

Since mem_cgroup_charge() rarely fails, option 2 should suffice.

By the way, why do we proceed to the next order after
mem_cgroup_charge() fails?
I assume mem_cgroup_charge() is still likely to fail even after we move to
the next order?


>
> > +                     count_thp_event(order, THP_ALLOC_FAIL);
> >               }
> >  next:
> >               order = next_order(&orders, order);
>
> Thanks,
> Ryan
>

Thanks
Barry




