On Thu, Mar 28, 2024 at 5:51 AM Barry Song <21cnbao@xxxxxxxxx> wrote:
>
> From: Barry Song <v-songbaohua@xxxxxxxx>
>
> Profiling a system blindly with mTHP has become challenging due
> to the lack of visibility into its operations. Presenting the
> success rate of mTHP allocations appears to be a pressing need.
>
> Recently, I've been experiencing significant difficulty debugging
> performance improvements and regressions without these figures.
> It's crucial for us to understand the true effectiveness of
> mTHP in real-world scenarios, especially in systems with
> fragmented memory.
>
> This patch sets up the framework for per-order mTHP counters,
> starting with the introduction of alloc_success and alloc_fail
> counters. Incorporating additional counters should now be
> straightforward as well.
>
> The initial two unsigned longs for each event are unused, given
> that order-0 and order-1 are not mTHP. Nonetheless, this refinement
> improves code clarity.
>
> Signed-off-by: Barry Song <v-songbaohua@xxxxxxxx>
> ---
> -v2:
>  * move to sysfs and provide per-order counters; David, Ryan, Willy
> -v1:
>  https://lore.kernel.org/linux-mm/20240326030103.50678-1-21cnbao@xxxxxxxxx/
>
>  include/linux/huge_mm.h | 17 +++++++++++++
>  mm/huge_memory.c        | 54 +++++++++++++++++++++++++++++++++++++++++
>  mm/memory.c             |  3 +++
>  3 files changed, 74 insertions(+)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index e896ca4760f6..27fa26a22a8f 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -264,6 +264,23 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>  			   enforce_sysfs, orders);
>  }
>
> +enum thp_event_item {
> +	THP_ALLOC_SUCCESS,
> +	THP_ALLOC_FAIL,
> +	NR_THP_EVENT_ITEMS
> +};
> +
> +struct thp_event_state {
> +	unsigned long event[PMD_ORDER + 1][NR_THP_EVENT_ITEMS];
> +};
> +
> +DECLARE_PER_CPU(struct thp_event_state, thp_event_states);

Do we have existing per-CPU counters that cover all possible THP orders?
I.e., foo_counter[PMD_ORDER + 1][BAR_ITEMS]. I don't think we do, but I want to double-check. This might be fine if BAR_ITEMS is global, not per memcg; otherwise, on larger systems, e.g., 512 CPUs, which is not uncommon, we'd have high per-CPU memory overhead. Per-CPU memory overhead has been a problem for Google's datacenters.

I'm not against this patch, since NR_THP_EVENT_ITEMS is not per memcg. Alternatively, we could make the per-CPU counters track only one order and flush the local counter to a global atomic counter whenever the new order doesn't match the order already stored in the local counter. WDYT?