On Fri, Apr 5, 2024 at 8:18 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote: > > On 05/04/2024 05:01, Barry Song wrote: > > On Fri, Apr 5, 2024 at 3:57 PM Barry Song <21cnbao@xxxxxxxxx> wrote: > >> > >> On Fri, Apr 5, 2024 at 4:31 AM David Hildenbrand <david@xxxxxxxxxx> wrote: > >>> > >>> On 04.04.24 09:21, Ryan Roberts wrote: > >>>> On 03/04/2024 22:00, Barry Song wrote: > >>>>> On Thu, Apr 4, 2024 at 12:48 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote: > >>>>>> > >>>>>> On 03/04/2024 09:22, David Hildenbrand wrote: > >>>>>>> On 03.04.24 05:55, Barry Song wrote: > >>>>>>>> From: Barry Song <v-songbaohua@xxxxxxxx> > >>>>>>>> > >>>>>>>> Profiling a system blindly with mTHP has become challenging due > >>>>>>>> to the lack of visibility into its operations. Presenting the > >>>>>>>> success rate of mTHP allocations appears to be pressing need. > >>>>>>>> > >>>>>>>> Recently, I've been experiencing significant difficulty debugging > >>>>>>>> performance improvements and regressions without these figures. > >>>>>>>> It's crucial for us to understand the true effectiveness of > >>>>>>>> mTHP in real-world scenarios, especially in systems with > >>>>>>>> fragmented memory. > >>>>>>>> > >>>>>>>> This patch sets up the framework for per-order mTHP counters, > >>>>>>>> starting with the introduction of anon_alloc_success and > >>>>>>>> anon_alloc_fail counters. Incorporating additional counters > >>>>>>>> should now be straightforward as well. > >>>>>>>> > >>>>>>>> Signed-off-by: Barry Song <v-songbaohua@xxxxxxxx> > >>>>>>>> --- > >>>>>>>> -v3: > >>>>>>>> * save some memory as order-0 and order-1 can't be THP, Ryan; > >>>>>>>> * rename to anon_alloc as right now we only support anon to address > >>>>>>>> David's comment; > >>>>>>>> * drop a redundant "else", Ryan > >>>>>>>> > >>>>>>>> include/linux/huge_mm.h | 18 ++++++++++++++ > >>>>>>>> mm/huge_memory.c | 54 +++++++++++++++++++++++++++++++++++++++++ > >>>>>>>> mm/memory.c | 2 ++ > >>>>>>>> 3 files changed, 74 insertions(+) > >>>>>>>> > >>>>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > >>>>>>>> index e896ca4760f6..5e9af6be9537 100644 > >>>>>>>> --- a/include/linux/huge_mm.h > >>>>>>>> +++ b/include/linux/huge_mm.h > >>>>>>>> @@ -70,6 +70,7 @@ extern struct kobj_attribute shmem_enabled_attr; > >>>>>>>> * (which is a limitation of the THP implementation). > >>>>>>>> */ > >>>>>>>> #define THP_ORDERS_ALL_ANON ((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1))) > >>>>>>>> +#define THP_MIN_ORDER 2 > >>>>>>>> /* > >>>>>>>> * Mask of all large folio orders supported for file THP. > >>>>>>>> @@ -264,6 +265,23 @@ unsigned long thp_vma_allowable_orders(struct > >>>>>>>> vm_area_struct *vma, > >>>>>>>> enforce_sysfs, orders); > >>>>>>>> } > >>>>>>>> +enum thp_event_item { > >>>>>>>> + THP_ANON_ALLOC_SUCCESS, > >>>>>>>> + THP_ANON_ALLOC_FAIL, > >>>>>>>> + NR_THP_EVENT_ITEMS > >>>>>>>> +}; > >>>>>>> > >>>>>>> Maybe use a prefix that resembles matches the enum name and is "obviously" > >>>>>>> different to the ones in vm_event_item.h, like > >>>>>>> > >>>>>>> enum thp_event { > >>>>>>> THP_EVENT_ANON_ALLOC_SUCCESS, > >>>>>>> THP_EVENT_ANON_ALLOC_FAIL, > >>>>>>> __THP_EVENT_COUNT, > >>>>>>> }; > >>>>>> > >>>>>> FWIW, I'd personally replace "event" with "stat". For me "event" only ever > >>>>>> increments, but "stat" can increment and decrement. An event is a type of stat. > >>>>>> > >>>>>> You are only adding events for now, but we have identified a need for inc/dec > >>>>>> stats that will be added in future. > >>>>> > >>>>> What about the below? > >>>>> > >>>>> enum thp_stat { > >> > >> It seems we still need to use enum thp_stat_item rather than thp_stat. > >> This follows > >> enum zone_stat_item > >> enum numa_stat_item > >> enum node_stat_item > >> > >> And most importantly, the below looks much better > >> > >> enum thp_stat_item { > >> THP_STAT_ANON_ALLOC, > >> THP_STAT_ANON_ALLOC_FALLBACK, > >> __THP_STAT_COUNT > >> }; > >> > >> struct thp_state { > >> unsigned long state[PMD_ORDER + 1][__THP_STAT_COUNT]; > >> }; > >> > >> DECLARE_PER_CPU(struct thp_state, thp_states); > >> > >> than > >> > >> enum thp_stat { > >> THP_STAT_ANON_ALLOC, > >> THP_STAT_ANON_ALLOC_FALLBACK, > >> __THP_STAT_COUNT > >> }; > >> > >> struct thp_state { > >> unsigned long state[PMD_ORDER + 1][__THP_STAT_COUNT]; > >> }; > >> > >>>>> THP_EVENT_ANON_ALLOC, > >>>>> THP_EVENT_ANON_ALLOC_FALLBACK, > >>>>> THP_EVENT_SWPOUT, > >>>>> THP_EVENT_SWPOUT_FALLBACK, > >>>>> ... > >>>>> THP_NR_ANON_PAGES, > >>>>> THP_NR_FILE_PAGES, > >>>> > >>>> I find this ambiguous; Is it the number of THPs or the number of base pages? > >>>> > >>>> I think David made the point about incorporating the enum name into the labels > >>>> too, so that there can be no namespace confusion. How about: > >>>> > >>>> <enum>_<type>_<name> > >>>> > >>>> So: > >>>> > >>>> THP_STAT_EV_ANON_ALLOC > >>>> THP_STAT_EV_ANON_ALLOC_FALLBACK > >>>> THP_STAT_EV_ANON_PARTIAL > >>>> THP_STAT_EV_SWPOUT > >>>> THP_STAT_EV_SWPOUT_FALLBACK > >>>> ... > >>>> THP_STAT_NR_ANON > >>>> THP_STAT_NR_FILE > >>>> ... > >>>> __THP_STAT_COUNT > >>> > >>> I'd even drop the "EV". "NR_ANON" vs "ANON_ALLOC" etc. is expressive enough. > >> > >> ok. > > > > Hi David, Ryan, > > > > I've named everything as follows. Please let me know if you have any further > > suggestions before I send the updated version :-) > > Please treat all my comments below as optional - I'm just stating my preference! > > > > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > > index e896ca4760f6..cc13fa14aa32 100644 > > --- a/include/linux/huge_mm.h > > +++ b/include/linux/huge_mm.h > > @@ -264,6 +264,23 @@ unsigned long thp_vma_allowable_orders(struct > > vm_area_struct *vma, > > enforce_sysfs, orders); > > } > > > > +enum thp_stat_item { > > + THP_STAT_ANON_ALLOC, > > + THP_STAT_ANON_ALLOC_FALLBACK, > > + __THP_STAT_COUNT > > +}; > > + > > +struct thp_state { > > + unsigned long state[PMD_ORDER + 1][__THP_STAT_COUNT]; > > Why using "state" here? I think "stats" is more appropriate? (as in short for > "statistics") i.e. `struct thp_stats` and `unsigned long stats[][]`. Sounds good. > > > +}; > > + > > +DECLARE_PER_CPU(struct thp_state, thp_states); > > + > > +static inline void count_thp_state(int order, enum thp_stat_item item) > > +{ > > + this_cpu_inc(thp_states.state[order][item]); > > +} > > + > > #define transparent_hugepage_use_zero_page() \ > > (transparent_hugepage_flags & \ > > (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG)) > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > index 9d4b2fbf6872..e704b4408181 100644 > > --- a/mm/huge_memory.c > > +++ b/mm/huge_memory.c > > @@ -526,6 +526,46 @@ static const struct kobj_type thpsize_ktype = { > > .sysfs_ops = &kobj_sysfs_ops, > > }; > > > > +DEFINE_PER_CPU(struct thp_state, thp_states) = {{{0}}}; > > Does this need to be explicitly zeroed? Won't that be the default initial state, > just like for regular globals? Perhaps PER_CPU is special? I really don't know. uninitialized data is usually put in bss section so kernel will memset them to 0. but uninitialized per-cpu data is still in .data..percpu section and I can't find where memset is done in this area as this section can also contain initialized per-cpu data. Will compilers fill zero for uninitialized per-cpu data automatically so kernel just thinks they are initialized 0 data? anyway, I copied the code from vmstat DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}}; > > > + > > +static unsigned long sum_thp_states(int order, enum thp_stat_item item) > > Again, I'd call it sum_thp_stats(). sounds good. > > > +{ > > + unsigned long sum = 0; > > + int cpu; > > + > > + for_each_online_cpu(cpu) { > > + struct thp_state *this = &per_cpu(thp_states, cpu); > > + > > + sum += this->state[order][item]; > > + } > > + > > + return sum; > > +} > > + > > +#define THP_STATE_ATTR(_name, _index) \ > > And THP_STATS_ATTR(); they are going to live in the "stats" directory after all. sounds good. > > > +static ssize_t _name##_show(struct kobject *kobj, \ > > + struct kobj_attribute *attr, char *buf) \ > > +{ \ > > + int order = to_thpsize(kobj)->order; \ > > + \ > > + return sysfs_emit(buf, "%lu\n", sum_thp_states(order, _index)); \ > > +} \ > > +static struct kobj_attribute _name##_attr = __ATTR_RO(_name) > > + > > +THP_STATE_ATTR(anon_alloc, THP_STAT_ANON_ALLOC); > > +THP_STATE_ATTR(anon_alloc_fallback, THP_STAT_ANON_ALLOC_FALLBACK); > > > >> > >>> > >>> -- > >>> Cheers, > >>> > >>> David / dhildenb > > > > Thanks > > Barry >