On Wed, Jan 10, 2024 at 7:21 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>
> On 10/01/2024 11:00, David Hildenbrand wrote:
> > On 10.01.24 11:55, Ryan Roberts wrote:
> >> On 10/01/2024 10:42, David Hildenbrand wrote:
> >>> On 10.01.24 11:38, Ryan Roberts wrote:
> >>>> On 10/01/2024 10:30, Barry Song wrote:
> >>>>> On Wed, Jan 10, 2024 at 6:23 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> >>>>>>
> >>>>>> On 10/01/2024 09:09, Barry Song wrote:
> >>>>>>> On Wed, Jan 10, 2024 at 4:58 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> >>>>>>>>
> >>>>>>>> On 10/01/2024 08:02, Barry Song wrote:
> >>>>>>>>> On Wed, Jan 10, 2024 at 12:16 PM John Hubbard <jhubbard@xxxxxxxxxx> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On 1/9/24 19:51, Barry Song wrote:
> >>>>>>>>>>> On Wed, Jan 10, 2024 at 11:35 AM John Hubbard <jhubbard@xxxxxxxxxx> wrote:
> >>>>>>>>>> ...
> >>>>>>>>>>>> Hi Ryan,
> >>>>>>>>>>>>
> >>>>>>>>>>>> One thing that immediately came up during some recent testing of mTHP on arm64: the pid requirement is sometimes a little awkward. I'm running tests on one machine at a time for now, inside various containers and such, and it would be nice if there were an easy way to get some numbers for the mTHPs across the whole machine.
> >>>>>>>>
> >>>>>>>> Just to confirm, you're expecting these "global" stats to be truly global and not per-container? (Asking because you explicitly mentioned being in a container.) If you want per-container, then you can probably just create the container in a cgroup?
> >>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'm not sure if that changes anything about thpmaps here. Probably this is fine as-is. But I wanted to give some initial reactions from just some quick runs: the global stats would be convenient.
> >>>>>>>>
> >>>>>>>> Thanks for taking this for a spin! Appreciate the feedback.
> >>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> +1. But this seems to be impossible by scanning pagemap? So may we add this statistics information in the kernel, just like /proc/meminfo, or in a separate /proc/mthp_info?
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Yes. From my perspective, it looks like the global stats are more useful initially, and the more detailed per-pid or per-cgroup stats are the next level of investigation. So it feels odd to start with the more detailed stats.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Probably because this can be done without modification of the kernel.
> >>>>>>>>
> >>>>>>>> Yes indeed. As John said in an earlier thread, my previous attempts to add stats directly in the kernel got pushback; DavidH was concerned that we don't really know exactly how to account mTHPs yet (whole/partial/aligned/unaligned/per-size/etc.), so he didn't want to end up adding the wrong ABI and having to maintain it forever. There has also been some pushback regarding adding more values to multi-value files in sysfs, so David was suggesting coming up with a whole new scheme at some point (I know /proc/meminfo isn't sysfs, but the equivalent files for NUMA nodes and cgroups do live in sysfs).
> >>>>>>>>
> >>>>>>>> Anyway, this script was my attempt to 1) provide a short-term solution to the "we need some stats" request and 2) provide a context in which to explore what the right stats are - this script can evolve without the ABI problem.
> >>>>>>>>
> >>>>>>>>> The detailed per-pid or per-cgroup stats are still quite useful for my case, in which we set mTHP enabled/disabled and allowed sizes according to vma types, e.g. libc_malloc, java heaps etc.
> >>>>>>>>>
> >>>>>>>>> Different vma types can have different anon_name. So I can use the detailed info to find out whether specific VMAs have gotten mTHP properly and how many they have gotten.
> >>>>>>>>>
> >>>>>>>>>> However, Ryan did clearly say, above, "In future we may wish to introduce stats directly into the kernel (e.g. smaps or similar)". And earlier he ran into some pushback on trying to set up /proc or /sys values because this is still such an early feature.
> >>>>>>>>>>
> >>>>>>>>>> I wonder if we could put the global stats in debugfs for now? That's specifically supposed to be a "we promise *not* to keep this ABI stable" location.
> >>>>>>>>
> >>>>>>>> Now that I think about it, I wonder if we can add a --global mode to the script (or just infer global when neither --pid nor --cgroup is provided). I think I should be able to determine all the physical memory ranges from /proc/iomem, then grab all the info we need from /proc/kpageflags. We should then be able to process it all in much the same way as for --pid/--cgroup and provide the same stats, but it will apply globally. What do you think?
> >>>>>>
> >>>>>> Having now thought about this for a few mins (in the shower, if anyone wants the complete picture :) ), this won't quite work. This approach doesn't have the virtual mapping information, so the best it can do is tell us "how many of each size of THP are allocated?" - it doesn't tell us anything about whether they are fully or partially mapped or what their alignment is (all necessary if we want to know if they are contpte-mapped). So I don't think this approach is going to be particularly useful.
> >>>>>>
> >>>>>> And this is also the big problem if we want to gather stats inside the kernel; if we want something equivalent to /proc/meminfo's AnonHugePages/ShmemPmdMapped/FilePmdMapped, we need to consider not just the allocation of the THP but also whether it is mapped. That's easy for PMD-mappings, because there is only one entry to consider - when you set it, you increment the number of PMD-mapped THPs, and when you clear it, you decrement. But for PTE-mappings it's harder; you know the size when you are mapping, so it's easy to increment, but you can do a partial unmap, so you would need to scan the PTEs to figure out if we are unmapping the first page of a previously fully-PTE-mapped THP, which is expensive. We would need a cheap mechanism to determine "is this folio fully and contiguously mapped in at least one process?".
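
To make the limitation Ryan describes concrete, below is a rough user-space sketch of the physical-only scan. This is not the thpmaps script; the function names, the 4 KiB base-page assumption, the unchunked reads and the missing error handling are all illustrative. It walks the "System RAM" ranges from /proc/iomem and counts allocated compound THPs per size from /proc/kpageflags (root required). As discussed above, it can only say how many folios of each size exist, not whether they are fully or contiguously mapped anywhere.

#!/usr/bin/env python3
# Hypothetical helper (not the posted thpmaps script): count allocated THPs
# per size by scanning /proc/kpageflags over the "System RAM" ranges listed
# in /proc/iomem. Requires root; assumes 4 KiB base pages.
import struct
from collections import Counter

PAGE_SIZE = 4096
KPF_COMPOUND_HEAD = 1 << 15   # bit numbers from linux/kernel-page-flags.h
KPF_COMPOUND_TAIL = 1 << 16
KPF_THP = 1 << 22

def system_ram_pfn_ranges():
    """Return (start_pfn, end_pfn) for each System RAM range in /proc/iomem."""
    ranges = []
    with open('/proc/iomem') as f:
        for line in f:
            if line.strip().endswith('System RAM'):
                span = line.split(':')[0].strip()
                start, end = (int(x, 16) for x in span.split('-'))
                ranges.append((start // PAGE_SIZE, end // PAGE_SIZE + 1))
    return ranges

def thp_alloc_counts():
    """Histogram of allocated THP sizes (KiB -> count), ignoring how they are mapped."""
    counts = Counter()
    with open('/proc/kpageflags', 'rb') as kpf:
        for start_pfn, end_pfn in system_ram_pfn_ranges():
            kpf.seek(start_pfn * 8)
            # For brevity each range is read in one go; a real tool would chunk this.
            raw = kpf.read((end_pfn - start_pfn) * 8)
            nwords = len(raw) // 8
            flags = struct.unpack(f'<{nwords}Q', raw[:nwords * 8])
            i = 0
            while i < nwords:
                if flags[i] & KPF_THP and flags[i] & KPF_COMPOUND_HEAD:
                    # Count tail pages following the head to infer the folio size.
                    npages = 1
                    while i + npages < nwords and flags[i + npages] & KPF_COMPOUND_TAIL:
                        npages += 1
                    counts[npages * PAGE_SIZE // 1024] += 1
                    i += npages
                else:
                    i += 1
    return counts

if __name__ == '__main__':
    for size_kb, num in sorted(thp_alloc_counts().items()):
        print(f'{size_kb}kB THPs allocated: {num}')

Run as root, this prints a per-size allocation histogram, which is still a handy sanity check alongside the per-process mapping breakdown the script gives, but it cannot answer the full/partial/alignment questions raised above.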
> >>>>>
> >>>>> As I shared with you before, OPPO's approach is maintaining two mapcounts:
> >>>>> 1. entire map
> >>>>> 2. subpage's map
> >>>>> 3. if 1 and 2 both exist, it is DoubleMapped.
> >>>>>
> >>>>> This isn't a problem for us, and every time we do a partial unmap, we have an explicit cont_pte split which will decrease the entire mapcount and increase the subpage's mapcount.
> >>>>>
> >>>>> But its downside is that we expose this info to mm-core.
> >>>>
> >>>> OK, but I think we have a slightly more generic situation going on with the upstream; if I've understood correctly, you are using the PTE_CONT bit in the PTE to determine if it's fully mapped? That works for your case, where you only have 1 size of THP that you care about (contpte-size). But for the upstream, we have multi-size THP, so we can't use the PTE_CONT bit to determine if it's fully mapped, because we can only use that bit if the THP is at least 64K and aligned, and only on arm64. We would need a SW bit for this purpose, and the mm would need to update that SW bit for every PTE on the full -> partial map transition.
> >>>
> >>> Oh no. Let's not make everything more complicated for the purpose of some stats.
> >>>
> >> Indeed, I was intending to argue *against* doing it this way. Fundamentally, if we want to know what's fully mapped and what's not, then I don't see any way other than by scanning the page tables, and we might as well do that in user space with this script.
> >>
> >> Although, I expect you will shortly make a proposal that is simple to implement and prove me wrong ;-)
> >
> > Unlikely :) As you said, once you have multiple folio sizes, it stops really making sense.
> >
> > Assume you have a 128 kiB pagecache folio, and half of that is mapped. You can set cont-pte bits on that half and all is fine. Or AMD can benefit from its optimizations without the cont-pte bit and everything is fine.
>
> Yes, but for debug and optimization, it's useful to know when THPs are fully/partially mapped, when they are unaligned, etc. Anyway, the script does that for us, and I think we are tending towards agreement that there are unlikely to be any cost benefits from moving it into the kernel.
>
> >
> > We want simple stats that tell us which folio sizes are actually allocated. For everything else, just scan the process to figure out what exactly is going on.
> >
>
> Certainly that's much easier to do. But is it valuable? It might be if we also keep stats for the number of failures to allocate the various sizes - then we can see what percentage of high-order allocation attempts are successful, which is probably useful.

+1, this is particularly useful once memory is fragmented. On an embedded device, this is absolutely the case after the system has been running for a while. That's why we have to set up a large folio pool in our products; otherwise, large folios only have a positive impact in the first hour.

For your reference, I am posting some stats I have on my phone using OPPO's large folios approach:

:/ # cat /proc/<oppo's large folios>/stat
pool_size e.g. <4GB> ---- we have a pool to help the success of large folio allocation
pool_low ----- watermarks for the pool; we may begin to reclaim large folios when the pool has limited free memory
pool_high
thp_cow 1011488 ----- we are doing CoW for large folios
thp_cow_fallback 584494 ------ we fail to allocate large folios for CoW, then fall back to normal pages
...
madv_free_unaligned 9159 ------ userspace unaligned madv_free
madv_dont_need_unaligned 11358 ------ userspace unaligned madv_dontneed
....
thp_do_anon_pages 131289845
thp_do_anon_pages_fallback 88911215 ----- fallback to normal pages in do_anon_pages
thp_swpin_no_swapcache_entry ----- swapin large folios
thp_swpin_no_swapcache_fallback_entry ----- swapin large folios fallback...
thp_swpin_swapcache_entry ----- swapin large folios/swapcache case
thp_swpin_swapcache_fallback_entry ----- swapin large folios/swapcache fallback to normal pages
thp_file_entry 28998
thp_file_alloc_success 27334
thp_file_alloc_fail 1664
....
PartialMappedTHP: 29312 kB ---- these are folios which have at some point not been entirely mapped
----- this is also what I am suggesting we add, in several recent replies :-)
....

Thanks
Barry
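
As a footnote to the "what percentage of high-order allocation attempts are successful" point Ryan makes above, a counter file like the one Barry quotes makes that easy to compute in user space. The sketch below is illustrative only: the stat file is OPPO-specific and its path is elided above, so the path is passed in as an argument, and which counter pairs with which fallback counter is inferred from Barry's annotations rather than from the vendor code.

#!/usr/bin/env python3
# Hypothetical parser for a vendor "large folios" stat file like the one
# quoted above; counter names are taken from the excerpt, and the pairing of
# success/fallback counters is an assumption based on the annotations.
import sys

# (success-or-attempt counter, fallback/fail counter)
PAIRS = [
    ('thp_cow', 'thp_cow_fallback'),
    ('thp_do_anon_pages', 'thp_do_anon_pages_fallback'),
    ('thp_file_alloc_success', 'thp_file_alloc_fail'),
]

def read_counters(path):
    """Collect 'name value' lines, ignoring counters printed without a value."""
    counters = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 2 and fields[1].isdigit():
                counters[fields[0]] = int(fields[1])
    return counters

def report(path):
    c = read_counters(path)
    for ok, fail in PAIRS:
        if ok in c and fail in c and (c[ok] + c[fail]):
            # Treat "ok" as successes and "fail" as fallbacks; if the vendor
            # counts attempts rather than successes in "ok", adjust accordingly.
            total = c[ok] + c[fail]
            print(f'{ok}: {c[ok]} ok, {c[fail]} fallback '
                  f'({100.0 * c[ok] / total:.1f}% success)')

if __name__ == '__main__':
    report(sys.argv[1])   # path to the vendor stat file quoted above

For the file-THP case the excerpt itself suggests the relation (thp_file_entry is roughly thp_file_alloc_success + thp_file_alloc_fail), so the success rate there comes out at about 94%; the CoW and anon pairs may count attempts rather than successes, so treat those ratios with more care.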