On 15/01/2024 09:48, Ryan Roberts wrote: > On 12/01/2024 19:14, John Hubbard wrote: >> On 1/12/24 02:00, Ryan Roberts wrote: >>>> ... >>>> After spending a day or two exploring running systems with this, I'd >>>> like to suggest: >>>> >>>> 1) measure "native PMD THPs" vs. pte-mapped mTHPs. This provides a lot >>>> of information: mTHP is configured as expected, and is helping or not, >>>> etc. >>> >>> There is a difference between how a THP is mapped (PTE vs PMD) and its size. A >>> PMD-sized THP can still be mapped with PTEs. So I'd rather not completely filter >>> out PMD-sized THPs, if that's your suggestion. But we could make a distinction >> >> It's not... >> >>> between THPs mapped by PTE and those mapped by PMD; the kernel interface doesn't >>> directly give us this, but we can infer it from the AnonHugePages and *PmdMapped >>> stats in smaps. >> >> Yes, that would be excellent! >> >>> >>>> >>>> 2) Not having to list out all the mTHP sizes would be nice. Instead, >>>> just use the possible sizes from /sys/kernel/mm/transparent_hugepage/* , >>>> unless the user specifies sizes. >>> >>> This is exactly what the tool already does. Perhaps you haven't fully understood >>> the counters that it outputs? >> >> Oh yes, we are in perfect agreement about my not understanding these >> counters. :) I'd even expound upon that a bit: despite having a fairly >> good working understanding of the mTHP implementation in the kernel; >> despite reading and re-reading the thpmaps documentation and peeking a >> number of times at the thpmaps script; and despite poring over the >> thpmaps output, I am still having a rough time with these counters. >> Mainly because there is a set of hidden assumptions, many of which are >> revealed below. > > Oh dear, sorry about that. Thanks for sticking with it and helping me get it right... > >> >> But it's actually just a few key points that were missing from the >> documentation, plus the ability to clearly see the pte-mapped parts.
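[As an aside, the smaps-based inference mentioned above can be sketched in a few lines of Python. This is illustrative only and not thpmaps code: the field names (AnonHugePages, ShmemPmdMapped, FilePmdMapped) are the real smaps counters for PMD-mapped THP memory, but the helper itself is hypothetical.]

```python
# Hypothetical sketch: sum the /proc/<pid>/smaps counters that account
# for PMD-mapped THP memory. Subtracting this total from the overall
# THP-backed memory leaves the PTE-mapped portion. Not thpmaps code.

def pmd_mapped_kb(smaps_text):
    """Return total PMD-mapped THP memory (kB) found in smaps-style text."""
    pmd_fields = ("AnonHugePages:", "ShmemPmdMapped:", "FilePmdMapped:")
    total = 0
    for line in smaps_text.splitlines():
        # Each counter line looks like: "AnonHugePages:      2048 kB"
        if line.startswith(pmd_fields):
            total += int(line.split()[1])
    return total
```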
And >> your proposed changes below look great; I've got a few more to add and >> that should finish the job. > > OK good! > >> >>> >>> You *always* get the following counters (although note the tool *hides* all >> >> Good. It was not clear that these counters were always active. The --cont >> documentation misleads the reader a bit on that matter. >> >>> counters whose value is 0 by default - show them with --inc-empty). This example >>> is for a system with 4K base pages: >>> >>> # thpmaps --pid 1 --summary --inc-empty >>> >>> anon-thp-aligned-16kB: >>> anon-thp-aligned-32kB: >>> anon-thp-aligned-64kB: >>> anon-thp-aligned-128kB: >>> anon-thp-aligned-256kB: >>> anon-thp-aligned-512kB: >>> anon-thp-aligned-1024kB: >>> anon-thp-aligned-2048kB: >>> anon-thp-unaligned-16kB: >>> anon-thp-unaligned-32kB: >>> anon-thp-unaligned-64kB: >>> anon-thp-unaligned-128kB: >>> anon-thp-unaligned-256kB: >>> anon-thp-unaligned-512kB: >>> anon-thp-unaligned-1024kB: >>> anon-thp-unaligned-2048kB: >>> anon-thp-partial: >>> file-thp-aligned-16kB: >>> file-thp-aligned-32kB: >>> file-thp-aligned-64kB: >>> file-thp-aligned-128kB: >>> file-thp-aligned-256kB: >>> file-thp-aligned-512kB: >>> file-thp-aligned-1024kB: >>> file-thp-aligned-2048kB: >>> file-thp-unaligned-16kB: >>> file-thp-unaligned-32kB: >>> file-thp-unaligned-64kB: >>> file-thp-unaligned-128kB: >>> file-thp-unaligned-256kB: >>> file-thp-unaligned-512kB: >>> file-thp-unaligned-1024kB: >>> file-thp-unaligned-2048kB: >>> file-thp-partial: >>> >>> So you have counters for every supported THP size in the system - they will be >>> different for a 64K base page system. >>> >>> anon vs file: hopefully obvious >>> >>> aligned vs unaligned: In both cases the THP is mapped fully and contiguously. In >>> the aligned cases it is mapped so that it is naturally aligned. So a 16K THP is >> >> I think we should use "aligned" or "aligned to <size>", and stop saying >> "naturally aligned", throughout. 
"Natural" adds no additional >> information, and it makes the reader wonder if there is some other >> aspect to the alignment (does natural imply PMD-mapped? etc) that they >> are unaware of. > > OK. I thought "naturally aligned" was a fairly standard and well-understood > term. Google says "We call a datum naturally aligned if its address is aligned > to its size". But I'm happy to use the phrase "aligned to <size>" if that's clearer. > >> >> >>> mapped into VA space on a 16K boundary, a 32K THP on a 32K boundary, etc. >>> >>> partial: Parts of THPs that are partially mapped into VA space. >>> >>> Note this does not draw a distinction between PMD-mapped and PTE-mapped THPs. >>> But a THP can only be PMD-mapped if it is both PMD-aligned and PMD-sized. So >>> only 2 counters can include PMD-mappings; anon-thp-aligned-2048kB and >>> file-thp-aligned-2048kB. We can filter that out by subtracting the relevant >>> smaps counters from them. I could add a --ignore-pmd-mapped flag to do that? Or >> >> That would work but is relatively awkward, but... >> >>> I could rename all the existing counters to include "pte" and introduce 2 new >>> counters: anon-thp-aligned-pmd-2048kB and file-thp-aligned-pmd-2048kB? >> >> ...this would be perfect, I think. The "pte" would help self-document, and >> separating things out allows for a clearer view into the behavior. >> >>> >>> The --cont option will add *additional* special counters, if specified. The idea >>> here is to provide a view on what percentage of memory is getting >>> contpte-mapped. So if you provide "--cont 64K" it will give you a counter >>> showing how much memory is in 64K, naturally aligned blocks (actually 2 >>> counters; file and anon). Those blocks can come from fully mapped and aligned >>> 64K THPs.
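[For the record, the cont-block accounting being described boils down to counting how many aligned blocks of the chosen size fit entirely within a mapped range. The helper below is an illustrative sketch of that arithmetic, not actual thpmaps code; the name cont_blocks is hypothetical.]

```python
# Hypothetical sketch of the --cont accounting: count blocks of `block`
# bytes that are both aligned to `block` and lie entirely within the
# contiguously mapped range [start, start + length). For example, a 128K
# range mapped from a 64K-aligned (but not 128K-aligned) address contains
# two aligned 64K blocks.

def cont_blocks(start, length, block):
    """Count aligned blocks of `block` bytes fully contained in the range."""
    assert block and (block & (block - 1)) == 0, "block must be a power of 2"
    first = (start + block - 1) // block * block  # round start up to alignment
    last = (start + length) // block * block      # round end down to alignment
    return max(0, (last - first) // block)
```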
But they can also come from bigger THPs - for example, if a 128K THP >>> is aligned on a 64K boundary (but not a 128K boundary), then it will provide 2 >>> 64K cont blocks, but it will be counted as unaligned in >>> anon-thp-unaligned-128kB. Or if a 2M THP is partially mapped so that only its >>> first 1M is mapped and aligned on a 64K boundary, then it will be counted in the >>> *-thp-partial counter and would add 1M to the *-cont-aligned-64kB counter. >>> >> >> Interesting, and completely undocumented until now. Let's add this to the >> tool's help output! In fact, all of the above. > > Well it already has this, which I intended to convey the same info: > > --cont size[KMG] Adds anon and file stats for naturally aligned, > contiguously mapped blocks of the specified size. May be > issued multiple times to track multiple sized blocks. > Useful to infer e.g. arm64 contpte and hpa mappings. Size > must be a power-of-2 number of pages. > > But yes, let me work up some improved documentation and send it out for your > review. The reason it's a bit terse at the moment is that I'm using Python's > ArgumentParser for the documentation, and it removes all line breaks from the > description which makes it hard to format longer form docs. Anyway, that's a bad > excuse for bad docs so I'll figure out a solution. Here is my proposed documentation. If you could take a look and let me know if it makes sense, then I'll modify the tool to conform: --8<-- $ ./thpmaps --help usage: thpmaps [-h] [--pid pid | --cgroup path] [--rollup] [--cont size[KMG]] [--inc-smaps] [--inc-empty] [--periodic sleep_ms] Prints information about how transparent huge pages are mapped, either system- wide, or for a specified process or cgroup. A default set of statistics is always generated for THP mappings. However, it is also possible to generate additional statistics for "contiguous block mappings" where the block size is user-defined.
Statistics are maintained independently for anonymous and file-backed (pagecache) memory and are shown both in kB and as a percentage of either total anonymous or total file-backed memory as appropriate. THP Statistics -------------- Statistics are always generated for fully- and contiguously-mapped THPs whose mapping address is aligned to their size, for each <size> supported by the system. Separate counters describe THPs mapped by PTE vs those mapped by PMD. (Although note a THP can only be mapped by PMD if it is PMD-sized): - anon-thp-pte-aligned-<size>kB - file-thp-pte-aligned-<size>kB - anon-thp-pmd-aligned-<size>kB - file-thp-pmd-aligned-<size>kB Similarly, statistics are always generated for fully- and contiguously-mapped THPs whose mapping address is *not* aligned to their size, for each <size> supported by the system. Due to the unaligned mapping, it is impossible to map by PMD, so there are only PTE counters for this case: - anon-thp-pte-unaligned-<size>kB - file-thp-pte-unaligned-<size>kB Statistics are also always generated for mapped pages that belong to a THP but where the THP is *not* fully- and contiguously-mapped. These "partial" mappings are all counted in the same counter regardless of the size of the THP that is partially mapped: - anon-thp-pte-partial - file-thp-pte-partial Contiguous Block Statistics --------------------------- An optional, additional set of statistics is generated for every contiguous block size specified with `--cont <size>`. These statistics show how much memory is mapped in contiguous blocks of <size> and also aligned to <size>. A given contiguous block must all belong to the same THP, but there is no requirement for it to be the *whole* THP.
Separate counters describe contiguous blocks mapped by PTE vs those mapped by PMD: - anon-cont-pte-aligned-<size>kB - file-cont-pte-aligned-<size>kB - anon-cont-pmd-aligned-<size>kB - file-cont-pmd-aligned-<size>kB As an example, if monitoring 64K contiguous blocks (--cont 64K), there are a number of sources that could provide such blocks: a fully- and contiguously-mapped 64K THP that is aligned to a 64K boundary would provide 1 block. A fully- and contiguously-mapped 128K THP that is aligned to at least a 64K boundary would provide 2 blocks. Or a 128K THP that maps only its first 100K, contiguously and starting at a 64K boundary, would provide 1 block. A fully- and contiguously-mapped 2M THP would provide 32 blocks. There are many other possible permutations. optional arguments: -h, --help show this help message and exit --pid pid Process id of the target process. --pid and --cgroup are mutually exclusive. If neither are provided, all processes are scanned to provide system-wide information. --cgroup path Path to the target cgroup in sysfs. Iterates over every pid in the cgroup and its children. --pid and --cgroup are mutually exclusive. If neither are provided, all processes are scanned to provide system-wide information. --rollup Sum the per-vma statistics to provide a summary over the whole system, process or cgroup. --cont size[KMG] Adds stats for memory that is mapped in contiguous blocks of <size> and also aligned to <size>. May be issued multiple times to track multiple sized blocks. Useful to infer e.g. arm64 contpte and hpa mappings. Size must be a power-of-2 number of pages. --inc-smaps Include all numerical, additive /proc/<pid>/smaps stats in the output. --inc-empty Show all statistics including those whose value is 0. --periodic sleep_ms Run in a loop, polling every sleep_ms milliseconds. Requires root privilege to access pagemap and kpageflags. --8<-- Thanks, Ryan > > >> >>> >>> Sorry if I've labored the point here. But I think the only thing the tool
But I think the only thing the tool >>> doesn't already do that you are asking for is to differentiate PTE- vs PMD- >>> mappings? >> >> That, plus explain itself, yes. :) > > Excellent! I'll post a follow up shortly. > >> >>> >>>> >>>> ... >>>> (e.g. /sys/fs/cgroup for cgroup-v2 or >>>>>>> /sys/fs/cgroup/pids for cgroup-v1). Exactly one >>>>>>> of --pid and --cgroup must be provided. >>>>>> >>>>>> Maybe we could add "--global" to that list. That would look, in order, >>>>>> inside cgroups2 and cgroups, for a list of pids, and then run as if >>>>>> --cgroup /sys/fs/cgroup or --cgroup /sys/fs/cgroup/pids were specified. >>>>> >>>>> I think actually it might be better just to make global the default when >>>>> neither >>>>> --pid nor --cgroup are provided? And in this case, I'll just grab all the pids >>>>> from /proc rather than traverse the cgroup hierachy, that way it will work on >>>>> systems without cgroups. Does that work for you? >>>> >>>> Yes! That was my initial idea, in fact, and after over-thinking it for >>>> a while, it turned into the above. haha :) >>> >>> OK great - implemented for v3. >>> >> >> Sweet! >> >> >> thanks, >