> On Jan 28, 2021, at 8:33 AM, Zi Yan <ziy@xxxxxxxxxx> wrote:
>
> On 28 Jan 2021, at 5:49, Saravanan D wrote:
>
>> To help with debugging the sluggishness caused by TLB miss/reload,
>> we introduce monotonic lifetime hugepage split event counts since
>> system state: SYSTEM_RUNNING to be displayed as part of
>> /proc/vmstat in x86 servers
>>
>> The lifetime split event information will be displayed at the bottom of
>> /proc/vmstat
>> ....
>> swap_ra 0
>> swap_ra_hit 0
>> direct_map_level2_splits 94
>> direct_map_level3_splits 4
>> nr_unstable 0
>> ....
>>
>> One of the many lasting (as we don't coalesce back) sources for huge page
>> splits is tracing as the granular page attribute/permission changes would
>> force the kernel to split code segments mapped to huge pages to smaller
>> ones thereby increasing the probability of TLB miss/reload even after
>> tracing has been stopped.
>
> It is interesting to see this statement saying splitting kernel direct mappings
> causes performance loss, when Zhengjun (cc’d) from Intel recently posted
> a kernel direct mapping performance report[1] saying 1GB mappings are good
> but not much better than 2MB and 4KB mappings.
>
> I would love to hear the stories from both sides. Or maybe I misunderstand
> anything.

We had an issue about 1.5 years ago, when ftrace split a 2MB kernel text page
table entry into 512 4kB ones. This split caused a ~1% performance regression.
That instance was fixed in [1].

Saravanan, could you please share more information about the split? Is it
possible to avoid the split? If not, can we regroup after tracing is disabled?

We have split-and-regroup logic for application .text on THP. When a uprobe
is attached to the THP text, we have to split the 2MB page table entry. So we
introduced a mechanism to regroup the 2MB page table entry when all uprobes
are removed from the THP [2].

Thanks,
Song

[1] commit 7af0145067bc ("x86/mm/cpa: Prevent large page split when ftrace flips RW on kernel text")
[2] commit f385cb85a42f ("uprobe: collapse THP pmd after removing all uprobes")

>
>
> [1]https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@xxxxxxxxxxxxxxx/
>>
>> Documentation regarding linear mapping split events added to admin-guide
>> as requested in V3 of the patch.
>>
>> Signed-off-by: Saravanan D <saravanand@xxxxxx>
>> ---
>>  .../admin-guide/mm/direct_mapping_splits.rst  | 59 +++++++++++++++++++
>>  Documentation/admin-guide/mm/index.rst        |  1 +
>>  arch/x86/mm/pat/set_memory.c                  |  8 +++
>>  include/linux/vm_event_item.h                 |  4 ++
>>  mm/vmstat.c                                   |  4 ++
>>  5 files changed, 76 insertions(+)
>>  create mode 100644 Documentation/admin-guide/mm/direct_mapping_splits.rst
>>
>> diff --git a/Documentation/admin-guide/mm/direct_mapping_splits.rst b/Documentation/admin-guide/mm/direct_mapping_splits.rst
>> new file mode 100644
>> index 000000000000..298751391deb
>> --- /dev/null
>> +++ b/Documentation/admin-guide/mm/direct_mapping_splits.rst
>> @@ -0,0 +1,59 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +
>> +=====================
>> +Direct Mapping Splits
>> +=====================
>> +
>> +Kernel maps all of physical memory in linear/direct mapped pages with
>> +translation of virtual kernel address to physical address is achieved
>> +through a simple subtraction of offset. CPUs maintain a cache of these
>> +translations on fast caches called TLBs. CPU architectures like x86 allow
>> +direct mapping large portions of memory into hugepages (2M, 1G, etc) in
>> +various page table levels.
>> +
>> +Maintaining huge direct mapped pages greatly reduces TLB miss pressure.
>> +The splintering of huge direct pages into smaller ones does result in
>> +a measurable performance hit caused by frequent TLB miss and reloads.
>> +
>> +One of the many lasting (as we don't coalesce back) sources for huge page
>> +splits is tracing as the granular page attribute/permission changes would
>> +force the kernel to split code segments mapped to hugepages to smaller
>> +ones thus increasing the probability of TLB miss/reloads even after
>> +tracing has been stopped.
>> +
>> +On x86 systems, we can track the splitting of huge direct mapped pages
>> +through lifetime event counters in ``/proc/vmstat``
>> +
>> +    direct_map_level2_splits xxx
>> +    direct_map_level3_splits yyy
>> +
>> +where:
>> +
>> +direct_map_level2_splits
>> +    are 2M/4M hugepage split events
>> +direct_map_level3_splits
>> +    are 1G hugepage split events
>> +
>> +The distribution of direct mapped system memory in various page sizes
>> +post splits can be viewed through ``/proc/meminfo`` whose output
>> +will include the following lines depending upon supporting CPU
>> +architecture
>> +
>> +    DirectMap4k:    xxxxx kB
>> +    DirectMap2M:    yyyyy kB
>> +    DirectMap1G:    zzzzz kB
>> +
>> +where:
>> +
>> +DirectMap4k
>> +    is the total amount of direct mapped memory (in kB)
>> +    accessed through 4k pages
>> +DirectMap2M
>> +    is the total amount of direct mapped memory (in kB)
>> +    accessed through 2M pages
>> +DirectMap1G
>> +    is the total amount of direct mapped memory (in kB)
>> +    accessed through 1G pages
>> +
>> +
>> +-- Saravanan D, Jan 27, 2021
>> diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
>> index 4b14d8b50e9e..9439780f3f07 100644
>> --- a/Documentation/admin-guide/mm/index.rst
>> +++ b/Documentation/admin-guide/mm/index.rst
>> @@ -38,3 +38,4 @@ the Linux memory management.
>>     soft-dirty
>>     transhuge
>>     userfaultfd
>> +   direct_mapping_splits
>> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
>> index 16f878c26667..a7b3c5f1d316 100644
>> --- a/arch/x86/mm/pat/set_memory.c
>> +++ b/arch/x86/mm/pat/set_memory.c
>> @@ -16,6 +16,8 @@
>>  #include <linux/pci.h>
>>  #include <linux/vmalloc.h>
>>  #include <linux/libnvdimm.h>
>> +#include <linux/vmstat.h>
>> +#include <linux/kernel.h>
>>
>>  #include <asm/e820/api.h>
>>  #include <asm/processor.h>
>> @@ -91,6 +93,12 @@ static void split_page_count(int level)
>>                  return;
>>
>>          direct_pages_count[level]--;
>> +        if (system_state == SYSTEM_RUNNING) {
>> +                if (level == PG_LEVEL_2M)
>> +                        count_vm_event(DIRECT_MAP_LEVEL2_SPLIT);
>> +                else if (level == PG_LEVEL_1G)
>> +                        count_vm_event(DIRECT_MAP_LEVEL3_SPLIT);
>> +        }
>>          direct_pages_count[level - 1] += PTRS_PER_PTE;
>>  }
>>
>> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
>> index 18e75974d4e3..7c06c2bdc33b 100644
>> --- a/include/linux/vm_event_item.h
>> +++ b/include/linux/vm_event_item.h
>> @@ -120,6 +120,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>>  #ifdef CONFIG_SWAP
>>                  SWAP_RA,
>>                  SWAP_RA_HIT,
>> +#endif
>> +#ifdef CONFIG_X86
>> +                DIRECT_MAP_LEVEL2_SPLIT,
>> +                DIRECT_MAP_LEVEL3_SPLIT,
>>  #endif
>>                  NR_VM_EVENT_ITEMS
>>  };
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index f8942160fc95..a43ac4ac98a2 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -1350,6 +1350,10 @@ const char * const vmstat_text[] = {
>>          "swap_ra",
>>          "swap_ra_hit",
>>  #endif
>> +#ifdef CONFIG_X86
>> +        "direct_map_level2_splits",
>> +        "direct_map_level3_splits",
>> +#endif
>>  #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
>>  };
>>  #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
>> --
>> 2.24.1
>
>
> --
> Best Regards,
> Yan Zi
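
For anyone who wants to watch these counters from userspace while tracing is toggled, here is a minimal Python sketch. It assumes the counter names from the quoted patch (direct_map_level2_splits, direct_map_level3_splits) show up in /proc/vmstat as plain "name value" lines. The function names parse_split_counters and read_split_counters are made up for illustration and are not part of the patch. On a kernel without the patch the result is simply empty.

```python
"""Sketch: pull the direct-map split counters out of /proc/vmstat text."""

# Counter names taken from the quoted patch; they are only present on
# x86 kernels that carry it (assumption, not verified against mainline).
SPLIT_COUNTERS = ("direct_map_level2_splits", "direct_map_level3_splits")


def parse_split_counters(vmstat_text):
    """Return {name: count} for the split counters found in vmstat_text."""
    counts = {}
    for line in vmstat_text.splitlines():
        fields = line.split()
        # Each /proc/vmstat line is "<name> <value>"; skip anything else.
        if len(fields) == 2 and fields[0] in SPLIT_COUNTERS:
            counts[fields[0]] = int(fields[1])
    return counts


def read_split_counters(path="/proc/vmstat"):
    """Read the live counters; returns {} if the kernel does not export them."""
    with open(path) as f:
        return parse_split_counters(f.read())


if __name__ == "__main__":
    # Sample text copied from the commit message excerpt, not from a live system.
    sample = (
        "swap_ra 0\n"
        "swap_ra_hit 0\n"
        "direct_map_level2_splits 94\n"
        "direct_map_level3_splits 4\n"
        "nr_unstable 0\n"
    )
    print(parse_split_counters(sample))
```

Calling read_split_counters() before enabling tracing and again after disabling it, then subtracting the two results, would show how many splits the tracing window caused, since the counters only ever increase.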