Commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options") introduced the option for clearing/initializing kernel pages on allocation time - this can be achieved either using a parameter or a Kconfig setting. The goal for the change was a kernel hardening measure. Despite the performance degradation with "init_on_alloc" is considered low, there is one case in which it can be noticed and it may impact latency of the system - this is when hugetlb pages are allocated. Those pages are meant to be used by userspace *only* (differently of THP, for example). In allocation time for hugetlb, their component pages go through the initialization/clearing process in prep_new_page() kernel_init_free_pages() and, when used in userspace mappings, the hugetlb are _again_ cleared; I've checked that in practice by running the kernel selftest[0] for hugetlb mapping - the pages go through clear_huge_pages() on page fault [ see hugetlb_no_page() ]. This patch proposes a way to prevent this resource waste by skipping the page initialization/clearing if the page is a component of hugetlb page (even if "init_on_alloc" or the respective Kconfig are set). The performance improvement measured in [1] demonstrates that it is effective and bring the hugetlb allocation time to the same level as with "init_on_alloc" disabled. Despite we've used sysctl to allocate hugetlb pages in our tests, the same delay happens in early boot time when hugetlb parameters are set on kernel cmdline (and "init_on_alloc" is set). [0] tools/testing/selftests/vm/map_hugetlb.c [1] Test results - all tests executed in a pristine kernel 5.9+, from 2020-10-19, at commit 7cf726a594353010. A virtual machine with 96G of total memory was used, the test consists in allocating 64G of 2M hugetlb pages. Results below: * Without this patch, init_on_alloc=1 $ cat /proc/meminfo |grep "MemA\|Hugetlb" MemAvailable: 97892212 kB Hugetlb: 0 kB $ time echo 32768 > /proc/sys/vm/nr_hugepages real 0m24.189s user 0m0.000s sys 0m24.184s $ cat /proc/meminfo |grep "MemA\|Hugetlb" MemAvailable: 30784732 kB Hugetlb: 67108864 kB * Without this patch, init_on_alloc=0 $ cat /proc/meminfo |grep "MemA\|Hugetlb" MemAvailable: 97892752 kB Hugetlb: 0 kB $ time echo 32768 > /proc/sys/vm/nr_hugepages real 0m0.316s user 0m0.000s sys 0m0.316s $ cat /proc/meminfo |grep "MemA\|Hugetlb" MemAvailable: 30783628 kB Hugetlb: 67108864 kB * WITH this patch, init_on_alloc=1 $ cat /proc/meminfo |grep "MemA\|Hugetlb" MemAvailable: 97891952 kB Hugetlb: 0 kB $ time echo 32768 > /proc/sys/vm/nr_hugepages real 0m0.209s user 0m0.000s sys 0m0.209s $ cat /proc/meminfo |grep "MemA\|Hugetlb" MemAvailable: 30782964 kB Hugetlb: 67108864 kB * WITH this patch, init_on_alloc=0 $ cat /proc/meminfo |grep "MemA\|Hugetlb" MemAvailable: 97892620 kB Hugetlb: 0 kB $ time echo 32768 > /proc/sys/vm/nr_hugepages real 0m0.206s user 0m0.000s sys 0m0.206s $ cat /proc/meminfo |grep "MemA\|Hugetlb" MemAvailable: 30783804 kB Hugetlb: 67108864 kB Signed-off-by: Guilherme G. Piccoli <gpiccoli@xxxxxxxxxxxxx> Suggested-by: Thadeu Lima de Souza Cascardo <cascardo@xxxxxxxxxxxxx> Cc: Alexander Potapenko <glider@xxxxxxxxxx> Cc: James Morris <jamorris@xxxxxxxxxxxxxxxxxxx> Cc: Kees Cook <keescook@xxxxxxxxxxxx> Cc: Michal Hocko <mhocko@xxxxxxx> Cc: Mike Kravetz <mike.kravetz@xxxxxxxxxx> --- Hi everybody, thanks in advance for the review/comments. I'd like to point 2 things related to the implementation: 1) I understand that adding GFP flags is not really welcome by the mm community; I've considered passing that as function parameter but that would be a hacky mess, so I decided to add the flag since it seems this is a fair use of the flag mechanism (to control actions on pages). If anybody has a better/simpler suggestion to implement this, I'm all ears - thanks! 2) The checkpatch script gave me the following error, but I decided to ignore it in order to maintain the same format present in the file: ERROR: space required after that close brace '}' #171: FILE: include/trace/events/mmflags.h:52: include/linux/gfp.h | 14 +++++++++----- include/linux/mm.h | 2 +- include/trace/events/mmflags.h | 3 ++- mm/hugetlb.c | 7 +++++++ tools/perf/builtin-kmem.c | 1 + 5 files changed, 20 insertions(+), 7 deletions(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index c603237e006c..c03909f8e7b6 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -39,8 +39,9 @@ struct vm_area_struct; #define ___GFP_HARDWALL 0x100000u #define ___GFP_THISNODE 0x200000u #define ___GFP_ACCOUNT 0x400000u +#define ___GFP_NOINIT_ON_ALLOC 0x800000u #ifdef CONFIG_LOCKDEP -#define ___GFP_NOLOCKDEP 0x800000u +#define ___GFP_NOLOCKDEP 0x1000000u #else #define ___GFP_NOLOCKDEP 0 #endif @@ -215,16 +216,19 @@ struct vm_area_struct; * %__GFP_COMP address compound page metadata. * * %__GFP_ZERO returns a zeroed page on success. + * + * %__GFP_NOINIT_ON_ALLOC avoids uspace pages to be double-cleared (like HugeTLB) */ -#define __GFP_NOWARN ((__force gfp_t)___GFP_NOWARN) -#define __GFP_COMP ((__force gfp_t)___GFP_COMP) -#define __GFP_ZERO ((__force gfp_t)___GFP_ZERO) +#define __GFP_NOWARN ((__force gfp_t)___GFP_NOWARN) +#define __GFP_COMP ((__force gfp_t)___GFP_COMP) +#define __GFP_ZERO ((__force gfp_t)___GFP_ZERO) +#define __GFP_NOINIT_ON_ALLOC ((__force gfp_t)___GFP_NOINIT_ON_ALLOC) /* Disable lockdep for GFP context tracking */ #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP) /* Room for N __GFP_FOO bits */ -#define __GFP_BITS_SHIFT (23 + IS_ENABLED(CONFIG_LOCKDEP)) +#define __GFP_BITS_SHIFT (24 + IS_ENABLED(CONFIG_LOCKDEP)) #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1)) /** diff --git a/include/linux/mm.h b/include/linux/mm.h index ef360fe70aaf..7fa60d22a90a 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2882,7 +2882,7 @@ DECLARE_STATIC_KEY_FALSE(init_on_alloc); static inline bool want_init_on_alloc(gfp_t flags) { if (static_branch_unlikely(&init_on_alloc) && - !page_poisoning_enabled()) + !page_poisoning_enabled() && !(flags & __GFP_NOINIT_ON_ALLOC)) return true; return flags & __GFP_ZERO; } diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index 67018d367b9f..89b0c0ddcc52 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -48,7 +48,8 @@ {(unsigned long)__GFP_WRITE, "__GFP_WRITE"}, \ {(unsigned long)__GFP_RECLAIM, "__GFP_RECLAIM"}, \ {(unsigned long)__GFP_DIRECT_RECLAIM, "__GFP_DIRECT_RECLAIM"},\ - {(unsigned long)__GFP_KSWAPD_RECLAIM, "__GFP_KSWAPD_RECLAIM"}\ + {(unsigned long)__GFP_KSWAPD_RECLAIM, "__GFP_KSWAPD_RECLAIM"},\ + {(unsigned long)__GFP_NOINIT_ON_ALLOC, "__GFP_NOINIT_ON_ALLOC"}\ #define show_gfp_flags(flags) \ (flags) ? __print_flags(flags, "|", \ diff --git a/mm/hugetlb.c b/mm/hugetlb.c index fe76f8fd5a73..c60a6726b0be 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1742,6 +1742,13 @@ static struct page *alloc_fresh_huge_page(struct hstate *h, { struct page *page; + /* + * Signal the following page allocs to avoid them being cleared + * in allocation time - since HugeTLB pages are *only* used as + * userspace pages, they'll be cleared by default before usage. + */ + gfp_mask |= __GFP_NOINIT_ON_ALLOC; + if (hstate_is_gigantic(h)) page = alloc_gigantic_page(h, gfp_mask, nid, nmask); else diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c index a50dae2c4ae9..bef90d8bb7f6 100644 --- a/tools/perf/builtin-kmem.c +++ b/tools/perf/builtin-kmem.c @@ -660,6 +660,7 @@ static const struct { { "__GFP_RECLAIM", "R" }, { "__GFP_DIRECT_RECLAIM", "DR" }, { "__GFP_KSWAPD_RECLAIM", "KR" }, + { "__GFP_NOINIT_ON_ALLOC", "NIA" }, }; static size_t max_gfp_len; -- 2.28.0