On Mon, 2022-10-10 at 19:08 +0000, Song Liu wrote:
> > On Oct 10, 2022, at 11:32 AM, Edgecombe, Rick P <
> > rick.p.edgecombe@xxxxxxxxx> wrote:
> >
> > On Fri, 2022-10-07 at 16:43 -0700, Song Liu wrote:
> > > On x86 kernel, we allocate 2MB pages for kernel text up to
> > > round_down(_etext, 2MB). Therefore, some of the kernel text is still
> > > on 4kB pages. With vmalloc_exec, we can allocate 2MB pages up to
> > > round_up(_etext, 2MB), and use the rest of the page for modules and
> > > BPF programs.
> > >
> > > Here is an example:
> > >
> > > [root@eth50-1 ~]# grep _etext /proc/kallsyms
> > > ffffffff82202a08 T _etext
> > >
> > > [root@eth50-1 ~]# grep bpf_prog_ /proc/kallsyms | tail -n 3
> > > ffffffff8220f920 t bpf_prog_cc61a5364ac11d93_handle__sched_wakeup [bpf]
> > > ffffffff8220fa28 t bpf_prog_cc61a5364ac11d93_handle__sched_wakeup_new [bpf]
> > > ffffffff8220fad4 t bpf_prog_3bf73fa16f5e3d92_handle__sched_switch [bpf]
> > >
> > > [root@eth50-1 ~]# grep 0xffffffff82200000 /sys/kernel/debug/page_tables/kernel
> > > 0xffffffff82200000-0xffffffff82400000 2M ro PSE x pmd
> > >
> > > [root@eth50-1 ~]# grep xfs_flush_inodes /proc/kallsyms
> > > ffffffff822ba910 t xfs_flush_inodes_worker [xfs]
> > > ffffffff822bc580 t xfs_flush_inodes [xfs]
> > >
> > > ffffffff82200000-ffffffff82400000 is a 2MB page, serving kernel text, xfs
> > > module, and bpf programs.
> >
> > Can this memory range be freed as part of a vfree_exec() call then?
> > Does vmalloc actually try to unmap it? If so, it could get complicated
> > with PTI.
> >
> > It probably should be a special case that never gets fully freed.
>
> Right, this is never freed.

Can we get a comment somewhere highlighting how this is avoided?

Maybe this is just me missing some vmalloc understanding, but this
pointer to an all zero vm_struct seems weird too. Are there other vmap
allocations like this? Which vmap APIs work with this and which don't?

>
> >
> > >
> > > Signed-off-by: Song Liu <song@xxxxxxxxxx>
> > > ---
> > >  arch/x86/mm/init_64.c |  3 ++-
> > >  mm/vmalloc.c          | 24 ++++++++++++++++++++++++
> > >  2 files changed, 26 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> > > index 0fe690ebc269..d94f196c541a 100644
> > > --- a/arch/x86/mm/init_64.c
> > > +++ b/arch/x86/mm/init_64.c
> > > @@ -1367,12 +1367,13 @@ int __init deferred_page_init_max_threads(const struct cpumask *node_cpumask)
> > >
> > >  int kernel_set_to_readonly;
> > >
> > > +#define PMD_ALIGN(x) (((unsigned long)(x) + (PMD_SIZE - 1)) & PMD_MASK)
> > >  void mark_rodata_ro(void)
> > >  {
> > >          unsigned long start = PFN_ALIGN(_text);
> > >          unsigned long rodata_start = PFN_ALIGN(__start_rodata);
> > >          unsigned long end = (unsigned long)__end_rodata_hpage_align;
> > > -        unsigned long text_end = PFN_ALIGN(_etext);
> > > +        unsigned long text_end = PMD_ALIGN(_etext);
> >
> > This should probably have more logic and adjustments. If etext is PMD
> > aligned, some of the stuff outside the diff won't do anything.
>
> Hmm.. I don't quite follow this comment. If the etext is PMD aligned,
> we can still use vmalloc_exec to allocate memory. So it shouldn't
> matter, no?

Maybe this doesn't matter since PMD alignment must happen naturally
sometimes. I was just noticing the attempts to operate on this region
between etext and start_rodata (free_init_pages(), etc). If this was
never not PMD aligned they could be dropped. But if you are going to
adjust the behavior for !CONFIG_MODULES, etc, then it is still needed.
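
To make the alignment arithmetic concrete, here is a minimal userspace
sketch of the round_down/round_up math, using the _etext value from the
example above. It assumes x86-64 where one PMD entry maps 2MB; the
PMD_SHIFT/PMD_SIZE/PMD_MASK definitions below are local stand-ins rather
than the kernel headers, and the helper names (pmd_round_up,
pmd_round_down, pmd_tail_bytes) are hypothetical, made up for
illustration only.

```c
#include <stdint.h>

/* Assumption: x86-64, where a single PMD entry maps a 2MB region. */
/* These are local stand-ins for the kernel's definitions. */
#define PMD_SHIFT 21
#define PMD_SIZE  (UINT64_C(1) << PMD_SHIFT)
#define PMD_MASK  (~(PMD_SIZE - 1))

/* Same arithmetic as the PMD_ALIGN() macro in the quoted patch. */
static inline uint64_t pmd_round_up(uint64_t addr)
{
	return (addr + (PMD_SIZE - 1)) & PMD_MASK;
}

/* Start of the 2MB region that contains addr. */
static inline uint64_t pmd_round_down(uint64_t addr)
{
	return addr & PMD_MASK;
}

/*
 * Bytes between addr and the next 2MB boundary.
 * Returns 0 when addr is already PMD aligned.
 */
static inline uint64_t pmd_tail_bytes(uint64_t addr)
{
	return pmd_round_up(addr) - addr;
}
```

With _etext at 0xffffffff82202a08, pmd_round_down() gives
0xffffffff82200000 and pmd_round_up() gives 0xffffffff82400000, which
matches the 2M range in the page_tables output, and pmd_tail_bytes()
gives 0x1fd5f8 bytes after _etext in that page. For an address that is
already PMD aligned the tail is 0, which is the case being asked about
above.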