Mike, please Cc linux-modules if you want modules folks' input as well ;)

On Tue, Mar 28, 2023 at 8:11 AM Mike Rapoport <rppt@xxxxxxxxxx> wrote:
> On Tue, Mar 28, 2023 at 09:39:37AM +0200, Michal Hocko wrote:
> > OK, so you want to reduce that direct map fragmentation?
>
> Yes.
>
> > Is that a real problem?
>
> A while ago Intel folks published a report [1] that showed better
> performance with large pages in the direct map for the majority of
> benchmarks.
>
> > My impression is that modules are mostly a static thing. BPF
> > might be a different thing though. I have a recollection that BPF
> > guys were dealing with direct map fragmentation as well.
>
> Modules are indeed static, but module_alloc() is used by anything that
> allocates code pages, e.g. kprobes, ftrace and BPF. Besides, Thomas
> mentioned that having code in 2M pages reduces iTLB pressure [2], but
> that's not only about avoiding the splits in the direct map but also
> about using large mappings in the modules address space.

It is easily overlooked why such allocations create direct map
fragmentation -- it's because of the special permission games played on
them: module_alloc() hands out memory which will eventually be made
executable, and flipping permissions on a small range forces the large
direct map pages backing it to be split (a short sketch of that pattern
follows below).

> BPF guys suggested an allocator for executable memory [3] mainly because
> they've seen performance improvement of 0.6% - 0.9% in their setups [4].

Those performance metrics were opaque and tied to some synthetic
environment; our goal is to showcase real value with reproducible
performance benchmarks. Now that Song is convinced that modules need to
be a first class citizen in order to generalize a special allocator, we
may sooner rather than later have real, reproducible performance data
to show the benefit of such a special allocator.

One of the big differences with eBPF programs is that modules *can* be
rather large in size. What is the average size of modules? Well, let's
take a look:

mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ -name \*.ko| wc -l
9173

mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ -name \*.ko| xargs stat -c "%s - %n" | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
175.1

An average of 175 KiB is pretty large.
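As an aside, here is roughly what that allocate-then-flip-permissions
pattern looks like, modeled loosely on x86 kprobes' alloc_insn_page().
This is only a sketch: details vary per architecture and kernel
version, and alloc_exec_page_sketch() is just an illustrative name:

#include <linux/moduleloader.h> /* module_alloc() */
#include <linux/set_memory.h>   /* set_vm_flush_reset_perms(), set_memory_rox() */

static void *alloc_exec_page_sketch(void)
{
	/* Writable vmalloc memory somewhere in the modules address range */
	void *page = module_alloc(PAGE_SIZE);

	if (!page)
		return NULL;

	/* Make sure permissions get reset when this is vfree()d */
	set_vm_flush_reset_perms(page);
	/*
	 * The RW -> ROX flip on a single 4 KiB page is what forces the
	 * 2M/1G direct map pages covering the backing memory to split.
	 */
	set_memory_rox((unsigned long)page, 1);

	return page;
}

Now, back to the size numbers.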
Ignoring the top 5 piggies:

mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ -name \*.ko| xargs stat -c "%s - %n" | sort -n -k 1 -r | head -5
58315248 - ./drivers/gpu/drm/amd/amdgpu/amdgpu.ko
29605592 - ./drivers/gpu/drm/i915/i915.ko
18591256 - ./drivers/gpu/drm/nouveau/nouveau.ko
16867048 - ./fs/xfs/xfs.ko
14209440 - ./fs/btrfs/btrfs.ko

mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ -name \*.ko| xargs stat -c "%s - %n" | sort -n -k 1 -r | tail -$((9173-5)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
160.54

Ignoring the top 10 largest modules:

mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ -name \*.ko| xargs stat -c "%s - %n" | sort -n -k 1 -r | tail -$((9173-10)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
154.656

Ignoring the top 20 piggies:

mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ -name \*.ko| xargs stat -c "%s - %n" | sort -n -k 1 -r | tail -$((9173-20)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
146.384

Ignoring the top 100 bloated modules:

mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ -name \*.ko| xargs stat -c "%s - %n" | sort -n -k 1 -r | tail -$((9173-100)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
117.97

Ignoring the top 1000 bloated modules:

mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ -name \*.ko| xargs stat -c "%s - %n" | sort -n -k 1 -r | tail -$((9173-1000)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
64.4686

Ignoring the top 2000 bloated modules:

mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ -name \*.ko| xargs stat -c "%s - %n" | sort -n -k 1 -r | tail -$((9173-2000)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
48.7869

Ignoring the top 3000:

mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ -name \*.ko| xargs stat -c "%s - %n" | sort -n -k 1 -r | tail -$((9173-3000)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
39.6037

Ignoring the top 4000:

mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ -name \*.ko| xargs stat -c "%s - %n" | sort -n -k 1 -r | tail -$((9173-4000)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
33.106

Ignoring the top 5000:

mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ -name \*.ko| xargs stat -c "%s - %n" | sort -n -k 1 -r | tail -$((9173-5000)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
28.0925

But at least on the driver front we know we won't always have *all* GPU
drivers loaded, just *one*. So the math needs to consider the sizes of
the modules actually used. Let's see what the average module size is on
a big system:

mcgrof@bigtwin ~ $ lsmod | grep -v Module| awk '{print $1}' | xargs sudo modinfo --field filename | xargs stat -c "%s - %n" | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
313.432

On a small desktop:

mcgrof@desktop ~ $ lsmod | grep -v Module| awk '{print $1}' | xargs sudo modinfo --field filename | xargs stat -c "%s - %n" | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
292.786

For each finit_module() we actually vmalloc twice: once for
kernel_read_*(), which lets us read the file from userspace, and then a
second set of allocations into which we copy the data over, each step
using either vzalloc() or module_alloc() (vmalloc() with special
permissions). This is more easily reflected and visible now on
linux-next with Song's new module_memory_alloc().
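To make that double allocation concrete, here is a heavily condensed
sketch of the flow behind finit_module(); the real code lives in
kernel/module/main.c, load_module_sketch() is just an illustrative
name, and virtually all of the actual work is omitted:

#include <linux/kernel_read_file.h>     /* kernel_read_file_from_fd() */
#include <linux/vmalloc.h>              /* vfree() */

static int load_module_sketch(int fd)
{
	void *image = NULL;
	ssize_t len;

	/*
	 * Allocation #1: kernel_read_file_from_fd() vmallocs a buffer
	 * big enough for the entire .ko and copies it in from userspace.
	 */
	len = kernel_read_file_from_fd(fd, 0, &image, INT_MAX, NULL,
				       READING_MODULE);
	if (len < 0)
		return len;

	/*
	 * Allocation #2: the final per-section memory. On linux-next
	 * Song's module_memory_alloc() picks vzalloc() for data types
	 * and module_alloc() for text types; each needed ELF section
	 * is then copied over from the first buffer.
	 */

	vfree(image);	/* the first buffer goes away once sections are copied */
	return 0;
}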
However -- for the context of *this* effort, we're only talking about
the executable sections, the areas we'd use module_alloc() for. On a
big system:

mcgrof@bigtwin ~ $ lsmod | grep -v Module| awk '{print $1}' | xargs sudo modinfo --field filename | xargs size --radix=10 | awk '{print $1}'| grep -v text| awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
88.7964

On a small desktop:

mcgrof@desktop ~ $ lsmod | grep -v Module| awk '{print $1}' | xargs sudo modinfo --field filename | xargs size --radix=10 | awk '{print $1}'| grep -v text| awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
92.1473

Regardless, the amount of memory allocated is pretty significant, to
the point that a system with 400 CPUs can easily run out of vmap space
(yes, the kernel does some stupid stuff, which we're correcting over
time):

https://lkml.kernel.org/r/20221013180518.217405-1-david@xxxxxxxxxx

There are also some recent efforts to reduce this pressure on vmap
space:

https://lkml.kernel.org/r/20230311051712.4095040-1-mcgrof@xxxxxxxxxx

> > > If we were to use unmapped_pages_alloc() in module_alloc(), we would
> > > have to implement the part of vmalloc() that reserves the virtual
> > > addresses and maps the allocated memory there in module_alloc().
> >
> > Another option would be to provide an allocator for the backing pages
> > to vmalloc. But I do agree that a gfp flag is a less laborious way to
> > achieve the same. So the primary question really is whether we really
> > need vmalloc support for unmapped memory.
>
> I'm not sure I follow here. module_alloc() is essentially an alias to
> vmalloc(), so to reduce direct map fragmentation caused by code
> allocations the most sensible way IMO is to support unmapped memory in
> vmalloc().

The big win from a performance point of view would also be to use huge
pages, especially to reduce iTLB pressure; however, that is *slightly*
orthogonal to your goal of reducing direct map fragmentation. We should
not ignore that strategy though, as it is a special use case which
*could* be leveraged in other ways too. And so I'd prefer to see that
over just a flag to hide those allocations: it would not only reduce
the fragmentation, but also reduce iTLB pressure, which I think *is*
where experimentation revealed the bigger gains.

> I also think vmalloc with unmapped pages can provide backing pages for
> execmem_alloc() Song proposed.

Only if you consider huge pages. I don't see that here.

  Luis
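P.S.: to make the "vmalloc support for unmapped memory" idea concrete,
here is a minimal sketch of what module_alloc() could look like if the
__GFP_UNMAPPED flag from your series were plumbed through
__vmalloc_node_range() -- purely illustrative, none of that plumbing
exists today:

/*
 * Hypothetical module_alloc(): the backing pages would come from the
 * unmapped migrate type, so allocating them never splits the direct
 * map; the permission games then only touch the modules range mapping.
 */
void *module_alloc(unsigned long size)
{
	return __vmalloc_node_range(size, MODULE_ALIGN,
				    MODULES_VADDR, MODULES_END,
				    GFP_KERNEL | __GFP_UNMAPPED,
				    PAGE_KERNEL, VM_FLUSH_RESET_PERMS,
				    NUMA_NO_NODE,
				    __builtin_return_address(0));
}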