Mike!

On Thu, Dec 01 2022 at 22:23, Mike Rapoport wrote:
> On Thu, Dec 01, 2022 at 10:08:18AM +0100, Thomas Gleixner wrote:
>> On Wed, Nov 30 2022 at 08:18, Song Liu wrote:
>> The symptom is iTLB pressure. The root cause is the way how module
>> memory is allocated, which in turn causes the fragmentation into
>> 4k PTEs. That's the same problem for anything which uses module_alloc()
>> to get space for text allocated, e.g. kprobes, tracing....
>
> There's also dTLB pressure caused by the fragmentation of the direct map.
> The memory allocated with module_alloc() is a priori mapped with 4k PTEs,
> but setting RO in the malloc address space also updates the direct map
> alias and this causes splits of large pages.
>
> It's not clear what causes more performance improvement: avoiding splits of
> large pages in the direct map or reducing iTLB pressure by backing text
> memory with 2M pages.

From our experiments when doing the first version of the SKX retbleed
mitigation, the main improvement came from reducing iTLB pressure simply
because the iTLB cache is really small.

The kernel text placement is way beyond suboptimal. If you really do a
hotpath analysis and (manually) place all hot code into one or two 2M
pages, then you can achieve massive performance improvements way above
the 10% range.

We currently have a master student investigating this, but it will take
some time until usable results materialize.

> If the major improvement comes from keeping direct map intact, it
> might be possible to mix data and text in the same 2M page.

No. That can't work.

    text = RX
    data = RW or RO

If you mix this, then you end up with RWX for the whole 2M page. Not an
option really as you lose _all_ protections in one go.

That's why I said:

>> As a logical next step we make that three blocks and allocate text,
>> data and rodata separately, which will preserve the large mappings for
>> text and data. rodata still needs to be split because we need a space to
>> accommodate ro_after_init data.

The point is that rodata and ro_after_init data are a pretty small
portion of modules as far as my limited analysis of a distro build shows.

The bulk is in text and data. So if we preserve 2M pages for text and
for RW data and bite the bullet to split one 2M page for
ro[_after_init_]data, we get the maximum benefit for the least
complexity.

>> But at the end we want an allocation mechanism which:
>>
>>   - preserves large mappings
>>   - handles a distinct address range
>>   - is mapping type aware
>>
>> That solves _all_ the issues of modules, kprobes, tracing, bpf in one
>> go. See?
>
> There is also
>
>   - handles kaslr
>
> and at least for arm and powerpc we'd also need
>
>   - handles architecture specific range restrictions and fallbacks

Good points.

kaslr should be fairly trivial.

The architecture specific restrictions and fallbacks are not really hard
to solve either. If done right then the allocator just falls back to 4k
maps during initialization in early boot which brings it back to the
status quo.

But we can provide consistent semantics for the three types which are
required for modules and the text only usage for kprobes, tracing,
bpf...

Thanks,

        tglx
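
To make the permission argument above concrete, here is a minimal
userspace sketch. It is not kernel code and not taken from any posted
patch; every name in it (map_type, large_page_perm(), alloc_typed(),
...) is hypothetical. It assumes that one 2M mapping carries a single
permission set, so the mapping needs the union of what all of its
users require, and it shows a toy "mapping type aware" allocator which
gives text, data and rodata their own 2M aligned pools so that no
large page ever has to hold mixed permissions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PMD_SIZE_2M	(2ULL << 20)	/* size of one large mapping */

/* Illustrative permission bits, not the kernel's PTE flags */
enum perm { PERM_R = 1, PERM_W = 2, PERM_X = 4 };

/* Hypothetical mapping types, one pool each */
enum map_type { MAP_TEXT, MAP_DATA, MAP_RODATA, MAP_NR_TYPES };

static const unsigned int type_perm[MAP_NR_TYPES] = {
	[MAP_TEXT]	= PERM_R | PERM_X,
	[MAP_DATA]	= PERM_R | PERM_W,
	[MAP_RODATA]	= PERM_R,
};

/*
 * A large mapping has exactly one permission set, so it must grant
 * the union of what every user placed in it needs.
 */
static unsigned int large_page_perm(const enum map_type *users, size_t n)
{
	unsigned int perm = 0;

	for (size_t i = 0; i < n; i++)
		perm |= type_perm[users[i]];
	return perm;
}

struct pool {
	uint64_t base;	/* 2M aligned start of the pool */
	uint64_t size;	/* pool size in bytes */
	uint64_t used;	/* bytes handed out so far */
};

struct type_aware_alloc {
	struct pool pools[MAP_NR_TYPES];
};

/* Carve one contiguous, 2M aligned pool per mapping type out of @base */
static void alloc_init(struct type_aware_alloc *a, uint64_t base,
		       uint64_t pages_per_type)
{
	assert(base % PMD_SIZE_2M == 0);

	for (int t = 0; t < MAP_NR_TYPES; t++) {
		a->pools[t].base = base + t * pages_per_type * PMD_SIZE_2M;
		a->pools[t].size = pages_per_type * PMD_SIZE_2M;
		a->pools[t].used = 0;
	}
}

/* Bump allocation inside the pool of @type; returns 0 when it is full */
static uint64_t alloc_typed(struct type_aware_alloc *a, enum map_type type,
			    uint64_t len)
{
	struct pool *p = &a->pools[type];
	uint64_t addr;

	len = (len + 63) & ~63ULL;	/* round up to 64 bytes */
	if (len > p->size - p->used)
		return 0;

	addr = p->base + p->used;
	p->used += len;
	return addr;
}
```

Calling large_page_perm() on { MAP_TEXT, MAP_DATA } yields R|W|X,
which is the "lose all protections" case. With alloc_typed() a text
and a data allocation never land in the same 2M page, so text stays
RX and data stays RW. A real implementation would also have to free,
handle kaslr and the architecture specific range restrictions, and
fall back to 4k mappings; none of that is modelled here.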