On Thu, Nov 3, 2022 at 2:19 PM Edgecombe, Rick P <rick.p.edgecombe@xxxxxxxxx> wrote:
>
> On Thu, 2022-11-03 at 11:59 -0700, Luis Chamberlain wrote:
> > > > Mike Rapoport had presented about the Direct map fragmentation
> > > > problem at Plumbers 2021 [0], and clearly mentioned modules /
> > > > BPF / ftrace / kprobes as possible sources for this. Then Xing
> > > > Zhengjun's 2021 performance evaluation on whether using 2M/1G
> > > > pages aggressively for the kernel direct map helps performance
> > > > [1] ends up generally recommending huge pages. The work by Xing
> > > > though was about using huge pages *alone*, not using a strategy
> > > > such as in the "bpf prog pack" to share one 2 MiB huge page for
> > > > *all* small eBPF programs, and that I think is the real golden
> > > > nugget here.
> > > >
> > > > I contend therefore that the theoretical reduction of iTLB
> > > > misses by using huge pages for "bpf prog pack" is not what gets
> > > > your systems to perform somehow better. It should be simply that
> > > > it reduces fragmentation and *this* generally can help with
> > > > performance long term. If this is accurate then let's please
> > > > separate the two aspects to this.
> > >
> > > The direct map fragmentation is the reason for higher TLB miss
> > > rate, both for iTLB and dTLB.
> >
> > OK so then whatever benchmark is running in tandem as eBPF JIT is
> > hammered should *also* be measured with perf for iTLB and dTLB.
> > i.e., the patch can provide such results as justification.
>
> Song had done some tests on the old prog pack version that to me seemed
> to indicate most (or possibly all) of the benefit was direct map
> fragmentation reduction. This surprised me, since 2MB kernel text
> has been shown to be beneficial.
>
> Otherwise +1 to all these comments. This should be clear about what the
> benefits are.
> I would add, that this is also much nicer about TLB
> shootdowns than the existing way of loading text and saves some memory.
>
> So I think there are sort of four areas of improvements:
> 1. Direct map fragmentation reduction (dTLB miss improvements). This
> sort of does it as a side effect in this series, and the solution Mike
> is talking about is a more general, probably better one.
> 2. 2MB mapped JITs. This is the iTLB side. I think this is a decent
> solution for this, but surprisingly it doesn't seem to be useful for
> JITs. (modules testing TBD)
> 3. Loading text to reused allocation with per-cpu mappings. This
> reduces TLB shootdowns, which are a short term load and teardown time
> performance drag. My understanding is this is more of a problem on
> bigger systems with many CPUs. This series does a decent job at this,
> but the solution is not compatible with modules. Maybe ok since modules
> don't load as often as JITs.
> 4. Having BPF progs share pages. This saves memory. This series could
> probably easily get a number for how much.
>

Hi Luis, Rick, and Mike,

Thanks a lot for helping me organize this information. Totally agree with
all these comments. I will add more data to the next revision.

Besides the motivation improvement, could you please also share your
comments on:
1. The logic/design of the vmalloc_exec() et al. APIs;
2. The naming of these functions. Does execmem_[alloc|free|fill|cpy]
   (as suggested by Christoph) sound good?

Thanks,
Song
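
For point 4 in Rick's list (BPF progs sharing pages), a rough
back-of-the-envelope sketch of the memory arithmetic might look like the
following. It is not taken from the series and measures nothing: the
program sizes are made up, the 64-byte chunk granularity inside a pack is
an assumption, and fragmentation inside a pack is ignored. It only compares
giving every program its own 4 KiB-granular allocation against packing
programs back to back into 2 MiB packs.

```python
PAGE_SIZE = 4096               # 4 KiB base page
PACK_SIZE = 2 * 1024 * 1024    # one 2 MiB huge page per pack
CHUNK_SIZE = 64                # assumed allocation granularity inside a pack


def round_up(n, align):
    """Round n up to the next multiple of align."""
    return (n + align - 1) // align * align


def bytes_with_private_pages(prog_sizes):
    """Each program gets its own page-granular allocation."""
    return sum(round_up(size, PAGE_SIZE) for size in prog_sizes)


def bytes_with_shared_packs(prog_sizes):
    """Programs are packed back to back into 2 MiB packs.

    Simplified: ignores holes left by freed programs and does not
    handle a single program larger than one pack.
    """
    used = sum(round_up(size, CHUNK_SIZE) for size in prog_sizes)
    return round_up(used, PACK_SIZE)


# Hypothetical workload: 1000 small programs of 500 bytes each.
sizes = [500] * 1000
private = bytes_with_private_pages(sizes)
shared = bytes_with_shared_packs(sizes)
print(f"private pages: {private} bytes")
print(f"shared packs:  {shared} bytes")
print(f"difference:    {private - shared} bytes")
```

The same arithmetic also shows the flip side: with only a handful of
programs loaded, a whole 2 MiB pack is larger than a few private 4 KiB
pages, so any real number would have to come from a measured workload.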