On Thu, 2022-11-03 at 11:59 -0700, Luis Chamberlain wrote:
> > > Mike Rapoport had presented about the Direct map fragmentation problem
> > > at Plumbers 2021 [0], and clearly mentioned modules / BPF / ftrace /
> > > kprobes as possible sources for this. Then Xing Zhengjun's 2021 performance
> > > evaluation on whether using 2M/1G pages aggressively for the kernel direct map
> > > help performance [1] ends up generally recommending huge pages. The work by Xing
> > > though was about using huge pages *alone*, not using a strategy such as in the
> > > "bpf prog pack" to share one 2 MiB huge page for *all* small eBPF programs,
> > > and that I think is the real golden nugget here.
> > >
> > > I contend therefore that the theoretical reduction of iTLB misses by using
> > > huge pages for "bpf prog pack" is not what gets your systems to perform
> > > somehow better. It should be simply that it reduces fragmentation and
> > > *this* generally can help with performance long term. If this is accurate
> > > then let's please separate the two aspects to this.
> >
> > The direct map fragmentation is the reason for higher TLB miss rate, both
> > for iTLB and dTLB.
>
> OK so then whatever benchmark is running in tandem as eBPF JIT is hammered
> should *also* be measured with perf for iTLB and dTLB. ie, the patch can
> provide such results as a justifications.

Song had done some tests on the old prog pack version that, to me,
seemed to indicate most (or possibly all) of the benefit was direct map
fragmentation reduction. This surprised me, since 2MB kernel text has
been shown to be beneficial.

Otherwise +1 to all these comments. This should be clear about what the
benefits are. I would add that this is also much nicer about TLB
shootdowns than the existing way of loading text, and that it saves
some memory.

So I think there are sort of four areas of improvement:

1. Direct map fragmentation reduction (dTLB miss improvements). This
   series sort of does it as a side effect, and the solution Mike is
   talking about is a more general, probably better one.

2. 2MB mapped JITs. This is the iTLB side. I think this is a decent
   solution for this, but surprisingly it doesn't seem to be useful for
   JITs. (modules testing TBD)

3. Loading text to reused allocation with per-cpu mappings. This
   reduces TLB shootdowns, which are a short term load and teardown
   time performance drag. My understanding is this is more of a problem
   on bigger systems with many CPUs. This series does a decent job at
   this, but the solution is not compatible with modules. Maybe ok
   since modules don't load as often as JITs.

4. Having BPF progs share pages. This saves memory. This series could
   probably easily get a number for how much.
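
For (4), a back-of-the-envelope way to frame that number is sketched
below. This is not a measurement and not what the series does. It only
compares "one page-aligned allocation per program" against "programs
packed into shared 2 MiB regions". The 4 KiB base page, the 64-byte
packing granularity, the helper names, and the program sizes are all
assumptions made for illustration.

```python
PAGE_SIZE = 4096              # assumed base page size (4 KiB)
PACK_SIZE = 2 * 1024 * 1024   # one shared 2 MiB region
CHUNK_SIZE = 64               # assumed packing granularity, hypothetical


def round_up(value, multiple):
    """Round value up to the next multiple (integer arithmetic only)."""
    return -(-value // multiple) * multiple


def separate_bytes(prog_sizes):
    """Memory used if every program gets its own page-aligned allocation."""
    return sum(round_up(size, PAGE_SIZE) for size in prog_sizes)


def packed_bytes(prog_sizes):
    """Memory used if programs are placed back to back in 2 MiB packs.

    Simplification: ignores that a program cannot straddle two packs,
    and charges for whole packs even when the last one is mostly empty.
    """
    used = sum(round_up(size, CHUNK_SIZE) for size in prog_sizes)
    return round_up(used, PACK_SIZE)


if __name__ == "__main__":
    sizes = [600] * 1000  # hypothetical: 1000 programs of 600 bytes each
    before = separate_bytes(sizes)
    after = packed_bytes(sizes)
    print(f"separate: {before} bytes, packed: {after} bytes, "
          f"saved: {before - after} bytes")
```

Note that with only a handful of small programs the packed case can use
more memory than the separate one, since a whole 2 MiB region is
charged up front. Real numbers would have to come from a system running
a representative set of programs.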