Hello,

On Wed, Nov 02, 2022 at 04:41:59PM -0700, Luis Chamberlain wrote:
... ...
> I'm under the impression that the real missed, undocumented, major value-add
> here is that the old "BPF prog pack" strategy helps to reduce the direct map
> fragmentation caused by heavy use of the eBPF JIT programs and this in
> turn helps your overall random system performance (regardless of what
> it is you do). As I see it then the eBPF prog pack is just one strategy to
> try to mitigate memory fragmentation on the direct map caused by the the eBPF
> JIT programs, so the "slow down" your team has obvserved should be due to the
> eventual fragmentation caused on the direct map *while* eBPF programs
> get heavily used.
>
> Mike Rapoport had presented about the Direct map fragmentation problem
> at Plumbers 2021 [0], and clearly mentioned modules / BPF / ftrace /
> kprobes as possible sources for this. Then Xing Zhengjun's 2021 performance
> evaluation on whether using 2M/1G pages aggressively for the kernel direct map
> help performance [1] ends up generally recommending huge pages. The work by Xing
> though was about using huge pages *alone*, not using a strategy such as in the
> "bpf prog pack" to share one 2 MiB huge page for *all* small eBPF programs,
> and that I think is the real golden nugget here.

I'm interested in how this patchset (further) improves direct map
fragmentation, so I would like to evaluate it and see whether my previous
work to merge small mappings back in the architecture layer [1] is still
necessary.

I tried to apply this patchset on v6.1-rc3/2/1 and v6.0 but all failed, so
I took one step back and evaluated the existing bpf_prog_pack instead. I'm
aware that this patchset would make things even better by using order-9
pages to back the vmalloced range.

I used the sample bpf prog samples/bpf/sockex1 because it is easy to run;
feel free to let me know of a better way to evaluate this.

- In kernels before bpf_prog_pack (v5.17 and earlier), this prog would
  cause 3 pages to change protection from RW+NX to RO+X. If the three
  pages are far apart, each would cause a level 3 split and then a level 2
  split. In reality, allocated pages tend to stay close physically, so the
  actual result will not be this bad.

- After bpf_prog_pack, loading this prog will most likely require no new
  page protection change as long as the existing pack pool has space for
  it (the common case). The actual space required for this bpf prog that
  needs special protection is 64 * 2 + 192 bytes; it would need 6144 such
  progs to use up the cache and trigger another 2MB alloc, which can
  potentially cause a direct map split. 6144 seems like a pretty large
  number to me, so I think direct map splits due to bpf are greatly
  reduced (if not totally eliminated).

Here are test results on an 8G x86_64 VM. (On x86_64, 4k is PTE mapping,
2M is PMD mapping and 1G is PUD mapping.)

- v5.17

1) right after boot

$ grep Direct /proc/meminfo
DirectMap4k:       87900 kB
DirectMap2M:     5154816 kB
DirectMap1G:     5242880 kB

2) after running 512 sockex1 instances concurrently

$ grep Direct /proc/meminfo
DirectMap4k:      462684 kB
DirectMap2M:     4780032 kB
DirectMap1G:     5242880 kB

The PUD mappings survived; some PMD mappings were split.

3) after running 1024 sockex1 instances concurrently

$ grep Direct /proc/meminfo
DirectMap4k:      884572 kB
DirectMap2M:     6455296 kB
DirectMap1G:     3145728 kB

2 PUD mappings and some PMD mappings were split.

4) after running 2048 sockex1 instances concurrently

$ grep Direct /proc/meminfo
DirectMap4k:     1654620 kB
DirectMap2M:     6733824 kB
DirectMap1G:     2097152 kB

Another PUD mapping and some PMD mappings were split. At the end, 2 PUD
mappings survived.
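For reference, each round above (and below) boils down to starting N
sockex1 instances in the background, waiting for them to finish and then
reading the DirectMap counters. The loop below is only a sketch of that
procedure, not a verbatim copy of my test script, and it assumes sockex1
was built in-tree under samples/bpf/:

$ # illustrative sketch only: adjust N and the sockex1 path to your build
$ N=512
$ for i in $(seq $N); do ./samples/bpf/sockex1 >/dev/null 2>&1 & done
$ wait
$ grep Direct /proc/meminfo

Note that reading /proc/meminfo after the instances have exited still
shows the damage: a split direct mapping is not merged back when the progs
are unloaded (which is point 1 at the end of this mail).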
- v6.1-rc3

The direct map numbers don't change for "right after boot" or "after
running 512/1024/2048 sockex1 instances". I also tried running 5120
instances, but that consumed all available memory at around 4000
instances, and even then the direct map numbers didn't change, i.e. not a
single split happened. This is understandable because, as I calculated
above, it would need 6144 such progs to cause another alloc/split. Here
are its numbers:

$ grep Direct /proc/meminfo
DirectMap4k:       22364 kB
DirectMap2M:     3123200 kB
DirectMap1G:     7340032 kB

Consider that a production system will have most of its memory consumed,
and pages allocated then can be farther apart than on a system right after
boot, causing more mapping splits. So I also tested by first consuming all
memory with page cache (by reading some sparse files) and then running the
same test. I expected v5.17 to become worse this time. Here it is:

- v5.17

1) right after boot

$ grep Direct /proc/meminfo
DirectMap4k:       94044 kB
DirectMap2M:     4100096 kB
DirectMap1G:     6291456 kB

More mappings are in PUD for this boot.

2) after running 512 sockex1 instances concurrently

$ grep Direct /proc/meminfo
DirectMap4k:      538460 kB
DirectMap2M:     7849984 kB
DirectMap1G:     2097152 kB

4 PUD mappings and some PMD mappings were split this time, more than last
time.

3) after running 1024 sockex1 instances concurrently

$ grep Direct /proc/meminfo
DirectMap4k:     1083228 kB
DirectMap2M:     7305216 kB
DirectMap1G:     2097152 kB

Some PMD mappings were split.

4) after running 2048 sockex1 instances concurrently

$ grep Direct /proc/meminfo
DirectMap4k:     2340700 kB
DirectMap2M:     6047744 kB
DirectMap1G:     2097152 kB

The end result is about the same as before.

- v6.1-rc3

There is no difference, because I can't trigger another pack alloc before
the system is OOMed.

Conclusion: I think bpf_prog_pack is very good at reducing direct map
fragmentation, and this patchset can further improve the situation on
large machines (with huge amounts of memory) or with more large bpf progs
loaded, etc.

Some imperfect things I can think of (not related to this patchset) are:

1. Once a split has happened, it stays split. This may not be a big deal
   now with bpf_prog_pack and this patchset, because the need to allocate
   a new order-9 page, and thus cause a potential split, should arise
   much, much less often.

2. When a new order-9 page has to be allocated, there is no way to tell
   the allocator to allocate it from an already-split PUD range to avoid
   splitting yet another PUD mapping.

3. As Mike and others have mentioned, there are other users that can also
   cause direct map splits.

[1]: https://lore.kernel.org/lkml/20220808145649.2261258-1-aaron.lu@xxxxxxxxx/

Regards,
Aaron