Hello,

On Wed, Nov 02, 2022 at 04:41:59PM -0700, Luis Chamberlain wrote:
... ...
> I'm under the impression that the real missed, undocumented, major value-add
> here is that the old "BPF prog pack" strategy helps to reduce the direct map
> fragmentation caused by heavy use of the eBPF JIT programs and this in
> turn helps your overall random system performance (regardless of what
> it is you do). As I see it then the eBPF prog pack is just one strategy to
> try to mitigate memory fragmentation on the direct map caused by the the eBPF
> JIT programs, so the "slow down" your team has obvserved should be due to the
> eventual fragmentation caused on the direct map *while* eBPF programs
> get heavily used.
>
> Mike Rapoport had presented about the Direct map fragmentation problem
> at Plumbers 2021 [0], and clearly mentioned modules / BPF / ftrace /
> kprobes as possible sources for this. Then Xing Zhengjun's 2021 performance
> evaluation on whether using 2M/1G pages aggressively for the kernel direct map
> help performance [1] ends up generally recommending huge pages. The work by Xing
> though was about using huge pages *alone*, not using a strategy such as in the
> "bpf prog pack" to share one 2 MiB huge page for *all* small eBPF programs,
> and that I think is the real golden nugget here.

I'm interested in how this patchset (further) improves direct map
fragmentation, so I would like to evaluate it and see whether my previous
work to merge small mappings back in the architecture layer [1] is still
necessary.

I tried to apply this patchset on v6.1-rc3/2/1 and v6.0 but all failed, so
I took one step back and evaluated the existing bpf_prog_pack instead. I'm
aware that this patchset would make things even better by using order-9
pages to back the vmalloced range.

I used the sample bpf prog samples/bpf/sockex1 because it is easy to run;
feel free to let me know of a better way to evaluate this.

- In kernels before bpf_prog_pack (v5.17 and earlier), this prog would
  cause 3 pages to change protection from RW+NX to RO+X. If the three
  pages are far apart, each would cause a level 3 split and then a level 2
  split. In reality, allocated pages tend to stay close physically, so the
  actual result will not be this bad.

- After bpf_prog_pack, loading this prog will most likely require no new
  page protection change as long as the existing pack pool has space for
  it (the common case). The actual space required for this bpf prog that
  needs special protection is 64 * 2 + 192 bytes; it would need 6144 such
  progs to use up the cache and trigger another 2MB alloc, which can
  potentially cause a direct map split. 6144 seems like a pretty large
  number to me, so I think direct map splits due to bpf are greatly
  reduced (if not totally eliminated).

Here are test results on an 8G x86_64 VM. (On x86_64, 4k is PTE mapping,
2M is PMD mapping and 1G is PUD mapping.)

- v5.17

1) right after boot

$ grep Direct /proc/meminfo
DirectMap4k:       87900 kB
DirectMap2M:     5154816 kB
DirectMap1G:     5242880 kB

2) after running 512 sockex1 instances concurrently

$ grep Direct /proc/meminfo
DirectMap4k:      462684 kB
DirectMap2M:     4780032 kB
DirectMap1G:     5242880 kB

The PUD mappings survived; some PMD mappings were split.

3) after running 1024 sockex1 instances concurrently

$ grep Direct /proc/meminfo
DirectMap4k:      884572 kB
DirectMap2M:     6455296 kB
DirectMap1G:     3145728 kB

2 PUD mappings and some PMD mappings were split.

4) after running 2048 sockex1 instances concurrently

$ grep Direct /proc/meminfo
DirectMap4k:     1654620 kB
DirectMap2M:     6733824 kB
DirectMap1G:     2097152 kB

Another PUD mapping and some PMD mappings were split. At the end, 2 PUD
mappings survived.
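For reference, each round above (and below) boils down to starting N
sockex1 instances in the background, waiting for them to finish and then
reading the DirectMap counters. The loop below is only a sketch of that
procedure, not a verbatim copy of my test script, and it assumes sockex1
was built in-tree under samples/bpf/:

$ # illustrative sketch only: adjust N and the sockex1 path to your build
$ N=512
$ for i in $(seq $N); do ./samples/bpf/sockex1 >/dev/null 2>&1 & done
$ wait
$ grep Direct /proc/meminfo

Note that reading /proc/meminfo after the instances have exited still
shows the damage: a split direct mapping is not merged back when the progs
are unloaded (which is point 1 at the end of this mail).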
- v6.1-rc3

The direct map numbers don't change for "right after boot" or "after
running 512/1024/2048 sockex1 instances". I also tried running 5120
instances, but that consumed all available memory at around 4000
instances, and even then the direct map numbers didn't change, i.e. not a
single split happened. This is understandable because, as I calculated
above, it would need 6144 such progs to cause another alloc/split. Here
are its numbers:

$ grep Direct /proc/meminfo
DirectMap4k:       22364 kB
DirectMap2M:     3123200 kB
DirectMap1G:     7340032 kB

Consider that a production system will have most of its memory consumed,
and pages allocated then can be farther apart than on a system right after
boot, causing more mapping splits. So I also tested by first consuming all
memory with page cache (by reading some sparse files) and then running the
same test. I expected v5.17 to become worse this time. Here it is:

- v5.17

1) right after boot

$ grep Direct /proc/meminfo
DirectMap4k:       94044 kB
DirectMap2M:     4100096 kB
DirectMap1G:     6291456 kB

More mappings are in PUD for this boot.

2) after running 512 sockex1 instances concurrently

$ grep Direct /proc/meminfo
DirectMap4k:      538460 kB
DirectMap2M:     7849984 kB
DirectMap1G:     2097152 kB

4 PUD mappings and some PMD mappings were split this time, more than last
time.

3) after running 1024 sockex1 instances concurrently

$ grep Direct /proc/meminfo
DirectMap4k:     1083228 kB
DirectMap2M:     7305216 kB
DirectMap1G:     2097152 kB

Some PMD mappings were split.

4) after running 2048 sockex1 instances concurrently

$ grep Direct /proc/meminfo
DirectMap4k:     2340700 kB
DirectMap2M:     6047744 kB
DirectMap1G:     2097152 kB

The end result is about the same as before.

- v6.1-rc3

There is no difference, because I can't trigger another pack alloc before
the system is OOMed.

Conclusion: I think bpf_prog_pack is very good at reducing direct map
fragmentation, and this patchset can further improve the situation on
large machines (with huge amounts of memory) or with more large bpf progs
loaded, etc.

Some imperfect things I can think of (not related to this patchset) are:

1. Once a split has happened, it stays split. This may not be a big deal
   now with bpf_prog_pack and this patchset, because the need to allocate
   a new order-9 page, and thus cause a potential split, should arise
   much, much less often.

2. When a new order-9 page has to be allocated, there is no way to tell
   the allocator to allocate it from an already-split PUD range to avoid
   splitting yet another PUD mapping.

3. As Mike and others have mentioned, there are other users that can also
   cause direct map splits.

[1]: https://lore.kernel.org/lkml/20220808145649.2261258-1-aaron.lu@xxxxxxxxx/

Regards,
Aaron