After switching to memcg-based bpf memory accounting, bpf memory is
charged to the loader's memcg by default, which causes unexpected issues
for us. For instance, the loader's container may be restarted after
pinning progs and maps, but the bpf memcg will be left behind and stay
pinned on the system. Once the loader's new generation container is
started, the leftover pages won't be charged to it. That inconsistent
behavior makes memory resource management for this container difficult.

In the past few days, I have proposed two patchsets[1][2] to try to
resolve this issue, but both proposals require the user code to be
changed to adapt to them, which is a pain for us. This patchset relieves
the pain by triggering the recharge in libbpf. It also addresses Roman's
critical comments.

The key reason we can avoid changing the user code is that there's a
reuse path in libbpf. Once the bpf container is restarted, it will try
to re-run the required bpf programs. If the bpf programs are the same as
the already pinned ones, it will reuse them.

To make sure we either recharge all of them successfully or don't
recharge any of them, the recharge process is divided into three steps:

  - Pre charge to the new generation
    To make sure that once we uncharge from the old generation, we can
    always charge to the new generation successfully. If we can't pre
    charge to the new generation, we won't allow it to be uncharged from
    the old generation.
  - Uncharge from the old generation
    After pre charge to the new generation, we can uncharge from the old
    generation.
  - Post charge to the new generation
    Finally we can set pages' memcg_data to the new generation.

In the pre charge step, we may succeed in charging some addresses but
then fail to charge a later one. In that case we should uncharge the
already charged addresses, so another recharge-err step is introduced.

This patchset implements recharging for the bpf hash map, which is the
map type most used by our bpf services.
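
To illustrate the all-or-nothing recharge described above, here is a
minimal user-space sketch in C. It is not the kernel code from this
patchset: struct memcg_model, struct alloc_model, try_charge(),
uncharge() and recharge_all() are hypothetical names, and a memcg is
modelled as nothing more than a page counter with a hard limit. The
sketch only shows the ordering of the pre charge, uncharge, post charge
and recharge-err steps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a memcg: a page counter with a hard limit. */
struct memcg_model {
	long charged;	/* pages currently charged */
	long limit;	/* maximum pages that may be charged */
};

/* Hypothetical model of one allocation (address) owned by a bpf map. */
struct alloc_model {
	long pages;
	struct memcg_model *owner;	/* stands in for the pages' memcg_data */
};

static bool try_charge(struct memcg_model *cg, long pages)
{
	if (cg->charged + pages > cg->limit)
		return false;
	cg->charged += pages;
	return true;
}

static void uncharge(struct memcg_model *cg, long pages)
{
	assert(cg->charged >= pages);
	cg->charged -= pages;
}

/*
 * Move every allocation to new_cg, or leave everything untouched.
 * Returns 0 on success, -1 if new_cg cannot take all of the pages.
 */
static int recharge_all(struct alloc_model *allocs, size_t n,
			struct memcg_model *new_cg)
{
	size_t i;

	/* Step 1: pre charge to the new generation. */
	for (i = 0; i < n; i++) {
		if (!try_charge(new_cg, allocs[i].pages))
			goto err;
	}

	/* Step 2: uncharge from the old generation. */
	for (i = 0; i < n; i++)
		uncharge(allocs[i].owner, allocs[i].pages);

	/* Step 3: post charge, i.e. switch ownership to the new generation. */
	for (i = 0; i < n; i++)
		allocs[i].owner = new_cg;

	return 0;

err:
	/* recharge-err: undo the pre charges that already succeeded. */
	while (i-- > 0)
		uncharge(new_cg, allocs[i].pages);
	return -1;
}
```

In this model, a failed recharge_all() leaves both counters and every
owner pointer exactly as they were, which is the property the three
steps are meant to guarantee.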
The other map types haven't been implemented yet. The bpf progs haven't
been implemented either.

The prev generation and the new generation may have the same parent.
That case can be optimized in the future.

In the discussion with Roman on the previous two proposals, he also
mentioned that leftover page caches have a similar issue. There are key
differences between leftover page caches and leftover bpf programs:

  - The leftover page caches may not be reused again
    Once a container has exited, it may be deployed on another host next
    time for better resource management. That's why we fix leftover page
    caches by _trying_ to drop all of a container's page caches when it
    is exiting. The bpf container, however, will always be deployed on
    the same host next time, and that's why bpf programs are pinned.
  - The leftover page caches can be reclaimed, but bpf memory can't
    It means the leftover page caches can be accepted while the leftover
    bpf memory can't.

Regardless of these differences, we can also extend this method to
recharge leftover page caches if we need it. For example, when we
'reuse' a leftover inode, we recharge all its page caches to the new
generation.
But unfortunately there's no such clear reuse path in the page cache
layer, so we must build a reuse path for it first:

    page cache's reuse path(X)          bpf's reuse path
              |                                |
      ------------------                 -------------
      | page cache layer|                | bpf layer |
      ------------------                 -------------
               \                              /
  page cache's recharge handler(X)   bpf's recharge handler
                 \                          /
           ------------------------------------
           |           Memcg layer            |
           |----------------------------------|

[1] https://lwn.net/Articles/887180/
[2] https://lwn.net/Articles/888549/

Yafang Shao (10):
  mm, memcg: Add a new helper memcg_should_recharge()
  bpftool: Show memcg info of bpf map
  mm, memcg: Add new helper obj_cgroup_from_current()
  mm, memcg: Make obj_cgroup_{charge, uncharge}_pages public
  mm: Add helper to recharge kmalloc'ed address
  mm: Add helper to recharge vmalloc'ed address
  mm: Add helper to recharge percpu address
  bpf: Recharge memory when reuse bpf map
  bpf: Make bpf_map_{save, release}_memcg public
  bpf: Support recharge for hash map

 include/linux/bpf.h            |  23 ++++++
 include/linux/memcontrol.h     |  22 ++++++
 include/linux/percpu.h         |   1 +
 include/linux/slab.h           |  18 +++++
 include/linux/vmalloc.h        |   2 +
 include/uapi/linux/bpf.h       |   4 +-
 kernel/bpf/hashtab.c           |  74 +++++++++++++++++++
 kernel/bpf/syscall.c           |  40 ++++++-----
 mm/memcontrol.c                |  35 +++++++--
 mm/percpu.c                    |  98 ++++++++++++++++++++++++++
 mm/slab.c                      |  85 ++++++++++++++++++++++
 mm/slob.c                      |   7 ++
 mm/slub.c                      | 125 +++++++++++++++++++++++++++++++++
 mm/util.c                      |   9 +++
 mm/vmalloc.c                   |  87 +++++++++++++++++++++++
 tools/bpf/bpftool/map.c        |   2 +
 tools/include/uapi/linux/bpf.h |   4 +-
 tools/lib/bpf/libbpf.c         |   2 +-
 18 files changed, 609 insertions(+), 29 deletions(-)

--
2.17.1