Re: [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Jun 22, 2022 at 7:28 AM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Sun, Jun 19, 2022 at 03:50:22PM +0000, Yafang Shao wrote:
> > After switching to memcg-based bpf memory accounting, the bpf memory is
> > charged to the loader's memcg by default, that causes unexpected issues for
> > us. For instance, the container of the loader may be restarted after
> > pinning progs and maps, but the bpf memcg will be left and pinned on the
> > system. Once the loader's new generation container is started, the leftover
> > pages won't be charged to it. That inconsistent behavior will make trouble
> > for the memory resource management for this container.
> >
> > In the past few days, I have proposed two patchsets[1][2] to try to resolve
> > this issue, but in both of these two proposals the user code has to be
> > changed to adapt to it, that is a pain for us. This patchset relieves the
> > pain by triggering the recharge in libbpf. It also addresses Roman's
> > critical comments.
> >
> > The key point we can avoid changing the user code is that there's a resue
> > path in libbpf. Once the bpf container is restarted again, it will try
> > to re-run the required bpf programs, if the bpf programs are the same with
> > the already pinned one, it will reuse them.
> >
> > To make sure we either recharge all of them successfully or don't recharge
> > any of them. The recharge prograss is divided into three steps:
> >   - Pre charge to the new generation
> >     To make sure once we uncharge from the old generation, we can always
> >     charge to the new generation succeesfully. If we can't pre charge to
> >     the new generation, we won't allow it to be uncharged from the old
> >     generation.
> >   - Uncharge from the old generation
> >     After pre charge to the new generation, we can uncharge from the old
> >     generation.
> >   - Post charge to the new generation
> >     Finnaly we can set pages' memcg_data to the new generation.
> > In the pre charge step, we may succeed to charge some addresses, but fail
> > to charge a new address, then we should uncharge the already charged
> > addresses, so another recharge-err step is instroduced.
> >
> > This pachset has finished recharging bpf hash map. which is mostly used
> > by our bpf services. The other maps hasn't been implemented yet. The bpf
> > progs hasn't been implemented neither.
>
> ... and the implementation in patch 10 is missing recharge of htab->elems
> which consumes the most memory.

You got to the point. I intended to skip htab->elemes due to some reasons.
You have pointed out one of the reasons that it is complex for
non-preallocated memory.
Another reason is memcg reparenting, for example, once the memcg is
offline, its pages and other related memory things like obj_cgroup
will be reparented, but unlikely the map->memcg is still the offline
memcg. There _must_ be issues in it, but I haven't figured out what it
may cause to the user. One issue I can imagine is that the memcg limit
may not work well in this case.  That should be another different
issue, and I'm still working on it.

> That begs the question how the whole set was tested.

In my test case, htab->buckets is around 120MB, which can be used to
do the comparison.  The number is charged back without any kernel
warnings, so I posted this incomplete patchset to get feedback to
check if I'm in the right direction.

>
> Even if that bug is fixed this recharge approach works only with preallocated
> maps. Their use will be reduced in the future.
> Maps with kmalloc won't work with this multi step approach.
> There is no place where bpf_map_release_memcg can be done without racing
> with concurrent kmallocs from bpf program side.

Right, that is really an issue.

>
> Despite being painful for user space the user space has to deal with it.
> It created a container, charged its memcg, then destroyed the container,
> but didn't free the bpf map. It's a user bug. It has to free the map.

It seems that I didn't describe this case clearly.
It is not a user bug but a user case in the k8s environment. For
example, after the user has deployed its container, it may need to
upgrade its user application or bpf prog, then the user should upgrade
its container, that means the old container will be destroyed and a
new container will be created.  In this upgrade progress, the bpf
program can continue to work without interruption because they are
pinned.

> The user space can use map-in-map solution. In the new container the new bpf map
> can be allocated (and charged to new memcg), the data copied from old map,
> and then inner map replaced. At this point old map can be freed and memcg
> uncharged.

The map-in-map solution may work for some cases, but it will allocate
more memory, which is not okay if the bpf map has lots of memory.

> To make things easier we can consider introducing an anon FD that points to a memcg.
> Then user can pick a memcg at map creation time instead of get_mem_cgroup_from_current.

This may be a workable solution.  The life cycle of a pinned bpf
program is independent of the application which loads an pins it, so
it is reasonable to introduce an independent memcg to manage the bpf
memory, for example a memcg like /sys/fs/cgroup/memory/pinned_bpf
which is independent of the k8s pod.
I will analyze it. Thanks for the suggestion.

-- 
Regards
Yafang




[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux