On Wed, Mar 23, 2022 at 3:10 AM Roman Gushchin <roman.gushchin@xxxxxxxxx> wrote:
>
> On Wed, Mar 23, 2022 at 12:10:34AM +0800, Yafang Shao wrote:
> > On Tue, Mar 22, 2022 at 6:52 AM Roman Gushchin <roman.gushchin@xxxxxxxxx> wrote:
> > >
> > > Hello, Yafang!
> > >
> > > Thank you for continuing to work on this!
> > >
> > > On Sat, Mar 19, 2022 at 05:30:22PM +0000, Yafang Shao wrote:
> > > > After switching to memcg-based bpf memory accounting, bpf memory is
> > > > charged to the loader's memcg by default, which causes unexpected
> > > > issues for us. For instance, the container of the loader - which
> > > > loads the bpf programs and pins them on bpffs - may restart after
> > > > pinning the progs and maps. After the restart, the pinned progs and
> > > > maps no longer belong to the new container; they actually belong to
> > > > an offline memcg left by the previous generation. That inconsistent
> > > > behavior makes memory resource management for this container
> > > > troublesome.
> > >
> > > I'm looking at this text and increasingly feeling that it's not a
> > > bpf-specific problem and it shouldn't be solved as one.
> > >
> >
> > I'm not sure whether it is a common issue or not, but I am sure bpf has
> > a special attribute that we should handle specifically. I can show you
> > an example of why bpf is special.
> >
> > A pinned bpf object is similar to a kernel module, right?
> > But this issue can't happen with a kernel module; it can only happen
> > with bpf. The reason is that a kernel module can choose whether or not
> > to account the memory it allocates, e.g.
> > - Account
> >       kmalloc(size, GFP_KERNEL | __GFP_ACCOUNT);
> > - Don't account
> >       kmalloc(size, GFP_KERNEL);
> >
> > bpf has no such choice, because GFP_KERNEL is a kernel-internal API
> > that is not exposed to the user.
>
> But if your process opens a file, creates a pipe, etc., there are also
> kernel allocations happening, and the process has no control over whether
> these allocations are accounted or not. The same applies to anonymous
> memory and the pagecache as well, so it's not even kmem-specific.
>

So what is the real problem in practice? Does anyone complain about it?
[At least, there's no behavior change in these areas.]

> > Then the issue is exposed when the memcg-based accounting is forcibly
> > enabled for all bpf programs. That is a behavior change, and
> > unfortunately we don't give the user a chance to keep the old behavior
> > unless they stop using memcg...
> >
> > That is not to say memcg-based accounting is bad; it is really useful,
> > but it needs to be improved. We can't expose GFP_ACCOUNT to the user,
> > but we can expose a wrapper around GFP_ACCOUNT to the user. That's
> > what this patchset does, as we have always done in bpf.
> >
> > > Is there any significant reason why the loader can't temporarily
> > > enter the root cgroup before creating bpf maps/progs? I know it will
> > > require some changes in the userspace code, but so do new bpf flags.
> > >
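(For concreteness, the workaround being suggested would look roughly like
the sketch below. It assumes cgroup v2 is mounted at /sys/fs/cgroup and
that the loader is privileged enough to write to the root cgroup.procs;
error handling is trimmed and move_self_to() is only an illustrative
helper, not an existing API.)

  #include <stdio.h>

  /* Move the calling process into the given cgroup directory. */
  static int move_self_to(const char *cgroup_dir)
  {
          char path[256];
          FILE *f;

          snprintf(path, sizeof(path), "%s/cgroup.procs", cgroup_dir);
          f = fopen(path, "w");
          if (!f)
                  return -1;
          fprintf(f, "0\n");      /* PID 0 means "the writing process" */
          fclose(f);
          return 0;
  }

  static void load_and_pin_bpf_objects(const char *orig_cgroup_dir)
  {
          /* From here on, kernel allocations are charged to the root memcg. */
          move_self_to("/sys/fs/cgroup");

          /* ... create bpf maps, load and pin progs ... */

          /* Go back to the original cgroup, read earlier from /proc/self/cgroup. */
          move_self_to(orig_cgroup_dir);
  }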
> > In our k8s environment, the user agents are deployed as a DaemonSet[1].
> > It would cause more trouble to temporarily enter the root cgroup before
> > creating bpf maps/progs; for example, we would have to change the way we
> > deploy the user agents, and that would be a big project.
>
> I understand; however, introducing new kernel interfaces to overcome such
> things has its own downside: every introduced interface will stay pretty
> much forever and we'll _have_ to support it. Kernel interfaces have a very
> long life cycle, we have to admit it.
>
> The problem you're describing - inconsistencies in the accounting of
> shared regions of memory - is a generic cgroup problem, which has a
> configuration solution: the resource accounting and control should be
> performed at a stable level, and the actual workloads can be (re)started
> in sub-cgroups with optionally disabled physical controllers.
> E.g.:
>   /
>   workload2.slice  workload1.slice   <- accounting should be performed here
>     workload_gen1.scope  workload_gen2.scope ...
>

I think we talked about it several days earlier.

> > [1]. https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/
> >
> > > > The reason why these progs and maps have to be persistent across
> > > > multiple generations is that they are also used by other processes
> > > > which are not in this container. IOW, they can't be removed when this
> > > > container is restarted. Take a specific example: the bpf program for
> > > > a clsact qdisc is loaded by an agent running in a container, and the
> > > > agent not only loads the bpf program but also processes the data
> > > > generated by this program and does some other maintenance work.
> > > >
> > > > In order to keep the charging behavior consistent, we used to
> > > > consider a way to recharge these pinned maps and progs again after
> > > > the container is restarted, but after the discussion[1] with Roman,
> > > > we decided to go in another direction: don't charge them to the
> > > > container in the first place. TL;DR of the mentioned discussion:
> > > > recharging is not a generic solution and it may carry too much risk.
> > > >
> > > > This patchset is the no-charge solution. Two flags are introduced in
> > > > union bpf_attr, one for bpf maps and another for bpf progs. A user
> > > > who doesn't want to charge the current memcg can use these two
> > > > flags. These two flags are only permitted for the sys admin, as the
> > > > memory will be accounted to the root memcg only.
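(A rough sketch of what using such a map flag could look like from
userspace. The flag name and value below are illustrative placeholders;
they aren't spelled out above, and the actual patches may differ.)

  #include <linux/bpf.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/syscall.h>

  /* Placeholder definition for illustration only; not a mainline flag. */
  #ifndef BPF_F_NO_CHARGE
  #define BPF_F_NO_CHARGE (1U << 14)
  #endif

  static int create_map_no_charge(void)
  {
          union bpf_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.map_type    = BPF_MAP_TYPE_ARRAY;
          attr.key_size    = sizeof(__u32);
          attr.value_size  = sizeof(__u64);
          attr.max_entries = 1024;
          /* Per the cover letter: sys admin only, memory goes to the root memcg. */
          attr.map_flags   = BPF_F_NO_CHARGE;

          return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
  }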
> > >
> > > If we're going with bpf-specific flags (which I'd prefer not to),
> > > let's define them as the way to create system-wide bpf objects which
> > > are expected to outlive the original cgroup. With the expectation
> > > that they will be treated as belonging to the root cgroup by any sort
> > > of existing or future resource accounting (e.g. if we'll start
> > > accounting the CPU used by bpf programs).
> > >
> >
> > Now that we're talking about the CPU resource, I have some more
> > complaints: the cpu cgroup really does better than the memory cgroup.
> > Below is the detail of why the cpu cgroup does a good job:
> >
> > - CPU
> >       Task Cgroup
> >       Code     CPU time is accounted to the one who is executING this code
> >
> > - Memory
> >       Memory Cgroup
> >       Data     Memory usage is accounted to the one who allocatED this data
> >
> > Do you see the difference?
>
> Well, RAM is a vastly different thing than CPU :)
> They have different physical properties and corresponding accounting
> limitations.
>
> > The difference is that CPU time is accounted to the one who is using it
> > (which is reasonable), while memory usage is accounted to the one who
> > allocated it (which is unreasonable). If we split the data/code into
> > private and shared, we can see why it is unreasonable.
> >
> >   Memory Cgroup
> >   Private Data    Private and thus accounted to one single memcg, good.
> >   Shared Data     Shared but accounted to one single memcg, bad.
> >
> >   Task Cgroup
> >   Private Code    Private and thus accounted to one single task group, good.
> >   Shared Code     Shared and thus accounted to all the task groups, good.
> >
> > Pages are accounted when they are allocated rather than when they are
> > used; that is why so many ridiculous things happen. But it is common
> > sense that we can't dynamically charge a page to the process that is
> > accessing it, right? So we have to handle the issues caused by shared
> > pages case by case.
>
> The accounting of shared regions of memory is complex because of two
> physical limitations:
>
> 1) The amount of (meta)data which can be used to track ownership. We
> expect the memory overhead to be small in comparison to the accounted
> data. If a page is used by many cgroups, even saving a single pointer to
> each cgroup can take a lot of space. It's even worse for slab objects. At
> some point it stops making sense: if the accounting takes more memory
> than the accounted memory, it's better not to account at all.
>
> 2) CPU overhead: tracking memory usage beyond the initial allocation adds
> overhead to some very hot paths. Imagine two processes mapping the same
> file: the first process faults in the whole file and the second just uses
> the pagecache. Currently this is very efficient. Making the second
> process change the ownership information on each minor page fault would
> lead to a performance regression. Think of the libc binary as this file.
>
> That said, I'm not saying it can't be done better than it is now. But
> it's a complex problem.
>

Memcg-based accounting is also complex, but it introduces a user-visible
behavior change now :(

--
Thanks
Yafang