Re: [PATCH 00/14] bpf: Allow not to charge bpf memory

Yafang Shao <laoar.shao@xxxxxxxxx> · Wed, 23 Mar 2022 00:10:34 +0800

On Tue, Mar 22, 2022 at 6:52 AM Roman Gushchin <roman.gushchin@xxxxxxxxx> wrote:
>
> Hello, Yafang!
>
> Thank you for continuing working on this!
>
> On Sat, Mar 19, 2022 at 05:30:22PM +0000, Yafang Shao wrote:
> > After switching to memcg-based bpf memory accounting, the bpf memory is
> > charged to the loader's memcg by defaut, that causes unexpected issues for
> > us. For instance, the container of the loader-which loads the bpf programs
> > and pins them on bpffs-may restart after pinning the progs and maps. After
> > the restart, the pinned progs and maps won't belong to the new container
> > any more, while they actually belong to an offline memcg left by the
> > previous generation. That inconsistent behavior will make trouble for the
> > memory resource management for this container.
>
> I'm looking at this text and increasingly feeling that it's not a bpf-specific
> problem and it shouldn't be solved as one.
>

I'm not sure whether it is a common issue or not, but I'm sure bpf has
its special attribute that we should handle it specifically.  I can
show you an example on why bpf is a special one.

The pinned bpf is similar to a kernel module, right?
But that issue can't happen in a kernel module, while it can happen in
bpf only.  The reason is that the kernel module has the choice on
whether account the allocated memory or not, e.g.
    - Account
      kmalloc(size,  GFP_KERNEL | __GFP_ACCOUNT);
   - Not Account
      kmalloc(size, GFP_KERNEL);

While the bpf has no choice because the GFP_KERNEL is a KAPI which is
not exposed to the user.

Then the issue is exposed when the memcg-based accounting is
forcefully enabled to all bpf programs. That is a behavior change,
while unfortunately we don't give the user a chance to keep the old
behavior unless they don't use memcg....

But that is not to say the memcg-based accounting is bad, while it is
really useful, but it needs to be improved. We can't expose
GFP_ACCOUNT to the user, but we can expose a wrapper of GFP_ACCOUNT to
the user, that's what this patchset did, like what we always have done
in bpf.

> Is there any significant reason why the loader can't temporarily enter the
> root cgroup before creating bpf maps/progs? I know it will require some changes
> in the userspace code, but so do new bpf flags.
>

On our k8s environment, the user agents should be deployed in a
Daemonset[1].  It will make more trouble to temporarily enter the root
group before creating bpf maps/progs, for example we must change the
way we used to deploy user agents, that will be a big project.

[1]. https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/

> >
> > The reason why these progs and maps have to be persistent across multiple
> > generations is that these progs and maps are also used by other processes
> > which are not in this container. IOW, they can't be removed when this
> > container is restarted. Take a specific example, bpf program for clsact
> > qdisc is loaded by a agent running in a container, which not only loads
> > bpf program but also processes the data generated by this program and do
> > some other maintainace things.
> >
> > In order to keep the charging behavior consistent, we used to consider a
> > way to recharge these pinned maps and progs again after the container is
> > restarted, but after the discussion[1] with Roman, we decided to go
> > another direction that don't charge them to the container in the first
> > place. TL;DR about the mentioned disccussion: recharging is not a generic
> > solution and it may take too much risk.
> >
> > This patchset is the solution of no charge. Two flags are introduced in
> > union bpf_attr, one for bpf map and another for bpf prog. The user who
> > doesn't want to charge to current memcg can use these two flags. These two
> > flags are only permitted for sys admin as these memory will be accounted to
> > the root memcg only.
>
> If we're going with bpf-specific flags (which I'd prefer not to), let's
> define them as the way to create system-wide bpf objects which are expected
> to outlive the original cgroup. With expectations that they will be treated
> as belonging to the root cgroup by any sort of existing or future resource
> accounting (e.g. if we'll start accounting CPU used by bpf prgrams).
>

Now that talking about the cpu resource, I have some more complaints
that cpu cgroup does really better than memory cgroup. Below is the
detailed information why cpu cgroup does a good job,

   - CPU
                        Task Cgroup
      Code          CPU time is accounted to the one who is executeING
 this code

   - Memory
                         Memory Cgroup
      Data           Memory usage is accounted to the one who
allocatED this data.

Have you found the difference?
The difference is that, cpu time is accounted to the one who is using
it (that is reasonable), while memory usage is accounted to the one
who allocated it (that is unreasonable). If we split the Data/Code
into private and shared, we can find why it is unreasonable.

                                Memory Cgroup
Private Data           Private and thus accounted to one single memcg, good.
Shared Data           Shared but accounted to one single memcg, bad.

                                Task Cgroup
Private Code          Private and thus accounted to one single task group, good.
Shared Code          Shared and thus accounted to all the task groups, good.

The pages are accounted when they are allocated rather than when they
are used, that is why so many ridiculous things happen.   But we have
a common sense that we can’t dynamically charge the page to the
process who is accessing it, right?  So we have to handle the issues
caused by shared pages case by case.

> But then again: why just not create them in the root cgroup?
>

As I explained above, that is limited by the deployment.

-- 
Thanks
Yafang