Re: [PATCH RFC 0/9] bpf, mm: recharge bpf memory from offline memcg

Yafang Shao <laoar.shao@xxxxxxxxx> · Fri, 11 Mar 2022 20:48:27 +0800

On Fri, Mar 11, 2022 at 2:00 AM Roman Gushchin <roman.gushchin@xxxxxxxxx> wrote:
>
> On Thu, Mar 10, 2022 at 09:20:54PM +0800, Yafang Shao wrote:
> > On Thu, Mar 10, 2022 at 7:35 AM Roman Gushchin <roman.gushchin@xxxxxxxxx> wrote:
> > >
> > > On Wed, Mar 09, 2022 at 09:28:58PM +0800, Yafang Shao wrote:
> > > > On Wed, Mar 9, 2022 at 9:09 AM Roman Gushchin <roman.gushchin@xxxxxxxxx> wrote:
> > > > >
> > > > > On Tue, Mar 08, 2022 at 01:10:47PM +0000, Yafang Shao wrote:
> > > > > > When we use memcg to limit the containers which load bpf progs and maps,
> > > > > > we find there is an issue that the lifecycle of container and bpf are not
> > > > > > always the same, because we may pin the maps and progs while update the
> > > > > > container only. So once the container which has alreay pinned progs and
> > > > > > maps is restarted, the pinned progs and maps are no longer charged to it
> > > > > > any more. In other words, this kind of container can steal memory from the
> > > > > > host, that is not expected by us. This patchset means to resolve this
> > > > > > issue.
> > > > > >
> > > > > > After the container is restarted, the old memcg which is charged by the
> > > > > > pinned progs and maps will be offline but won't be freed until all of the
> > > > > > related maps and progs are freed. If we want to charge these bpf memory to
> > > > > > the new started memcg, we should uncharge them from the offline memcg first
> > > > > > and then charge it to the new one. As we have already known how the bpf
> > > > > > memroy is allocated and freed, we can also know how to charge and uncharge
> > > > > > it. This pathset implements various charge and uncharge methords for these
> > > > > > memory.
> > > > > >
> > > > > > Regarding how to do the recharge, we decide to implement new bpf syscalls
> > > > > > to do it. With the new implemented bpf syscall, the agent running in the
> > > > > > container can use it to do the recharge. As of now we only implement it for
> > > > > > the bpf hash maps. Below is a simple example how to do the recharge,
> > > > > >
> > > > > > ====
> > > > > > int main(int argc, char *argv[])
> > > > > > {
> > > > > >       union bpf_attr attr = {};
> > > > > >       int map_id;
> > > > > >       int pfd;
> > > > > >
> > > > > >       if (argc < 2) {
> > > > > >               printf("Pls. give a map id \n");
> > > > > >               exit(-1);
> > > > > >       }
> > > > > >
> > > > > >       map_id = atoi(argv[1]);
> > > > > >       attr.map_id = map_id;
> > > > > >       pfd = syscall(SYS_bpf, BPF_MAP_RECHARGE, &attr, sizeof(attr));
> > > > > >       if (pfd < 0)
> > > > > >               perror("BPF_MAP_RECHARGE");
> > > > > >
> > > > > >       return 0;
> > > > > > }
> > > > > >
> > > > > > ====
> > > > > >
> > > > > > Patch #1 and #2 is for the observability, with which we can easily check
> > > > > > whether the bpf maps is charged to a memcg and whether the memcg is offline.
> > > > > > Patch #3, #4 and #5 is for the charge and uncharge methord for vmalloc-ed,
> > > > > > kmalloc-ed and percpu memory.
> > > > > > Patch #6~#9 implements the recharge of bpf hash map, which is mostly used
> > > > > > by our bpf services. The other maps hasn't been implemented yet. The bpf progs
> > > > > > hasn't been implemented neither.
> > > > > >
> > > > > > This pathset is still a POC now, with limited testing. Any feedback is
> > > > > > welcomed.
> > > > >
> > > > > Hello Yafang!
> > > > >
> > > > > It's an interesting topic, which goes well beyond bpf. In general, on cgroup
> > > > > offlining we either do nothing either recharge pages to the parent cgroup
> > > > > (latter is preferred), which helps to release the pinned memcg structure.
> > > > >
> > > >
> > > > We have thought about recharging pages to the parent cgroup (the root
> > > > memcg in our case),
> > > > but it can't resolve our issue.
> > > > Releasing the pinned memcg struct is the benefit of recharging pages
> > > > to the parent,
> > > > but as there won't be too many memcgs pinned by bpf, so it may not be worth it.
> > >
> > > I agree, that was my thinking too.
> > >
> > > >
> > > >
> > > > > Your approach raises some questions:
> > > >
> > > > Nice questions.
> > > >
> > > > > 1) what if the new cgroup is not large enough to contain the bpf map?
> > > >
> > > > The recharge is supposed to be triggered at the container start time.
> > > > After the container is started, the agent which will load the bpf
> > > > programs will do it as follows,
> > > > 1. Check if the bpf program has already been loaded,
> > > >     if not,  goto 5.
> > > > 2. Check if the bpf program will pin maps or progs,
> > > >     if not, goto 6.
> > > > 3. Check if the pinned maps and progs are charged to an offline memcg,
> > > >     if not, goto 6.
> > > > 4. Recharge the pinned maps or progs to the current memcg.
> > > >    goto 6.
> > > > 5. load new bpf program, and also pinned maps and progs if desired.
> > > > 6. End.
> > > >
> > > > If the recharge fails, it means that the memcg limit is too low, we
> > > > should reconsider
> > > > the limit of the container.
> > > >
> > > > Regarding other cases that it may do the recharge in the runtime, I
> > > > think the failure is
> > > > a common OOM case, that means the usage in this container is out of memory, we
> > > > should kill something.
> > >
> > > The problem here is that even invoking the oom killer might not help here,
> > > if the size of the bpf map is larger than memory.max.
> > >
> >
> > Then we should introduce a fallback.
>
> Can you, please, elaborate a bit more?
>
> >
> > > Also because recharging of a large object might take time and it's happening
> > > simultaneously with other processes in the system (e.g. memory allocations,
> > > cgroup limit changes, etc), potentially we might end up in the situation
> > > when the new cgroup is not large enough to include the transferred object,
> > > but also the original cgroup is not large enough (due to the limit set on one
> > > of it's ancestors), so we'll need to break memory.max of either cgroup,
> > > which is not great. We might solve this by pre-charging of target cgroup
> > > and keeping the double-charge during the process, but it might not work
> > > well for really large objects on small machines. Another approach is to transfer
> > > in small chunks (e.g. pages), but then we might end with a partially transferred
> > > object, which is also a questionable result.
> > >
> >
> > For this case it is not difficult to do the fallback because the
> > original one is restricted to an offline memcg only, that means there
> > are no any activities  in the original memcg. So recharge these pages
> > to the original one back will always succeed.
>
> The problem is that the original cgroup might be not a top-level cgroup.
> So even if it's offline, it doesn't really change anything: it's parent cgroup
> can be online and experience concurrent limits changes, allocations etc.
>
> >
> > > <...>
> > >
> > > > > Will reparenting work for your case? If not, can you, please, describe the
> > > > > problem you're trying to solve by recharging the memory?
> > > > >
> > > >
> > > > Reparenting doesn't work for us.
> > > > The problem is memory resource control: the limitation on the bpf
> > > > containers will be useless
> > > > if the lifecycle of bpf progs can containers are not the same.
> > > > The containers are always upgraded - IOW restarted - more frequently
> > > > than the bpf progs and maps,
> > > > that is also one of the reasons why we choose to pin them on the host.
> > >
> > > In general, I think I understand why this feature is useful for your case,
> > > however I do have some serious concerns about adding such feature to
> > > the upstream kernel:
> > > 1) The interface and the proposed feature is bpf-specific, however the problem
> > > isn't. The same issue (an under reported memory consumption) can be caused by
> > > other types of memory: pagecache, various kernel objects e.g. vfs cache etc.
> > > If we introduce such a feature, we'd better be consistent across various
> > > types of objects (how it's a good question).
> >
> > That is really a good question, which drives me to think more and
> > investigate more.
> >
> > Per my understanding the under reported pages can be divided into several cases,
> > 1) The pages aren't charged correctly when they are allocated.
> >    In this case, we should fix it when we allocate it.
> > 2) The pages should be recharged back to the original memcg
> >    The pages are charged correctly but then we lost track of it.
> >    In this case the kernel must introduce some way to keep track of
> > and recharge it back in the proper circumstance.
> > 3) Undistributed estate
> >    The original owner was dead, left with some persistent memory.
> >    Should the new one who uses this memory take charge of it?
> >
> > So case #3 is what we should discuss here.
>
> Right, this is the case I'm focused on too.
>
> A particular case is when there are multiple generations of the "same"
> workload each running in a new cgroup. Likely there is a lot of pagecache
> and vfs cache (and maybe bpf programs etc) is re-used by the second and
> newer generations, however they are accounted towards the first dead cgroup.
> So the memory consumption of the second and newer generations is systematically
> under-reported.
>

Right, the sharing pagecache pages and vfs cache are more complicated.
The trouble is that we don't have a clear rule on what they should
belong to. If we want to handle them, we must make the rule first that
1) Should we charge these pages to a specific memcg in the first place ?
    If not, things will be very easy. If yes, things will be very complicated.
    Unfortunately we selected the complicated way.
2) Now that we selected the complicated way, can we have a clear rule
to manage them ?
    Our current status is that let it be, and it doesn't matter what
they belong to as long as they have a memcg.

> >
> > Before answering the question, I will explain another option we have
> > thought about to fix our issue.
> > Instead of recharging the bpf memory in the bpf syscall, the other
> > option is to set the target memcg only in the syscall and then wake up
> > a kworker to do the recharge. That means separate the recharge into
> > two steps, 1) assign the inheritor, 2) transfer the estate.
> > At last we didn't choose it because we want an immediate error if the
> > new owner doesn't have large enough space.
>
> The problem is that we often don't know this in advance. Imagine a cgroup
> with memory.max set to 1Gb and current usage 0.8Gb. Can it fit a 0.5Gb bpf map?
> The true answer is it depends on whether we can reclaim extra 0.3Gb. And there
> is no way to say it for sure without making a real attempt to reclaim.
>
> > But this option can partly answer your question here, one possible way
> > to do it more generic is to abstract
> > two methods to get -
> > 1). Who is the best inheritor               =>  assigner
> > 2). How to charge the memory to it    =>  charger
> >
> > Then let consider the option we choose again, we can find that it can be
> > easily extended to work in that way,
> >
> >        assigner                             charger
> >
> >     bpf_syscall
> >        wakeup the charger            waken
> >        wait for the result                 do the recharge and give the result
> >        return the result
> >
> > In other words, we don't have a clear idea what issues we may face in
> > the future, but we know we can extend it to fix the new coming issue.
> > I think that is the most important thing.
> >
> > > 2) Moving charges is proven to be tricky and cause various problems in the past.
> > > If we're going back into this direction, we should come up with a really solid
> > > plan for how to avoid past issues.
> >
> > I know the reason why we disable move_charge_at_immigrate in cgroup2,
> > but I don't know if I know all of the past issues.
> > Appreciate if you could share the past issues you know and I will
> > check if they apply to this case as well.
>
> As I mentioned above, recharging is a complex and potentially long process,
> which can unexpectedly fail. And rolling it back is also tricky and not always
> possible without breaking other things.
> So there are difficulties with:
> 1) providing a reasonable interface,
> 2) implementing it in way which doesn't bring significant performance overhead.
>
> That said, I'm not saying it's not possible at all, but it's a serious open
> problem.
>
> > In order to avoid possible risks, I have restricted the recharge to
> > happen in very strict conditions,
> > 1. The original memcg must be an offline memcg
> > 2.  The target memcg must be the memcg of the one who calls the bpf syscall
> >      That means the outsider doesn't have a way to do the recharge.
> > 3. only kmem is supported now. (The may be extend it the future for
> > other types of memory)
> >
> > > 3) It would be great to understand who and how will use this feature in a more
> > > generic environment. E.g. is it useful for systemd? Is it common to use bpf maps
> > > over multiple cgroups? What for (given that these are not system-wide programs,
> > > otherwise why would we charge their memory to some specific container)?
> > >
> >
> > It is useful for containerized environments.
> > The container which pinned bpf can use it.
> > In our case we may use it in two ways as I explained in the prev mail that,
> > 1) The one who load the bpf who do the recharge
> > 2) A sidecar to maintain the bpf cycle
> >
> > For the systemd, it may need to do some extend that,
> > The bpf services should describe,
> > 1) if the bpf service needs the recharge (the one who limited by memcg
> > should be forcefully do the recharge)
> > 2) the pinned progs and maps to check
> > 3) the service identifier (with which we can get the target memcg)
> >
> > We don't have the case that the bpf map is shared by multiple cgroups,
> > that should be a rare case.
> > I think that case is similar to the sharing page caches across
> > multiple cgroups, which are used by many cgroups but only charged to
> > one specific memcg.
>
> I understand the case with the pagecache. E.g. we're running essentially the
> same workload in a new cgroup and it likely uses the same or similar set of
> files, it will actively use the pagecache created by the previous generation.
> And this can be a memcg-specific pagecache, which nobody except these cgroups is
> using.
>
> But what kind of bpf data has the same property? Why it has to be persistent
> across multiple generations of the same workload?
>

Ah, it can be considered as shared, between the bpf memcg and the root
memcg. While it can only be written by bpf memcg. For example, in the
root memcg, some networking facilities like clsact qdisc also read
these maps.

The key point is that the charging behavior must be consistent, either
always charged or always uncharged. That will be good for memory
resource management. It is bad that sometimes it gets charged while
sometimes not.

Another possible solution is to introduce a way to allow not to charge
pages, IOW these pages will be accounted to root only. If we go that
direction, things will get simpler. What do you think?

> In the end, if the data is not too big (and assuming it's not happening too
> often), it's possible to re-create the map and copy the data.
>

For one of our bpf services, the total size of its maps is around 1GB,
which is not small.

-- 
Thanks
Yafang