On Sun, Jun 26, 2022 at 11:32 AM Roman Gushchin <roman.gushchin@xxxxxxxxx> wrote: > > On Sat, Jun 25, 2022 at 08:28:37PM -0700, Roman Gushchin wrote: > > On Sat, Jun 25, 2022 at 11:26:13AM +0800, Yafang Shao wrote: > > > On Thu, Jun 23, 2022 at 11:29 AM Roman Gushchin > > > <roman.gushchin@xxxxxxxxx> wrote: > > > > > > > > On Sun, Jun 19, 2022 at 03:50:22PM +0000, Yafang Shao wrote: > > > > > After switching to memcg-based bpf memory accounting, the bpf memory is > > > > > charged to the loader's memcg by default, that causes unexpected issues for > > > > > us. For instance, the container of the loader may be restarted after > > > > > pinning progs and maps, but the bpf memcg will be left and pinned on the > > > > > system. Once the loader's new generation container is started, the leftover > > > > > pages won't be charged to it. That inconsistent behavior will make trouble > > > > > for the memory resource management for this container. > > > > > > > > > > In the past few days, I have proposed two patchsets[1][2] to try to resolve > > > > > this issue, but in both of these two proposals the user code has to be > > > > > changed to adapt to it, that is a pain for us. This patchset relieves the > > > > > pain by triggering the recharge in libbpf. It also addresses Roman's > > > > > critical comments. > > > > > > > > > > The key point we can avoid changing the user code is that there's a resue > > > > > path in libbpf. Once the bpf container is restarted again, it will try > > > > > to re-run the required bpf programs, if the bpf programs are the same with > > > > > the already pinned one, it will reuse them. > > > > > > > > > > To make sure we either recharge all of them successfully or don't recharge > > > > > any of them. The recharge prograss is divided into three steps: > > > > > - Pre charge to the new generation > > > > > To make sure once we uncharge from the old generation, we can always > > > > > charge to the new generation succeesfully. If we can't pre charge to > > > > > the new generation, we won't allow it to be uncharged from the old > > > > > generation. > > > > > - Uncharge from the old generation > > > > > After pre charge to the new generation, we can uncharge from the old > > > > > generation. > > > > > - Post charge to the new generation > > > > > Finnaly we can set pages' memcg_data to the new generation. > > > > > In the pre charge step, we may succeed to charge some addresses, but fail > > > > > to charge a new address, then we should uncharge the already charged > > > > > addresses, so another recharge-err step is instroduced. > > > > > > > > > > This pachset has finished recharging bpf hash map. which is mostly used > > > > > by our bpf services. The other maps hasn't been implemented yet. The bpf > > > > > progs hasn't been implemented neither. > > > > > > > > Without going into the implementation details, the overall approach looks > > > > ok to me. But it adds complexity and code into several different subsystems, > > > > and I'm 100% sure it's not worth it if we talking about a partial support > > > > of a single map type. Are you committed to implement the recharging > > > > for all/most map types and progs and support this code in the future? > > > > > > > > > > I'm planning to support it for all map types and progs. Regarding the > > > progs, it seems that we have to introduce a new UAPI for the user to > > > do the recharge, because there's no similar reuse path in libbpf. > > > > > > Our company is a heavy bpf user. We have many bpf programs running on > > > our production environment, including networking bpf, > > > tracing/profiling bpf, and some other bpf programs which are not > > > supported in upstream kernel, for example we're even trying the > > > sched-bpf[1] proposed by you (and you may remember that I reviewed > > > your patchset). Most of the networking bpf, e.g. gateway-bpf, > > > edt-bpf, loadbalance-bpf, veth-bpf, are pinned on the system. > > > > > > It is a trend that bpf will be introduced in more and more subsystems, > > > and thus it is no doubt that a bpf patchset will involve many > > > subsystems. > > > > > > That means I will be continuously active in these areas in the near > > > future, several years at least. > > > > Ok, I'm glad to hear this. I highly recommend to cover more map types > > and use cases in next iterations of the patchset. > > > > > > > > [1]. https://lwn.net/Articles/869433/ > > > > > > > I'm still feeling you trying to solve a userspace problem in the kernel. > > > > > > Your feeling can be converted to a simple question: is it allowed to > > > pin a bpf program by a process running in a container. The answer to > > > this simple question can help us to understand whether it is a user > > > bug or a kernel bug. > > > > > > I think you will agree with me that there's definitely no reason to > > > refuse to pin a bpf program by a containerized process. And then we > > > will find that the pinned-bpf-program doesn't cooperate well with > > > memcg. A kernel feature can't work together with another kernel > > > feature, and there's not even an interface provided to the user to > > > adjust it. The user either doesn't pin the bpf program or disable > > > kmemcg. Isn't it a kernel bug ? > > > > > > You may have a doubt why these two features can't cooperate. I will > > > explain it in detail. That will be a long story. > > > > > > It should begin with why we introduce bpf pinning. We pin it because > > > sometimes the lifecycle of a user application is different with the > > > bpf program, or there's no user agent at all. In order to make it > > > simple, I will take the no-user-agent (agent exits after pinning bpf > > > program) case as an example. > > > > > > Now thinking about what will happen if the agent which pins the bpf > > > program has a memcg. No matter if the agent destroys the memcg or not > > > once it exits, the memcg will not disappear because it is pinned by > > > the bpf program. To make it easy, let's assume the memcg isn't being > > > destroyed, IOW, it is online. > > > > > > An online memcg is not populated, but it is still being remote charged > > > (if it is a non-preallocate bpf map), that looks like a ghost. Now we > > > will look into the details to find what will happen to this ghost > > > memcg. > > > > > > If this ghost memcg is limited, it will introduce many issues AFAICS. > > > Firstly, the memcg will be force charged[2], and I think I don't need > > > to explain the reason to you. > > > Even worse is that it force-charges silently without any event, > > > because it comes from, > > > if (!gfpflags_allow_blocking(gfp_mask)) > > > goto nomem; > > > And then all memcg events will be skipped. So at least we will > > > introduce a force-charge event, > > > force: > > > + memcg_memory_event(mem_over_limit, MEMCG_FORCE_CHARGE); > > > page_counter_charge(&memcg->memory, nr_pages); > > > > This is actually a good point, let me try to fix it. > > > > > > > > And then we should allow alloc_htab_elem() to fail, > > > l_new = bpf_map_kmalloc_node(&htab->map, htab->elem_size, > > > - GFP_ATOMIC | __GFP_NOWARN, > > > + __GFP_ATOMIC | > > > __GFP_KSWAPD_RECLAIM | __GFP_NOWARN, > > > > It's not a memcg thing, it was done this way previously. Probably Alexei > > has a better explanation. Personally, I'm totally fine with removing > > __GFP_NOWARN, but maybe I just don't know something. > > > > > htab->map.numa_node); > > > And then we'd better introduce an improvement for memcg, > > > + /* > > > + * Should wakeup async memcg reclaim first, > > > + * in case there will be no direct memcg reclaim for a long time. > > > + * We can either introduce async memcg reclaim > > > + * or modify kswapd to reclaim a specific memcg > > > + */ > > > + if (gfp_mask & __GFP_KSWAPD_RECLAIM) > > > + wake_up_async_memcg_reclaim(); > > > if (!gfpflags_allow_blocking(gfp_mask)) > > > goto nomem; > > > > Hm, I see. It might be an issue if there is no global memory pressure, right? > > Let me think what I can do here too. > > > > > > > > And ..... > > > > > > Really bad luck that there are so many issues in memcg, but it may > > > also be because I don't have a deep understanding of memcg ...... > > > > > > I have to clarify that these issues are not caused by > > > memcg-based-bpf-accounting, but exposed by it. > > > > > > [ Time for lunch here, so I have to stop. ] > > > > Thank you for writing this text, it was interesting to follow your thinking. > > And thank you for bringing in these problems above. > > > > Let me be clear: I'm not opposing the idea of recharging, I'm only against > > introducing hacks for bpf-specific issues, which can't be nicely generalized > > for other use cases and subsystems. That's the only reason why I'm a bit > > defensive here. > > > > In general, as now memory cgroups do not provide an ability to recharge > > accounted objects (with some exceptions from the v1 legacy). It applies > > both to user and kernel memory. I agree, that bpf maps are in some sense > > unique, as they are potentially large kernel objects with a lot of control > > from the userspace. Is this a good reason to extend memory cgroup API > > with the recharging ability? Maybe, but if yes, let's do it well. > > > > The big question is how to do it? Memcg accounting is done in a way > > that requires little changes from the kernel code, right? You just > > add __GFP_ACCOUNT to gfp flags and that's it, you get a pointer to > > an already accounted object. The same applies for uncharging. > > It works transparently. > > > > Recharging is different: a caller should have some sort of the ownership > > over the object (to make sure we are not racing against the reclaim and/or > > another recharging). And the rules are different for each type of objects. > > It's a caller duty to make sure all parts of the complex object are properly > > recharged and nothing is left behind. There is also the reparenting mechanism > > which can race against the recharging. So it's not an easy problem. > > If an object is large, we probably don't want to recharge it at once, > > otherwise temporarily doubling of the accounted memory (thanks to the > > precharge-uncharge-commit approach) risks introducing spurious OOMs > > on memory-limited systems. > > > > So yeah, if it doesn't sound too scary for you, I'm happy to help > > with this. But it's a lot of work to do it properly, that's why I'm thinking > > that maybe it's better to workaround it in userspace, as Alexei suggested. > > And as Alexei mentioned, there are some changes coming around the way > how memory allocations will be usually performed by the bpf code, which might > make the whole problem and/or your solution obsolete. Please, make sure it's > still actual before sending the next version. > I have taken a look at Alexei's patchset. For the non-preallocated bpf memory, we have to use map->memcg to do the accounting, then this issue will still exist. It also reminds us that when we reuse a map we must make sure the map->memcg is the expected one. -- Regards Yafang