On Wed, Jul 06, 2022 at 03:11:46PM -0700, Alexei Starovoitov wrote:
> On Wed, Jul 6, 2022 at 12:09 PM Roman Gushchin <roman.gushchin@xxxxxxxxx> wrote:
> >
> > On Wed, Jul 06, 2022 at 09:47:32AM -0700, Alexei Starovoitov wrote:
> > > On Wed, Jul 6, 2022 at 8:59 AM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> > > >
> > > > GFP_ATOMIC doesn't cooperate well with memcg pressure so far, especially
> > > > if we allocate too much GFP_ATOMIC memory. For example, when we set a
> > > > memcg limit to bound non-preallocated bpf memory, GFP_ATOMIC can easily
> > > > break the memcg limit via force charging. So it is very dangerous to use
> > > > GFP_ATOMIC in the non-preallocated case. One way to make it safe is to
> > > > remove __GFP_HIGH from GFP_ATOMIC, IOW, to use (__GFP_ATOMIC |
> > > > __GFP_KSWAPD_RECLAIM) instead; then the allocation will be limited if we
> > > > allocate too much memory.
> > > >
> > > > We introduced BPF_F_NO_PREALLOC because full map pre-allocation is too
> > > > memory-expensive for some cases. Removing __GFP_HIGH therefore doesn't
> > > > break the rule of BPF_F_NO_PREALLOC; it shares the same goal of avoiding
> > > > issues caused by too much memory. So let's remove it.
> > > >
> > > > The force charge for GFP_ATOMIC was introduced in
> > > > commit 869712fd3de5 ("mm: memcontrol: fix network errors from failing
> > > > __GFP_ATOMIC charges") by checking __GFP_ATOMIC, then got improved in
> > > > commit 1461e8c2b6af ("memcg: unify force charging conditions") by
> > > > checking __GFP_HIGH (which is equivalent here, because both __GFP_HIGH
> > > > and __GFP_ATOMIC are set in GFP_ATOMIC). So, if we wanted to fix this in
> > > > memcg, we would have to carefully verify all the callsites. Since we can
> > > > fix it in BPF instead, we'd better not modify the memcg code.
> > > >
> > > > This fix also applies to other run-time allocations, for example the
> > > > allocations in lpm trie, local storage and devmap.
> > > > So let's fix it
> > > > consistently over the bpf code.
> > > >
> > > > __GFP_KSWAPD_RECLAIM doesn't currently cooperate well with memcg
> > > > pressure either. But the memcg code can be improved to make
> > > > __GFP_KSWAPD_RECLAIM work well under memcg pressure if desired.
> > >
> > > Could you elaborate ?
> > >
> > > > It also fixes a typo in the comment.
> > > >
> > > > Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx>
> > > > Reviewed-by: Roman Gushchin <roman.gushchin@xxxxxxxxx>
> > >
> > > Roman, do you agree with this change ?
> >
> > Yes, removing __GFP_HIGH makes sense to me. I can imagine we might want
> > it for *some* bpf allocations, but applying it unconditionally looks wrong.
>
> Yeah. It's a difficult trade-off to make without having the data
> to decide whether removing __GFP_HIGH can cause issues or not,

Yeah, the change looks reasonable, but it's hard to say without giving it
good testing in (something close to) a production environment.

> but do you agree that __GFP_HIGH doesn't cooperate well with memcg ?
> If so it's a bug on memcg side, right?

No. Historically we allowed high-prio allocations to exceed the memcg limit
because otherwise there were too many stability and performance issues.
It's not a memcg bug; it's a way to avoid exposing ENOMEM-handling bugs all
over the kernel code. Without memory cgroups, GFP_ATOMIC allocations rarely
fail, and a lot of code paths in the kernel are not really ready for them to
fail (at least that was the case several years ago; maybe things are better
now).

But this was usually considered in the context of small(ish) allocations
which do not change the global memory usage picture. Subsequent "normal"
allocations trigger reclaim/OOM, so from a user's POV the limit works as
expected. But with the ownership model and the size of bpf maps it's a
different story: if a bpf map belongs to an abandoned cgroup, it might
consume a lot of memory and there will be no "normal" allocations.
So the cgroup memory limit will never be applied. It's a valid issue;
I agree with Yafang here.

> but we should probably
> apply this band-aid on bpf side to fix the bleeding.
> Later we can add a knob to allow __GFP_HIGH usage on demand from
> bpf prog.

Yes, that sounds like a good idea. I have to think about what the best
approach is here; it's not obvious to me.

Thanks!