Re: [PATCH bpf-next v4 3/7] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs

Roman Gushchin <roman.gushchin@xxxxxxxxx> · Mon, 24 Oct 2022 19:38:36 -0700

On Mon, Oct 24, 2022 at 05:55:17PM -0700, Yosry Ahmed wrote:
> On Mon, Oct 24, 2022 at 5:48 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> >
> > On Mon, Oct 24, 2022 at 5:32 PM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote:
> > >
> > > On 10/24/22 5:21 PM, Yosry Ahmed wrote:
> > > > On Mon, Oct 24, 2022 at 2:15 PM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote:
> > > >>
> > > >> On 10/23/22 11:05 AM, Yonghong Song wrote:
> > > >>> +void bpf_cgrp_storage_free(struct cgroup *cgroup)
> > > >>> +{
> > > >>> +     struct bpf_local_storage *local_storage;
> > > >>> +     struct bpf_local_storage_elem *selem;
> > > >>> +     bool free_cgroup_storage = false;
> > > >>> +     struct hlist_node *n;
> > > >>> +     unsigned long flags;
> > > >>> +
> > > >>> +     rcu_read_lock();
> > > >>> +     local_storage = rcu_dereference(cgroup->bpf_cgrp_storage);
> > > >>> +     if (!local_storage) {
> > > >>> +             rcu_read_unlock();
> > > >>> +             return;
> > > >>> +     }
> > > >>> +
> > > >>> +     /* Neither the bpf_prog nor the bpf_map's syscall
> > > >>> +      * could be modifying the local_storage->list now.
> > > >>> +      * Thus, no elem can be added to or deleted from the
> > > >>> +      * local_storage->list by the bpf_prog or by the bpf_map's syscall.
> > > >>> +      *
> > > >>> +      * It is racing with __bpf_local_storage_map_free() alone
> > > >>> +      * when unlinking elem from the local_storage->list and
> > > >>> +      * the map's bucket->list.
> > > >>> +      */
> > > >>> +     bpf_cgrp_storage_lock();
> > > >>> +     raw_spin_lock_irqsave(&local_storage->lock, flags);
> > > >>> +     hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
> > > >>> +             bpf_selem_unlink_map(selem);
> > > >>> +             /* If local_storage list has only one element, the
> > > >>> +              * bpf_selem_unlink_storage_nolock() will return true.
> > > >>> +              * Otherwise, it will return false. The current loop iteration
> > > >>> +              * intends to remove all local storage. So the last iteration
> > > >>> +              * of the loop will set the free_cgroup_storage to true.
> > > >>> +              */
> > > >>> +             free_cgroup_storage =
> > > >>> +                     bpf_selem_unlink_storage_nolock(local_storage, selem, false, false);
> > > >>> +     }
> > > >>> +     raw_spin_unlock_irqrestore(&local_storage->lock, flags);
> > > >>> +     bpf_cgrp_storage_unlock();
> > > >>> +     rcu_read_unlock();
> > > >>> +
> > > >>> +     if (free_cgroup_storage)
> > > >>> +             kfree_rcu(local_storage, rcu);
> > > >>> +}
> > > >>
> > > >> [ ... ]
> > > >>
> > > >>> +/* *gfp_flags* is a hidden argument provided by the verifier */
> > > >>> +BPF_CALL_5(bpf_cgrp_storage_get, struct bpf_map *, map, struct cgroup *, cgroup,
> > > >>> +        void *, value, u64, flags, gfp_t, gfp_flags)
> > > >>> +{
> > > >>> +     struct bpf_local_storage_data *sdata;
> > > >>> +
> > > >>> +     WARN_ON_ONCE(!bpf_rcu_lock_held());
> > > >>> +     if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE))
> > > >>> +             return (unsigned long)NULL;
> > > >>> +
> > > >>> +     if (!cgroup)
> > > >>> +             return (unsigned long)NULL;
> > > >>> +
> > > >>> +     if (!bpf_cgrp_storage_trylock())
> > > >>> +             return (unsigned long)NULL;
> > > >>> +
> > > >>> +     sdata = cgroup_storage_lookup(cgroup, map, true);
> > > >>> +     if (sdata)
> > > >>> +             goto unlock;
> > > >>> +
> > > >>> +     /* only allocate new storage, when the cgroup is refcounted */
> > > >>> +     if (!percpu_ref_is_dying(&cgroup->self.refcnt) &&
> > > >>> +         (flags & BPF_LOCAL_STORAGE_GET_F_CREATE))
> > > >>> +             sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map *)map,
> > > >>> +                                              value, BPF_NOEXIST, gfp_flags);
> > > >>> +
> > > >>> +unlock:
> > > >>> +     bpf_cgrp_storage_unlock();
> > > >>> +     return IS_ERR_OR_NULL(sdata) ? (unsigned long)NULL : (unsigned long)sdata->data;
> > > >>> +}
> > > >>
> > > >> [ ... ]
> > > >>
> > > >>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > > >>> index 764bdd5fd8d1..32145d066a09 100644
> > > >>> --- a/kernel/cgroup/cgroup.c
> > > >>> +++ b/kernel/cgroup/cgroup.c
> > > >>> @@ -5227,6 +5227,10 @@ static void css_free_rwork_fn(struct work_struct *work)
> > > >>>        struct cgroup_subsys *ss = css->ss;
> > > >>>        struct cgroup *cgrp = css->cgroup;
> > > >>>
> > > >>> +#ifdef CONFIG_BPF_SYSCALL
> > > >>> +     bpf_cgrp_storage_free(cgrp);
> > > >>> +#endif
> > > >>
> > > >>
> > > >> After revisiting comment 4bfc0bb2c60e, some of the commit message came to my mind:
> > > >>
> > > >> " ...... it blocks a possibility to implement
> > > >>     the memcg-based memory accounting for bpf objects, because a circular
> > > >>     reference dependency will occur. Charged memory pages are pinning the
> > > >>     corresponding memory cgroup, and if the memory cgroup is pinning
> > > >>     the attached bpf program, nothing will be ever released."
> > > >>
> > > >> Considering the bpf_map_kzalloc() is used in bpf_local_storage_map.c and it can
> > > >> charge the memcg, I wonder if the cgrp_local_storage will have similar refcnt
> > > >> loop issue here.
> > > >>
> > > >> If here is the right place to free the cgrp_local_storage() and enough to break
> > > >> this refcnt loop, it will be useful to add some explanation and its
> > > >> consideration in the commit message.
> > > >>
> > > >
> > > > I think a similar refcount loop issue can happen here as well. IIUC,
> > > > this function will only be run when the css is released after all
> > > > references are dropped. So if memcg charging is enabled the cgroup
> > > > will never be removed. This one might be trickier to handle though..
> > >
> > > How about removing all storage from cgrp->bpf_cgrp_storage in
> > > cgroup_destroy_locked()?
> >
> > The problem here is that you lose information for cgroups that went
> > offline but still exist in the kernel (i.e offline cgroups). The
> > commit log 4bfc0bb2c60e mentions that such cgroups can have live
> > sockets attached, so this might be a problem? From a memory
> > perspective, offline memcgs can still undergo memory operations like
> > reclaim. If we are using BPF to collect cgroup statistics for memory
> > reclaim, we can't do so for offline memcgs, which is not the end of
> > the world, but the cgroup storages become slightly less powerful. We
> > might also lose some data that we have already stored for such offline
> > memcgs. Also BPF programs now need to handle the case where they have
> > a valid cgroup pointer but they cannot retrieve a cgroup storage for
> > it because it went offline.
> >
> > We ideally want to be able to charge the memory to the cgroup without
> > holding a ref to it, which is against the cgroup memory charging
> > model.
> 
> +Roman Gushchin
> 
> Wait a second. I think Roman implemented reparenting of BPF map memcg
> charges when a memcg is offlined. In this case, when a cgroup goes
> offline (aka rmdir()'d), the charges will actually move to the parent,
> and the local storage will no longer be holding any refs to the
> cgroup. In that case, the current freeing location should be fine and
> the cgroup local storage can continue to work with offlined cgroups.
> 
> Roman, could you please confirm or deny my thesis?

Yes, it shouldn't be a problem anymore since we've implemented the recharging
of all main kernel memory types (slabs, large kmallocs, percpu).

Thanks!