Re: cgroup specific sticky resources (was: Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.)

On Tue 19-07-22 11:46:41, Mina Almasry wrote:
[...]
> An interface like cgroup.sticky.[bpf/tmpfs/..] would work for us
> similar to tmpfs memcg= mount option. I would maybe rename it to
> cgroup.charge_for.[bpf/tmpfs/etc] or something.
> 
> With regards to OOM, my proposal on this patchset is to return ENOSPC
> to the caller if we hit the limit of the remote memcg and there is
> nothing to kill:
> https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@xxxxxxxxxx/

That would imply SIGBUS on the #PF path. Is this really how we want to
tell userspace that something it is not even aware of, like a limit in a
completely different resource domain, has been hit?
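
To make that concrete, here is a minimal userspace sketch (the tmpfs path
and size are made up) of what an application would observe under the
proposed behavior: an ordinary store into a shared mapping is interrupted
by SIGBUS because a charge against somebody else's memcg failed, with
nothing in the signal telling the program why.

/*
 * Minimal sketch: what an application would see if touching a page of a
 * tmpfs-backed mapping raised SIGBUS because the charge against a *remote*
 * memcg failed.  The file path is hypothetical; the point is only that the
 * signal arrives at an ordinary memory access.
 */
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
        static const char msg[] = "SIGBUS while touching a mapped tmpfs page\n";

        (void)sig; (void)info; (void)ctx;
        /* Only async-signal-safe calls in here. */
        write(STDERR_FILENO, msg, sizeof(msg) - 1);
        _exit(EXIT_FAILURE);
}

int main(void)
{
        struct sigaction sa = { 0 };
        size_t len = 1 << 20;

        sa.sa_sigaction = sigbus_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGBUS, &sa, NULL);

        /* Hypothetical tmpfs mount whose pages are charged to another cgroup. */
        int fd = open("/mnt/shared-tmpfs/data", O_RDWR | O_CREAT, 0600);
        if (fd < 0 || ftruncate(fd, len) < 0) {
                perror("open/ftruncate");
                return EXIT_FAILURE;
        }

        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return EXIT_FAILURE;
        }

        /* Each page faults in here; if its charge fails, SIGBUS fires mid-memset. */
        memset(p, 0xaa, len);

        puts("all pages charged successfully");
        return EXIT_SUCCESS;
}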

> There is some precedent to doing this in the kernel. If a hugetlb
> allocation hits the hugetlb_cgroup limit, we return ENOSPC to the
> caller (and SIGBUS in the charge path). The reason there being that we
> don't support oom-kill or reclaim or swap for hugetlb pages.

Following hugetlb is not really a great idea because hugetlb has always
been quite special and its users are aware of that. The same doesn't
really apply to other resources like tmpfs.
 
> I think it is also reasonable to prevent removing the memcg if there
> is cgroup.charge_for.[bpf/tmpfs/etc] still alive. Currently we prevent
> removing the memcg if there are tasks attached. So we can also prevent
> removing the memcg if there are bpf/tmpfs charge sources pending.

I can imagine some way of keeping cgroups active even without tasks, but
so far I haven't really seen a good way to achieve that.
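
For reference, the existing "busy" case already surfaces to userspace as
EBUSY from rmdir() on the cgroup directory; the proposal above would
presumably extend the same failure to a cgroup that is only pinned by
sticky charge sources. A tiny sketch (the cgroup path is made up):

/*
 * Sketch of how a "busy" cgroup is reported today: rmdir() on a cgroup
 * directory that still has members fails with EBUSY.  Under the proposal
 * above, a cgroup pinned only by sticky bpf/tmpfs charges would presumably
 * fail the same way.  The path below is made up.
 */
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        const char *cg = "/sys/fs/cgroup/workload-a";

        if (rmdir(cg) == 0)
                printf("%s removed\n", cg);
        else if (errno == EBUSY)
                printf("%s is still busy (tasks, children, or pinned charges)\n", cg);
        else
                perror("rmdir");

        return 0;
}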

A cgroup.sticky.[bpf/tmpfs/..] interface is really weird if you ask me.
For one thing, I have a hard time imagining how to identify those resources.
Identifying a tmpfs by path is really strange because the same mount point
can be referenced through many paths. Not to mention that the path can be
remounted/redirected to anything after the configuration, which would just
lead to a lot of confusion.
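
Just to illustrate the aliasing problem (both paths below are hypothetical,
e.g. one being a bind mount of the other): the very same tmpfs object is
reachable under distinct names, so a path on its own does not pin down the
mount you meant to configure.

/*
 * Why "identify the tmpfs by path" is ambiguous: the same object can be
 * reached through several paths (bind mounts, nested mounts, symlinks).
 * Both paths here are hypothetical; if one is a bind mount of the other,
 * st_dev/st_ino match even though the strings differ.
 */
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
        struct stat a, b;

        if (stat("/mnt/shm/data", &a) || stat("/srv/containers/shm/data", &b)) {
                perror("stat");
                return 1;
        }

        if (a.st_dev == b.st_dev && a.st_ino == b.st_ino)
                puts("different paths, same underlying object");
        else
                puts("paths refer to different objects");

        return 0;
}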

Exposing internal ids is also far from great. It would also put an
additional burden on the kernel implementation to ensure there is no
overlap in resources among different cgroups. Also, how many of those
sticky resource types do we want to grow over time?

To me this has way too many red flags; it sounds like an interface that
would break really easily.

The more I think about this, the more I agree with Tejun that corner
cases are just waiting to jump out at us.
-- 
Michal Hocko
SUSE Labs



