On Mon, Aug 22, 2022 at 2:19 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Mon, Aug 22, 2022 at 12:02:48PM -0700, Mina Almasry wrote:
> > On Mon, Aug 22, 2022 at 4:29 AM Tejun Heo <tj@xxxxxxxxxx> wrote:
> > > b. Let userspace specify which cgroup to charge for some constructs like
> > >    tmpfs and bpf maps. The key problems with this approach are
> > >
> > >    1. How to grant/deny what can be charged where. We must ensure that a
> > >       descendant can't move charges up or across the tree without the
> > >       ancestors allowing it.
> > >
> > >    2. How to specify the cgroup to charge. While specifying the target
> > >       cgroup directly might seem like an obvious solution, it has a couple
> > >       rather serious problems. First, if the descendant is inside a cgroup
> > >       namespace, it might not be able to see the target cgroup at all.
> > >       Second, it's an interface which is likely to cause misunderstandings
> > >       on how it can be used. It's too broad an interface.
> > >
> >
> > This is pretty much the solution I sent out for review about a year
> > ago and yes, it suffers from the issues you've brought up:
> > https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@xxxxxxxxxx/
> >
> >
> > > One solution that I can think of is leveraging the resource domain
> > > concept which is currently only used for threaded cgroups. All memory
> > > usages of threaded cgroups are charged to their resource domain cgroup
> > > which hosts the processes for those threads. The persistent usages have a
> > > similar pattern, so maybe the service level cgroup can declare that it's
> > > the encompassing resource domain and the instance cgroup can say whether
> > > it's gonna charge e.g. the tmpfs instance to its own or the encompassing
> > > resource domain.
> > >
> >
> > I think this sounds excellent and addresses our use cases. Basically
> > the tmpfs/bpf memory would get charged to the encompassing resource
> > domain cgroup rather than the instance cgroup, making the memory usage
> > of the first and second+ instances consistent and predictable.
> >
> > Would love to hear from other memcg folks what they would think of
> > such an approach. I would also love to hear what kind of interface you
> > have in mind. Perhaps a cgroup tunable that says whether it's going to
> > charge the tmpfs/bpf instance to itself or to the encompassing
> > resource domain?
>
> I like this too. It makes shared charging predictable, with a coherent
> resource hierarchy (congruent OOM, CPU, IO domains), and without the
> need for cgroup paths in tmpfs mounts or similar.
>
> As far as who is declaring what goes, though: if the instance groups
> can declare arbitrary files/objects persistent or shared, they'd be
> able to abuse this and sneak private memory past local limits and
> burden the wider persistent/shared domain with it.
>
> I'm thinking it might make more sense for the service level to declare
> which objects are persistent and shared across instances.
>
> If that's the case, we may not need a two-component interface. Just
> the ability for an intermediate cgroup to say: "This object's future
> memory is to be charged to me, not the instantiating cgroup."
>
> Can we require a process in the intermediate cgroup to set up the file
> or object, and use madvise/fadvise to say "charge me", before any
> instances are launched?

I think doing this on a file granularity makes it logistically hard to
use, no? The service needs to create a file in the shared domain and
all its instances need to re-use this exact same file.

Our Kubernetes use case from [1] shares a mount between subtasks
rather than specific files. This allows subtasks to create files at
will in the mount with the memory charged to the shared domain. I
imagine this is more convenient than a shared file.
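
To make the difference concrete, here is a toy Python model of the two
charging policies. It is only a sketch, not kernel code: `Cgroup`,
`TmpfsMount`, `charge_to` and `run` are made-up names for illustration,
and `charge_to` merely stands in for whatever per-mount setting might
eventually be agreed on. With first-touch charging, whichever instance
touches a shared page first pays for it; with a per-mount target, the
service-level cgroup pays regardless of which instance came first.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Cgroup:
    """Toy cgroup that only tracks bytes charged directly to it."""
    name: str
    local_charge: int = 0

    def charge(self, nbytes: int) -> None:
        self.local_charge += nbytes


@dataclass
class TmpfsMount:
    """Toy tmpfs mount.

    charge_to is a HYPOTHETICAL per-mount setting (not a real mount
    option). None means today's behaviour: charge the first toucher.
    """
    charge_to: Optional[Cgroup] = None
    charged_pages: set = field(default_factory=set)

    def touch(self, toucher: Cgroup, page: str, nbytes: int) -> None:
        if page in self.charged_pages:
            return  # page is already charged; later users pay nothing
        self.charged_pages.add(page)
        target = self.charge_to if self.charge_to is not None else toucher
        target.charge(nbytes)


def run(charge_to_service: bool) -> tuple:
    """Two instances touch the same page; return (service, inst1, inst2) charges."""
    service = Cgroup("service")
    inst1 = Cgroup("instance-1")
    inst2 = Cgroup("instance-2")
    mount = TmpfsMount(charge_to=service if charge_to_service else None)
    for inst in (inst1, inst2):
        mount.touch(inst, "shared-file:page0", 4096)
    return service.local_charge, inst1.local_charge, inst2.local_charge


if __name__ == "__main__":
    print("first-touch:", run(charge_to_service=False))
    print("per-mount:  ", run(charge_to_service=True))
```

In this model `run(False)` returns `(0, 4096, 0)`, so the first instance
looks heavier than the second, while `run(True)` returns `(4096, 0, 0)`
and both instances look identical.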

Our other use case, which I hope to address here as well, is a
service-client relationship from [1] where the service would like to
charge per-client memory back to the client itself. In this case the
service or client can create a mount from the shared domain and pass
it to the service, at which point the service is free to create/remove
files in this mount as it sees fit.

Would you be open to a per-mount interface rather than a per-file
fadvise interface?

Yosry, would a proposal like this be extensible to address the bpf
charging issues?

[1] https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@xxxxxxxxxx/