Hi Johannes,

On Mon, Nov 22, 2021 at 11:04 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
[...]
> Here is one idea:
>
> Have you considered reparenting pages that are accessed by multiple
> cgroups to the first common ancestor of those groups?
>
> Essentially, whenever there is a memory access (minor fault, buffered
> IO) to a page that doesn't belong to the accessing task's cgroup, you
> find the common ancestor between that task and the owning cgroup, and
> move the page there.
>
> With a tree like this:
>
>     root - job group - job
>                     `- job
>         `- job group - job
>                     `- job
>
> all pages accessed inside that tree will propagate to the highest
> level at which they are shared - which is the same level where you'd
> also set shared policies, like a job group memory limit or io weight.
>
> E.g. libc pages would (likely) bubble to the root, persistent tmpfs
> pages would bubble to the respective job group, private data would
> stay within each job.
>
> No further user configuration necessary. Although you still *can* use
> mount namespacing etc. to prohibit undesired sharing between cgroups.
>
> The actual user-visible accounting change would be quite small, and
> arguably much more intuitive. Remember that accounting is recursive,
> meaning that a job page today also shows up in the counters of job
> group and root. This would not change. The only thing that IS weird
> today is that when two jobs share a page, it will arbitrarily show up
> in one job's counter but not in the other's. That would change: it
> would no longer show up as either, since it's not private to either;
> it would just be a job group (and up) page.
>
> This would be a generic implementation of resource sharing semantics:
> independent of data source and filesystems, contained inside the
> cgroup interface, and reusing the existing hierarchies of accounting
> and control domains to also represent levels of common property.
>
> Thoughts?

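To make the quoted idea concrete, here is a minimal user-space model of
it in Python. It is only a sketch, not kernel code: the Cgroup and Page
classes, the `local` counter, and the common_ancestor(), on_access() and
usage() helpers are all hypothetical names invented for illustration.
It assumes a charge is simply moved on access and ignores limits,
reclaim and locking.

```python
# Toy user-space model of the quoted proposal; not kernel code.
# Every name below is hypothetical and exists only for illustration.

class Cgroup:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.local = 0  # pages charged directly to this cgroup

    def ancestors(self):
        """Yield this cgroup, then each ancestor up to the root."""
        node = self
        while node is not None:
            yield node
            node = node.parent


class Page:
    def __init__(self, owner):
        self.owner = owner  # first toucher gets the initial charge
        owner.local += 1


def common_ancestor(a, b):
    """Return the deepest cgroup that is an ancestor-or-self of both."""
    seen = {id(n) for n in a.ancestors()}
    for node in b.ancestors():
        if id(node) in seen:
            return node
    raise ValueError("cgroups are not in the same tree")


def on_access(page, accessor):
    """Move the page's charge to the common ancestor of owner and accessor."""
    target = common_ancestor(page.owner, accessor)
    if target is not page.owner:
        page.owner.local -= 1
        target.local += 1
        page.owner = target


def usage(cg, all_cgroups):
    """Recursive usage: pages charged to cg or to any of its descendants."""
    return sum(c.local for c in all_cgroups if cg in c.ancestors())


root = Cgroup("root")
group_a = Cgroup("job group A", root)
job1 = Cgroup("job 1", group_a)
job2 = Cgroup("job 2", group_a)
group_b = Cgroup("job group B", root)
job3 = Cgroup("job 3", group_b)
cgroups = [root, group_a, job1, job2, group_b, job3]

tmpfs_page = Page(job1)
on_access(tmpfs_page, job2)    # shared inside job group A
libc_page = Page(job1)
on_access(libc_page, job3)     # shared across job groups
private_page = Page(job2)
on_access(private_page, job2)  # only ever touched by its owner

for cg in cgroups:
    print(f"{cg.name}: local={cg.local} recursive={usage(cg, cgroups)}")
```

In this model the tmpfs-like page ends up charged to job group A, the
libc-like page to the root, and the private page stays with job 2,
while the recursive counters of the job group and the root still
include everything below them.
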
Before commenting on your proposal, I would like to clarify that the
use-cases given are not specific to us but are more general. Though I
think you are arguing that the implementation is not general purpose,
which I kind of agree with.

Let me take a stab again at describing these use-cases, which I think
can be partitioned based on the relationship between the entities
sharing/accessing the memory. (Sorry for repeating these, but I think
we should keep them in mind while discussing the possible solutions.)

1) Mutually trusted entities sharing memory for collaborative work.
   One example is a file-system shared between sub-tasks of a meta-job.
   (Mina's second use-case.)

2) Independent entities sharing memory to reduce cost. Examples include
   shared libraries, packages or tool chains. (Mina's third use-case.)

3) One entity observing or monitoring another entity. Examples include
   gdb, ptrace, uprobes, VM or process migration and checkpointing.

4) Server-Client relationship. (Mina's first use-case.)

Let me put (3) out of the way first as these operations have special
interfaces and the target entity is a process (not a cgroup). Remote
charging works for these and no new oom corner cases are introduced.

For (1) and (2), I think your proposal aligns pretty well with them but
one important property is still missing which we are very adamant
about, i.e. 'deterministic charge'. To explain with an example, suppose
two instances of the same job are running on two different systems. On
one system, the instance shares a library with an unrelated job; on the
other, the instance uses that library alone. The owner will see
different memory usage for the two instances, which can mess with their
resource planning.

However I think this can be solved very easily with an opt-in add-on.
The node controller knows upfront the libraries/packages which can be
shared between the jobs and is responsible for creating the cgroup
hierarchy (at least the top level) for the jobs.
It can create a common ancestor for all such jobs and let the kernel
know that if any descendant accesses these libraries, the charge goes
to this specific ancestor. If someone outside this sub-hierarchy
accesses the memory, follow the proposal, i.e. charge the common
ancestor. With this specific opt-in add-on, all job owners will see
more consistent usage for their jobs. [I am putting this up as a
brainstorming discussion.]

Regarding (4), for our use-case, the server wants the cost of the
memory needed to serve a client to be paid by the corresponding client.
Please note that the memory is not necessarily accessed by the client.

Now we can argue that this use-case can be served similarly to (3),
i.e. through a special interface/syscall. I think that would be
challenging, particularly when the lifetime of a client 'process' is
independent of the memory needed to serve that client. Another way is
to disable the accounting of that specific memory needed to serve the
clients (I think Roman suggested a similar notion as disabling
accounting of a tmpfs). Any other ideas?

thanks,
Shakeel
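
The opt-in add-on suggested for (1) and (2) can be modelled in a few
lines of user-space Python. This is a sketch under stated assumptions,
not a proposed kernel interface: the Cgroup class, the `designated`
mapping (library -> ancestor registered by the node controller), the
charge_target() and simulate() helpers and the library name "libfoo.so"
are all hypothetical. Each job is also assumed to have 10 private pages
so that the per-job numbers are visible.

```python
# Toy model of the opt-in "designated ancestor" rule; hypothetical names only.

class Cgroup:
    def __init__(self, name, parent=None):
        self.name, self.parent, self.local = name, parent, 0

    def ancestors(self):
        """Yield this cgroup, then each ancestor up to the root."""
        node = self
        while node is not None:
            yield node
            node = node.parent


def common_ancestor(a, b):
    """Deepest cgroup that is an ancestor-or-self of both a and b."""
    seen = {id(n) for n in a.ancestors()}
    return next(n for n in b.ancestors() if id(n) in seen)


def charge_target(designated, path, owner, accessor):
    """Pick the cgroup to charge when `accessor` touches a page of `path`.

    owner is the cgroup currently charged, or None on first touch.
    """
    anchor = designated.get(path)
    inside = anchor is not None and any(
        n is anchor for n in accessor.ancestors())
    # Opted-in descendants are treated as if the anchor itself accessed.
    base = anchor if inside else accessor
    # Anyone else follows the plain common-ancestor rule.
    return base if owner is None else common_ancestor(owner, base)


LIB = "libfoo.so"  # hypothetical shared library


def simulate(job_names, outsider=False):
    """Return ({job: private usage}, name of cgroup charged for LIB)."""
    root = Cgroup("root")
    shared = Cgroup("shared", root)  # ancestor created by node controller
    jobs = [Cgroup(n, shared) for n in job_names]
    accessors = list(jobs)
    if outsider:
        accessors.append(Cgroup("outsider", root))
    for job in jobs:
        job.local = 10  # assumed private pages per job
    designated = {LIB: shared}
    owner = None
    for cg in accessors:  # each accessor touches the library page once
        target = charge_target(designated, LIB, owner, cg)
        if owner is not None:
            owner.local -= 1
        target.local += 1
        owner = target
    return {j.name: j.local for j in jobs}, owner.name


system_a = simulate(["job", "unrelated"])  # library shared with another job
system_b = simulate(["job"])               # library used alone
print(system_a, system_b)
```

In this model the job's own usage is the same on both systems because
the library page is always charged to the designated ancestor, and an
access from outside that sub-hierarchy still moves the charge up to the
common ancestor.
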