Re: [PATCH v4 0/4] Deterministic charging of shared memory

On Mon, Nov 22, 2021 at 02:04:04PM -0500, Johannes Weiner wrote:
> On Fri, Nov 19, 2021 at 08:50:06PM -0800, Mina Almasry wrote:
> > Problem:
> > Currently shared memory is charged to the memcg of the allocating
> > process. This makes memory usage of processes accessing shared memory
> > a bit unpredictable since whichever process accesses the memory first
> > will get charged. We have a number of use cases where our userspace
> > would like deterministic charging of shared memory:
> > 
> > 1. System services allocating memory for client jobs:
> > We have services (namely a network access service[1]) that provide
> > functionality for clients running on the machine and allocate memory
> > to carry out these services. The memory usage of these services
> > depends on the number of jobs running on the machine and the nature of
> > the requests made to the service, which makes the memory usage of
> > these services hard to predict and thus hard to limit via memory.max.
> > These system services would like a way to allocate memory and instruct
> > the kernel to charge this memory to the client’s memcg.
> > 
> > 2. Shared filesystem between subtasks of a large job
> > Our infrastructure has large meta jobs such as kubernetes which spawn
> > multiple subtasks which share a tmpfs mount. These jobs and their
> > subtasks use that tmpfs mount for various purposes such as data
> > sharing or persisting data across subtask restarts. In kubernetes
> > terminology, the meta job is similar to a pod and the subtasks are
> > containers under the pod. We want the shared memory to be
> > deterministically charged to the kubernetes pod, independent of
> > the lifetime of the containers under the pod.
> > 
> > 3. Shared libraries and language runtimes shared between independent jobs.
> > We’d like to optimize memory usage on the machine by sharing libraries
> > and language runtimes among the many processes running on our machines
> > in separate memcgs. A side effect is that one job may be unlucky enough
> > to be the first to access many of the libraries and may get oom-killed,
> > since all the cached files get charged to it.
> > 
> > Design:
> > My rough proposal to solve this problem is to simply add a
> > ‘memcg=/path/to/memcg’ mount option for filesystems,
> > directing all the memory of the filesystem to be ‘remote charged’ to
> > the cgroup provided by that memcg= option.
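
[ For my own understanding: used from userspace this would look roughly like
  the sketch below. The mount point and size are made up, and the option is
  of course only understood with this patchset applied. ]

/* Hypothetical use of the proposed memcg= mount option. */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	if (mount("tmpfs", "/mnt/shared", "tmpfs", 0,
		  "size=1G,memcg=/path/to/memcg") < 0)
		perror("mount");
	return 0;
}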
> > 
> > Caveats:
> > 
> > 1. One complication to address is the behavior when the target memcg
> > hits its memory.max limit because of remote charging. In this case the
> > oom-killer will be invoked, but the oom-killer may not find anything
> > to kill in the target memcg being charged. There are a number of considerations
> > in this case:
> > 
> > 1. It's not great to kill the allocating process since the allocating process
> >    is not running in the memcg under oom, and killing it will not free memory
> >    in the memcg under oom.
> > 2. Pagefaults may hit the memcg limit, and we need to handle the pagefault
> >    somehow. If not, the process will loop on the pagefault forever in the
> >    upstream kernel.
> > 
> > In this case, I propose simply failing the remote charge and returning an ENOSPC
> > to the caller. This will cause the process executing the remote
> > charge to get an ENOSPC in non-pagefault paths, and a SIGBUS on the pagefault
> > path.  This will be documented behavior of remote charging, and this feature is
> > opt-in. Users can:
> > - Not opt into the feature if they want.
> > - Opt into the feature, accept the risk of receiving ENOSPC or SIGBUS, and
> >   abort if they desire.
> > - Gracefully handle any resulting ENOSPC or SIGBUS errors and continue their
> >   operation without executing the remote charge if possible.
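
[ If I follow, a service opting in would then have to handle both documented
  failure modes, along the lines of the sketch below; the file name and the
  recovery policy are made up. ]

/* Handle both failure modes of a remote charge: ENOSPC on the write
 * path and SIGBUS on the pagefault path. */
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

static void sigbus_handler(int sig)
{
	/* The fault-time remote charge hit memory.max; bail out
	 * instead of retrying the fault. */
	_exit(EXIT_FAILURE);
}

int main(void)
{
	char buf[4096] = { 0 };
	char *map;
	int fd;

	signal(SIGBUS, sigbus_handler);

	fd = open("/mnt/shared/data", O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return 1;

	/* Write path: a failed remote charge is reported as ENOSPC. */
	if (write(fd, buf, sizeof(buf)) < 0 && errno == ENOSPC)
		fprintf(stderr, "remote charge failed, skipping\n");

	/* Fault path: a failed remote charge is delivered as SIGBUS. */
	if (ftruncate(fd, sizeof(buf)) < 0)
		return 1;
	map = mmap(NULL, sizeof(buf), PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);
	if (map != MAP_FAILED)
		map[0] = 1;

	close(fd);
	return 0;
}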
> > 
> > 2. Only processes allowed to enter the cgroup at mount time can mount a
> > tmpfs with memcg=<cgroup>. This is to prevent intentional DoS of random cgroups
> > on the machine. However, once a filesystem is mounted with memcg=<cgroup>, any
> > process with write access to this mount point will be able to charge memory to
> > <cgroup>. This is largely a non-issue because in configurations where there is
> > untrusted code running on the machine, mount point access needs to be
> > restricted to the intended users only, regardless of whether the mount point
> > memory is deterministically charged or not.
> 
> I'm not a fan of this. It uses filesystem mounts to create shareable
> resource domains outside of the cgroup hierarchy, which has all the
> downsides you listed, and more:
> 
> 1. You need a filesystem interface in the first place, and a new
>    ad-hoc channel and permission model to coordinate with the cgroup
>    tree, which isn't great. All filesystems you want to share data on
>    need to be converted.
> 
> 2. It doesn't extend to non-filesystem sources of shared data, such as
>    memfds, ipc shm etc.
> 
> 3. It requires unintuitive configuration for what should be basic
>    shared accounting semantics. By default you still get the old
>    'first touch' semantics, but to get sharing you need to reconfigure
>    the filesystems?
> 
> 4. If a task needs to work with a hierarchy of data sharing domains -
>    system-wide, group of jobs, job - it must interact with a hierarchy
>    of filesystem mounts. This is a pain to set up and may require task
>    awareness. Moving data around, working with different mount points.
>    Also, no shared and private data accounting within the same file.
> 
> 5. It reintroduces cgroup1 semantics of tasks and resources, which are
>    entangled, sitting in disjoint domains. OOM killing is one quirk of
>    that, but there are others you haven't touched on. Who is charged
>    for the CPU cycles of reclaim in the out-of-band domain?  Who is
>    charged for the paging IO? How is resource pressure accounted and
>    attributed? Soon you need cpu= and io= as well.
> 
> My take on this is that it might work for your rather specific
> usecase, but it doesn't strike me as a general-purpose feature
> suitable for upstream.
> 
> 
> If we want sharing semantics for memory, I think we need a more
> generic implementation with a cleaner interface.
> 
> Here is one idea:
> 
> Have you considered reparenting pages that are accessed by multiple
> cgroups to the first common ancestor of those groups?
> 
> Essentially, whenever there is a memory access (minor fault, buffered
> IO) to a page that doesn't belong to the accessing task's cgroup, you
> find the common ancestor between that task and the owning cgroup, and
> move the page there.
> 
> With a tree like this:
> 
> 	root - job group - job
>                         `- job
>             `- job group - job
>                         `- job
> 
> all pages accessed inside that tree will propagate to the highest
> level at which they are shared - which is the same level where you'd
> also set shared policies, like a job group memory limit or io weight.
> 
> E.g. libc pages would (likely) bubble to the root, persistent tmpfs
> pages would bubble to the respective job group, private data would
> stay within each job.
> 
> No further user configuration necessary. Although you still *can* use
> mount namespacing etc. to prohibit undesired sharing between cgroups.
> 
> The actual user-visible accounting change would be quite small, and
> arguably much more intuitive. Remember that accounting is recursive,
> meaning that a job page today also shows up in the counters of job
> group and root. This would not change. The only thing that IS weird
> today is that when two jobs share a page, it will arbitrarily show up
> in one job's counter but not in the other's. That would change: it
> would no longer show up as either, since it's not private to either;
> it would just be a job group (and up) page.
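
[ If I understand the mechanism correctly, the interesting part is walking the
  accessing and owning memcgs up to their first common ancestor on every
  foreign access, roughly like the sketch below. memcg_depth(), memcg_parent()
  and move_charge_to() are made-up placeholders; this is not against any real
  tree. ]

/* Find the first common ancestor of two memcgs by walking parent links. */
static struct mem_cgroup *common_ancestor(struct mem_cgroup *a,
					  struct mem_cgroup *b)
{
	while (memcg_depth(a) > memcg_depth(b))
		a = memcg_parent(a);
	while (memcg_depth(b) > memcg_depth(a))
		b = memcg_parent(b);
	while (a != b) {
		a = memcg_parent(a);
		b = memcg_parent(b);
	}
	return a;
}

/* On a foreign access (minor fault, buffered IO), move the page's charge
 * up to the common ancestor of the accessor and the current owner. */
static void maybe_reparent(struct page *page, struct mem_cgroup *accessor)
{
	struct mem_cgroup *owner = page_memcg(page);

	if (owner && owner != accessor)
		move_charge_to(page, common_ancestor(owner, accessor));
}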

In general I like the idea, but I think the user-visible change will be quite
large, almost "cgroup v3"-large. Here are some problems:
1) Anything shared between e.g. system.slice and user.slice would now belong
   to the root cgroup and be completely unaccounted/unlimited, e.g. all the
   pagecache belonging to shared libraries.
2) It's concerning in security terms. If I understand the idea correctly,
   read-only access would be enough to move charges to an upper level,
   potentially crossing memory.max limits. That doesn't sound safe.
3) It brings a non-trivial amount of memory to non-leaf cgroups. To some extent
   it returns us to the cgroup v1 world and the question of competition between
   resources consumed by a cgroup directly and through its child cgroups. It's
   not that the problem doesn't exist now, but it's less pronounced.
   If, say, >50% of system.slice's memory belongs to system.slice directly,
   then we will likely need separate non-recursive counters, limits, protections,
   etc.
4) Imagine a production server and a system administrator logging in over ssh
   (and being put into user.slice) and running a big grep... It screws up all
   memory accounting until the next reboot. Not a completely impossible scenario.

That said, I agree with Johannes and I'm also not a big fan of this patchset.

I agree that the problem exists and that the patchset provides a solution, but
it doesn't look nice (or generic enough) and creates a lot of questions and
corner cases.

Btw, wouldn't (an optional) disabling of memcg accounting for a tmpfs mount solve
your problem? It would be less invasive and would not require any oom changes.


