Re: [PATCH v3 2/4] mm/oom: handle remote ooms

On Thu, Nov 18, 2021 at 12:47 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
>
> On Tue 16-11-21 13:27:34, Mina Almasry wrote:
> > On Tue, Nov 16, 2021 at 3:29 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
> [...]
> > > Can you elaborate some more? How do you enforce that the mount point
> > > cannot be accessed by anybody outside of that constraint?
> >
> > So if I'm a bad actor who wants to intentionally DoS random memcgs on
> > the system, I can:
> >
> > mount -t tmpfs -o memcg=/sys/fs/cgroup/unified/memcg-to-dos tmpfs /mnt/tmpfs
> > cat /dev/random > /mnt/tmpfs/file
>
> If you can mount tmpfs then you do not need to fiddle with memcgs at
> all. You just DoS the whole machine. That is not what I was asking
> though.
>
> My question was more about a different scenario. How do you prevent
> random processes from _writing_ to those mount points? User/group
> permissions might be just too coarse to describe the memcg relation.
> Without memcg in place somebody could cause ENOSPC for the mount point
> users, and that is not great either, but it should be recoverable to
> some degree. With a memcg configuration this would cause a memcg OOM,
> which would be harder to recover from because it affects all memcg
> charges in that cgroup - not just that specific fs access. See what I
> mean? This is a completely new failure mode.
>
> The only reasonable way would be to reduce the visibility of that mount
> point. This is certainly possible but it seems rather awkward when it
> should be accessible from multiple resource domains.
>

Preventing random processes from writing to a mount point is a generic
problem on any machine that runs untrusted code, which is a very common
configuration. In our case we have any number of workloads or VMs
running on the machine, and it's critical to limit their credentials to
exactly what each workload needs. Because of this, regardless of
whether the filesystem is mounted with memcg= or not, write/execute/read
permissions are granted only to those that need access to the mount
point. If this is not done correctly, the consequences are potentially
far more serious than causing OOMs or SIGBUSes for users of the mount
point.
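
As a rough sketch (the paths and user/group names here are made up, and
memcg= is the mount option proposed by this series), the setup for a
shared service mount might look like:

mount -t tmpfs -o memcg=/sys/fs/cgroup/unified/net-svc tmpfs /mnt/svc-shm
# Plain filesystem permissions keep unrelated processes from writing
# to (and thereby charging) the mount:
chown net-svc:svc-clients /mnt/svc-shm
chmod 0770 /mnt/svc-shm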

Because this is a generic problem, it's addressed elsewhere. I'm
honestly not an expert here, but my rough understanding is that Linux
filesystem permissions and user namespaces address this, and that there
are also higher-level constructs like containerd which limit the
visibility of jobs running on the system. My understanding is that
there are also sandboxes which go well beyond limiting file access
permissions.
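
For the visibility angle specifically, a private mount namespace can
hide the mount from a workload entirely. A minimal sketch (run as root;
the workload path is hypothetical):

# Launch the untrusted workload in its own mount namespace with the
# shared mount detached, so the workload cannot even see it:
unshare --mount --propagation private -- \
    sh -c 'umount /mnt/svc-shm; exec /path/to/untrusted-workload'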

To speak more concretely, for the 3 use cases I mention in the RFC
proposal (I'll attach that as the cover letter in the next version):
1. For services running on the system, the shared tmpfs mount is only
visible and accessible (read/write) to the network service and its
client.
2. For large jobs with subprocesses that share memory, as in
Kubernetes, the shared tmpfs is again only visible and accessible to
the processes in that job.
3. For filesystems that host shared libraries, it's a big no-no to
give anyone on the machine write permission to the runtime AFAIU, so
I expect the mount point to be read-only (see the sketch after this
list).
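
For case 3, a sketch of what I'd expect (again with made-up paths): the
runtime is populated by a privileged setup step and then remounted
read-only, so nothing can write to it - or charge to its memcg -
afterwards:

mount -t tmpfs -o memcg=/sys/fs/cgroup/unified/runtime tmpfs /mnt/runtime
cp -a /staging/shared-libs/. /mnt/runtime/  # privileged setup step
mount -o remount,ro /mnt/runtime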

Note that all these restrictions should and would be in place
regardless of whether the kernel supports the memcg= option or whether
the filesystem is mounted with memcg=. I'm not deeply familiar with the
implementation details of these restrictions, but I can dig them up.

> I cannot really shake off feeling that this is potentially adding more
> problems than it solves.
> --
> Michal Hocko
> SUSE Labs

On Thu, Nov 18, 2021 at 12:48 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
>
> On Tue 16-11-21 13:55:54, Shakeel Butt wrote:
> > On Tue, Nov 16, 2021 at 1:27 PM Mina Almasry <almasrymina@xxxxxxxxxx> wrote:
> > >
> > > On Tue, Nov 16, 2021 at 3:29 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
> > [...]
> > > > Yes, exactly. I meant that all this special casing would be done at the
> > > > shmem layer as it knows how to communicate this usecase.
> > > >
> > >
> > > Awesome. The more I think of it, the more I think the ENOSPC
> > > handling is perfect for this use case, because it gives all users of
> > > the shared memory and remote chargers a chance to gracefully handle
> > > the ENOSPC, or the SIGBUS when we hit the nothing-to-kill case. The
> > > only issue is finding a clean implementation, and if the
> > > implementation I just proposed sounds good to you then I see no
> > > issues and I'm happy to submit this in the next version. Shakeel and
> > > others, I would love to know what you think, either now or when I
> > > post the next version.
> > >
> >
> > The direction seems reasonable to me. I would have more comments on
> > the actual code. At a high level, I would prefer not to expose these
> > cases in the filesystem code (shmem or others) and instead have this
> > done in a new memcg interface for filesystem users.
>
> A library-like function in memcg proper sounds good to me. I just
> want to avoid any special casing in the core of the memcg charging
> path.
>

Yes, this is the implementation I'm working on, and I'll submit it in
the next version.
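
For clarity, the user-visible behavior I'm aiming for would look
roughly like this (a sketch assuming the memcg= option and the ENOSPC
semantics discussed above; the paths and limits are made up):

mount -t tmpfs -o memcg=/sys/fs/cgroup/unified/job tmpfs /mnt/job-shm
echo 64M > /sys/fs/cgroup/unified/job/memory.max

# A writer that would push the memcg over its limit gets ENOSPC from
# the write path instead of triggering an OOM kill in the remote memcg:
dd if=/dev/zero of=/mnt/job-shm/big bs=1M count=128
# expected: dd: error writing '/mnt/job-shm/big': No space left on device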


> --
> Michal Hocko
> SUSE Labs



