Re: [RFC PATCH -mm 0/4] mm, security, bpf: Fine-grained control over memory policy adjustments with lsm bpf

Yafang Shao <laoar.shao@xxxxxxxxx> · Tue, 14 Nov 2023 19:59:53 +0800

On Tue, Nov 14, 2023 at 6:15 PM Michal Hocko <mhocko@xxxxxxxx> wrote:
>
> On Mon 13-11-23 11:15:06, Yafang Shao wrote:
> > On Mon, Nov 13, 2023 at 12:45 AM Casey Schaufler <casey@xxxxxxxxxxxxxxxx> wrote:
> > >
> > > On 11/11/2023 11:34 PM, Yafang Shao wrote:
> > > > Background
> > > > ==========
> > > >
> > > > In our containerized environment, we've identified unexpected OOM events
> > > > where the OOM-killer terminates tasks although ample free memory remains.
> > > > This anomaly is traced back to tasks within a container using mbind(2) to
> > > > bind memory to a specific NUMA node. When the allocated memory on this node
> > > > is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> > > > indiscriminately kills tasks. This becomes more critical with guaranteed
> > > > tasks (oom_score_adj: -998) aggravating the issue.
> > >
> > > Is there some reason why you can't fix the callers of mbind(2)?
> > > This looks like a user space configuration error rather than a
> > > system security issue.
> >
> > It appears my initial description may have caused confusion. In this
> > scenario, the caller is an unprivileged user lacking any capabilities.
> > While a privileged user, such as root, experiencing this issue might
> > indicate a user space configuration error, the concerning aspect is
> > the potential for an unprivileged user to disrupt the system easily.
> > If this is perceived as a misconfiguration, the question arises: What
> > is the correct configuration to prevent an unprivileged user from
> > utilizing mbind(2)?
>
> How is this any different than a non NUMA (mbind) situation?

In a UMA system, each gigabyte of memory carries the same cost.
Conversely, in a NUMA architecture, confining a process to a specific
NUMA node incurs additional cost, because free memory on the other
nodes cannot satisfy its allocations. In the worst case, if all
containers bind their memory exclusively to specific nodes, a
significant amount of memory is wasted.
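
To make the wastage argument concrete, here is a toy model rather than
kernel code. It assumes a strict bind (MPOL_BIND-like, no fallback to
other nodes); the node sizes, container names and demands are made-up
numbers, and simulate() is a hypothetical helper used only for
illustration.

```python
from dataclasses import dataclass


@dataclass
class Container:
    name: str
    node: int       # NUMA node the container binds to (strict bind assumed)
    demand_gb: int  # memory the container tries to allocate, in GB


def simulate(node_capacity_gb, containers):
    """Return (free GB per node, names of containers whose node ran out)."""
    free = list(node_capacity_gb)
    starved = []
    for c in containers:
        if c.demand_gb <= free[c.node]:
            free[c.node] -= c.demand_gb
        else:
            # Strict bind: the allocation cannot spill over to another node,
            # so this is where a mempolicy-constrained OOM would occur.
            starved.append(c.name)
    return free, starved


# Hypothetical two-node machine with 64 GB per node.
capacity = [64, 64]
containers = [
    Container("a", node=0, demand_gb=40),
    Container("b", node=0, demand_gb=30),
    Container("c", node=1, demand_gb=10),
]

free, starved = simulate(capacity, containers)
print("free per node:", free)
print("starved:", starved)
print("total free GB:", sum(free))
```

In this made-up setup, container "b" cannot get its 30 GB on node 0
even though 78 GB remain free across the machine, which is the
situation described above.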

> You can
> still have an unprivileged user to allocate just until the OOM triggers
> and disrupt other workload consuming more memory. Sure the mempolicy
> based OOM is less precise and it might select a victim with only a small
> consumption on a target NUMA node but fundamentally the situation is
> very similar. I do not think disallowing mbind specifically is solving a
> real problem.

How would you recommend addressing this more effectively?

-- 
Regards
Yafang