On Tue, Nov 14, 2023 at 6:15 PM Michal Hocko <mhocko@xxxxxxxx> wrote:
>
> On Mon 13-11-23 11:15:06, Yafang Shao wrote:
> > On Mon, Nov 13, 2023 at 12:45 AM Casey Schaufler <casey@xxxxxxxxxxxxxxxx> wrote:
> > >
> > > On 11/11/2023 11:34 PM, Yafang Shao wrote:
> > > > Background
> > > > ==========
> > > >
> > > > In our containerized environment, we've identified unexpected OOM events
> > > > where the OOM-killer terminates tasks despite having ample free memory.
> > > > This anomaly is traced back to tasks within a container using mbind(2) to
> > > > bind memory to a specific NUMA node. When the allocated memory on this node
> > > > is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> > > > indiscriminately kills tasks. This becomes more critical with guaranteed
> > > > tasks (oom_score_adj: -998) aggravating the issue.
> > >
> > > Is there some reason why you can't fix the callers of mbind(2)?
> > > This looks like an user space configuration error rather than a
> > > system security issue.
> >
> > It appears my initial description may have caused confusion. In this
> > scenario, the caller is an unprivileged user lacking any capabilities.
> > While a privileged user, such as root, experiencing this issue might
> > indicate a user space configuration error, the concerning aspect is
> > the potential for an unprivileged user to disrupt the system easily.
> > If this is perceived as a misconfiguration, the question arises: What
> > is the correct configuration to prevent an unprivileged user from
> > utilizing mbind(2)?"
>
> How is this any different than a non NUMA (mbind) situation?

In a UMA system, each gigabyte of memory carries the same cost.
Conversely, in a NUMA architecture, opting to confine processes within
a specific NUMA node incurs additional costs. In the worst-case
scenario, if all containers opt to bind their memory exclusively to
specific nodes, it will result in significant memory wastage.

> You can
> still have an unprivileged user to allocate just until the OOM triggers
> and disrupt other workload consuming more memory. Sure the mempolicy
> based OOM is less precise and it might select a victim with only a small
> consumption on a target NUMA node but fundamentally the situation is
> very similar. I do not think disallowing mbind specifically is solving a
> real problem.

How would you recommend addressing this more effectively?

--
Regards
Yafang