On 11/13/2023 12:50 AM, Ondrej Mosnacek wrote:
> On Mon, Nov 13, 2023 at 4:17 AM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
>> On Mon, Nov 13, 2023 at 12:45 AM Casey Schaufler <casey@xxxxxxxxxxxxxxxx> wrote:
>>> On 11/11/2023 11:34 PM, Yafang Shao wrote:
>>>> Background
>>>> ==========
>>>>
>>>> In our containerized environment, we've identified unexpected OOM events
>>>> where the OOM-killer terminates tasks despite having ample free memory.
>>>> This anomaly is traced back to tasks within a container using mbind(2) to
>>>> bind memory to a specific NUMA node. When the allocated memory on this node
>>>> is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
>>>> indiscriminately kills tasks. This becomes more critical with guaranteed
>>>> tasks (oom_score_adj: -998) aggravating the issue.
>>> Is there some reason why you can't fix the callers of mbind(2)?
>>> This looks like an user space configuration error rather than a
>>> system security issue.
>> It appears my initial description may have caused confusion. In this
>> scenario, the caller is an unprivileged user lacking any capabilities.
>> While a privileged user, such as root, experiencing this issue might
>> indicate a user space configuration error, the concerning aspect is
>> the potential for an unprivileged user to disrupt the system easily.
>> If this is perceived as a misconfiguration, the question arises: What
>> is the correct configuration to prevent an unprivileged user from
>> utilizing mbind(2)?"
>>
>>>> The selected victim might not have allocated memory on the same NUMA node,
>>>> rendering the killing ineffective. This patch aims to address this by
>>>> disabling MPOL_BIND in container environments.
>>>>
>>>> In the container environment, our aim is to consolidate memory resource
>>>> control under the management of kubelet. If users express a preference for
>>>> binding their memory to a specific NUMA node, we encourage the adoption of
>>>> a standardized approach. Specifically, we recommend configuring this memory
>>>> policy through kubelet using cpuset.mems in the cpuset controller, rather
>>>> than individual users setting it autonomously. This centralized approach
>>>> ensures that NUMA nodes are globally managed through kubelet, promoting
>>>> consistency and facilitating streamlined administration of memory resources
>>>> across the entire containerized environment.
>>> Changing system behavior for a single use case doesn't seem prudent.
>>> You're introducing a bunch of kernel code to avoid fixing a broken
>>> user space configuration.
>> Currently, there is no mechanism in place to proactively prevent an
>> unprivileged user from utilizing mbind(2). The approach adopted is to
>> monitor mbind(2) through a BPF program and trigger an alert if its
>> usage is detected. However, beyond this monitoring, the only recourse
>> is to verbally communicate with the user, advising against the use of
>> mbind(2). As a result, users will question why mbind(2) isn't outright
>> prohibited in the first place.
> Is there a reason why you can't use syscall filtering via seccomp(2)?
> AFAIK, all the mainstream container tooling already has support for
> specifying seccomp filters for containers.

That looks like a practical solution from here.
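
For illustration only, here is a rough sketch of what such a filter could look like when written as an OCI-style seccomp profile, the JSON format container runtimes commonly accept. Nothing below comes from the patch or from anyone in this thread: the helper name deny_syscalls_profile is hypothetical, and the allow-by-default action is an assumption made to keep the example short (a production profile would more likely be an allowlist). The field names defaultAction, syscalls, names, action and errnoRet are the ones the OCI runtime specification defines.

```python
import errno
import json


def deny_syscalls_profile(names, errno_ret=errno.EPERM):
    """Build a seccomp profile dict that allows everything except `names`.

    Hypothetical helper for illustration; calls to the listed syscalls
    fail with `errno_ret` instead of killing the task.
    """
    return {
        # Assumption: allow by default so only the listed calls are affected.
        "defaultAction": "SCMP_ACT_ALLOW",
        "syscalls": [
            {
                "names": sorted(names),
                "action": "SCMP_ACT_ERRNO",
                "errnoRet": errno_ret,
            }
        ],
    }


# Deny mbind(2) for the container; the caller sees EPERM.
profile = deny_syscalls_profile(["mbind"])
print(json.dumps(profile, indent=2))
```

The printed JSON could then be saved to a file and referenced as the container's seccomp profile in whatever way the tooling in use supports.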