On Mon, Nov 13, 2023 at 4:32 AM Paul Moore <paul@xxxxxxxxxxxxxx> wrote:
>
> On Sun, Nov 12, 2023 at 2:35 AM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> >
> > Background
> > ==========
> >
> > In our containerized environment, we've identified unexpected OOM events
> > where the OOM-killer terminates tasks despite having ample free memory.
> > This anomaly is traced back to tasks within a container using mbind(2) to
> > bind memory to a specific NUMA node. When the allocated memory on this node
> > is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> > indiscriminately kills tasks. This becomes more critical with guaranteed
> > tasks (oom_score_adj: -998) aggravating the issue.
> >
> > The selected victim might not have allocated memory on the same NUMA node,
> > rendering the killing ineffective. This patch aims to address this by
> > disabling MPOL_BIND in container environments.
> >
> > In the container environment, our aim is to consolidate memory resource
> > control under the management of kubelet. If users express a preference for
> > binding their memory to a specific NUMA node, we encourage the adoption of
> > a standardized approach. Specifically, we recommend configuring this memory
> > policy through kubelet using cpuset.mems in the cpuset controller, rather
> > than individual users setting it autonomously. This centralized approach
> > ensures that NUMA nodes are globally managed through kubelet, promoting
> > consistency and facilitating streamlined administration of memory resources
> > across the entire containerized environment.
> >
> > Proposed Solutions
> > ==================
> >
> > - Introduce Capability to Disable MPOL_BIND
> >   Currently, any task can perform MPOL_BIND without specific capabilities.
> >   Enforcing CAP_SYS_RESOURCE or CAP_SYS_NICE could be an option, but this
> >   may have unintended consequences. Capabilities, being broad, might grant
> >   unnecessary privileges. We should explore alternatives to prevent
> >   unexpected side effects.
> >
> > - Use LSM BPF to Disable MPOL_BIND
> >   Introduce LSM hooks for syscalls such as mbind(2), set_mempolicy(2), and
> >   set_mempolicy_home_node(2) to disable MPOL_BIND. This approach is more
> >   flexible and allows for fine-grained control without unintended
> >   consequences. A sample LSM BPF program is included, demonstrating
> >   practical implementation in a production environment.
>
> Without looking at the patchset in any detail yet, I wanted to point
> out that we do have some documented guidelines for adding new LSM
> hooks:
>
> https://github.com/LinuxSecurityModule/kernel/blob/main/README.md#new-lsm-hook-guidelines
>
> I just learned that there are provisions for adding this to the
> MAINTAINERS file, I'll be doing that shortly. My apologies for not
> having it in there sooner.

Thanks for the information. I will read it carefully.

--
Regards
Yafang