On 11/11/2023 11:34 PM, Yafang Shao wrote:
> Background
> ==========
>
> In our containerized environment, we've identified unexpected OOM events
> where the OOM-killer terminates tasks despite having ample free memory.
> This anomaly is traced back to tasks within a container using mbind(2) to
> bind memory to a specific NUMA node. When the allocated memory on this node
> is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> indiscriminately kills tasks. This becomes more critical with guaranteed
> tasks (oom_score_adj: -998) aggravating the issue.

Is there some reason why you can't fix the callers of mbind(2)?
This looks like a user-space configuration error rather than a
system security issue.

>
> The selected victim might not have allocated memory on the same NUMA node,
> rendering the killing ineffective. This patch aims to address this by
> disabling MPOL_BIND in container environments.
>
> In the container environment, our aim is to consolidate memory resource
> control under the management of kubelet. If users express a preference for
> binding their memory to a specific NUMA node, we encourage the adoption of
> a standardized approach. Specifically, we recommend configuring this memory
> policy through kubelet using cpuset.mems in the cpuset controller, rather
> than individual users setting it autonomously. This centralized approach
> ensures that NUMA nodes are globally managed through kubelet, promoting
> consistency and facilitating streamlined administration of memory resources
> across the entire containerized environment.

Changing system behavior for a single use case doesn't seem prudent.
You're introducing a bunch of kernel code to avoid fixing a broken
user-space configuration.

>
> Proposed Solutions
> ==================
>
> - Introduce Capability to Disable MPOL_BIND
>   Currently, any task can perform MPOL_BIND without specific capabilities.
>   Enforcing CAP_SYS_RESOURCE or CAP_SYS_NICE could be an option, but this
>   may have unintended consequences. Capabilities, being broad, might grant
>   unnecessary privileges. We should explore alternatives to prevent
>   unexpected side effects.
>
> - Use LSM BPF to Disable MPOL_BIND
>   Introduce LSM hooks for syscalls such as mbind(2), set_mempolicy(2), and
>   set_mempolicy_home_node(2) to disable MPOL_BIND. This approach is more
>   flexible and allows for fine-grained control without unintended
>   consequences. A sample LSM BPF program is included, demonstrating
>   practical implementation in a production environment.
>
> Future Considerations
> =====================
>
> In addition, there's room for enhancement in the OOM-killer for cases
> involving CONSTRAINT_MEMORY_POLICY. It would be more beneficial to
> prioritize selecting a victim that has allocated memory on the same NUMA
> node. My exploration on the lore led me to a proposal[0] related to this
> matter, although consensus seems elusive at this point. Nevertheless,
> delving into this specific topic is beyond the scope of the current
> patchset.
>
> [0]. https://lore.kernel.org/lkml/20220512044634.63586-1-ligang.bdlg@xxxxxxxxxxxxx/
>
> Yafang Shao (4):
>   mm, security: Add lsm hook for mbind(2)
>   mm, security: Add lsm hook for set_mempolicy(2)
>   mm, security: Add lsm hook for set_mempolicy_home_node(2)
>   selftests/bpf: Add selftests for mbind(2) with lsm prog
>
>  include/linux/lsm_hook_defs.h                      |  8 +++
>  include/linux/security.h                           | 26 +++++++
>  mm/mempolicy.c                                     | 13 ++++
>  security/security.c                                | 19 ++++++
>  tools/testing/selftests/bpf/prog_tests/mempolicy.c | 79 ++++++++++++++++++++++
>  tools/testing/selftests/bpf/progs/test_mempolicy.c | 29 ++++++++
>  6 files changed, 174 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/mempolicy.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_mempolicy.c
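
For reference, the decision being proposed ("disable MPOL_BIND") comes
down to a check on the mode argument that mbind(2) and set_mempolicy(2)
receive. Below is a minimal user-space sketch of that check, not the code
from the patchset. The helper name sketch_mempolicy_check() is
hypothetical, and it assumes the hook is handed the raw mode value, so
the optional mode flags have to be masked off before comparing. The
constants are copied from <linux/mempolicy.h> only so the sketch builds
without kernel headers.

```c
#include <assert.h>   /* static_assert (C11) */
#include <errno.h>    /* EPERM */

/* Values mirror the kernel UAPI header <linux/mempolicy.h>; they are
 * redefined here only so this sketch compiles on its own. */
enum {
	MPOL_DEFAULT    = 0,
	MPOL_PREFERRED  = 1,
	MPOL_BIND       = 2,
	MPOL_INTERLEAVE = 3,
};

/* Optional flags that user space may OR into the mode argument. */
#define MPOL_F_NUMA_BALANCING	(1 << 13)
#define MPOL_F_RELATIVE_NODES	(1 << 14)
#define MPOL_F_STATIC_NODES	(1 << 15)
#define MPOL_MODE_FLAGS \
	(MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES | MPOL_F_NUMA_BALANCING)

/* The base mode and the flag bits must not overlap for masking to work. */
static_assert((MPOL_BIND & MPOL_MODE_FLAGS) == 0,
	      "mode flags overlap the base mode values");

/*
 * Hypothetical helper (not a kernel or patchset API): returns 0 to allow
 * the request and -EPERM to deny it, following the usual LSM hook
 * convention of 0 or a negative errno.
 */
int sketch_mempolicy_check(int mode)
{
	int base = mode & ~MPOL_MODE_FLAGS;	/* strip the optional flags */

	return base == MPOL_BIND ? -EPERM : 0;
}
```

Note that comparing the raw mode against MPOL_BIND without masking would
let a caller through simply by adding MPOL_F_STATIC_NODES, so any
policy of this kind needs the mask regardless of where it is enforced.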