On Sat, Dec 23, 2023 at 8:16 AM Paul Moore <paul@xxxxxxxxxxxxxx> wrote: > > On Thu, Dec 14, 2023 at 7:51 AM Yafang Shao <laoar.shao@xxxxxxxxx> wrote: > > > > Background > > ========== > > > > In our containerized environment, we've identified unexpected OOM events > > where the OOM-killer terminates tasks despite having ample free memory. > > This anomaly is traced back to tasks within a container using mbind(2) to > > bind memory to a specific NUMA node. When the allocated memory on this node > > is exhausted, the OOM-killer, prioritizing tasks based on oom_score, > > indiscriminately kills tasks. > > > > The Challenge > > ============= > > > > In a containerized environment, independent memory binding by a user can > > lead to unexpected system issues or disrupt tasks being run by other users > > on the same server. If a user genuinely requires memory binding, we will > > allocate dedicated servers to them by leveraging kubelet deployment. > > > > Currently, users possess the ability to autonomously bind their memory to > > specific nodes without explicit agreement or authorization from our end. > > It's imperative that we establish a method to prevent this behavior. > > > > Proposed Solution > > ================= > > > > - Capability > > Currently, any task can perform MPOL_BIND without specific capabilities. > > Enforcing CAP_SYS_RESOURCE or CAP_SYS_NICE could be an option, but this > > may have unintended consequences. Capabilities, being broad, might grant > > unnecessary privileges. We should explore alternatives to prevent > > unexpected side effects. > > > > - LSM > > Introduce LSM hooks for syscalls such as mbind(2) and set_mempolicy(2) > > to disable MPOL_BIND. This approach is more flexibility and allows for > > fine-grained control without unintended consequences. A sample LSM BPF > > program is included, demonstrating practical implementation in a > > production environment. > > > > - seccomp > > seccomp is relatively heavyweight, making it less suitable for > > enabling in our production environment: > > - Both kubelet and containers need adaptation to support it. > > - Dynamically altering security policies for individual containers > > without interrupting their operations isn't straightforward. > > > > Future Considerations > > ===================== > > > > In addition, there's room for enhancement in the OOM-killer for cases > > involving CONSTRAINT_MEMORY_POLICY. It would be more beneficial to > > prioritize selecting a victim that has allocated memory on the same NUMA > > node. My exploration on the lore led me to a proposal[0] related to this > > matter, although consensus seems elusive at this point. Nevertheless, > > delving into this specific topic is beyond the scope of the current > > patchset. > > > > [0]. https://lore.kernel.org/lkml/20220512044634.63586-1-ligang.bdlg@xxxxxxxxxxxxx/ > > > > Changes: > > - v4 -> v5: > > - Revise the commit log in patch #5. (KP) > > - v3 -> v4: https://lwn.net/Articles/954126/ > > - Drop the changes around security_task_movememory (Serge) > > - RCC v2 -> v3: https://lwn.net/Articles/953526/ > > - Add MPOL_F_NUMA_BALANCING man-page (Ying) > > - Fix bpf selftests error reported by bot+bpf-ci > > - RFC v1 -> RFC v2: https://lwn.net/Articles/952339/ > > - Refine the commit log to avoid misleading > > - Use one common lsm hook instead and add comment for it > > - Add selinux implementation > > - Other improments in mempolicy > > - RFC v1: https://lwn.net/Articles/951188/ > > > > Yafang Shao (5): > > mm, doc: Add doc for MPOL_F_NUMA_BALANCING > > mm: mempolicy: Revise comment regarding mempolicy mode flags > > mm, security: Add lsm hook for memory policy adjustment > > security: selinux: Implement set_mempolicy hook > > selftests/bpf: Add selftests for set_mempolicy with a lsm prog > > > > .../admin-guide/mm/numa_memory_policy.rst | 27 +++++++ > > include/linux/lsm_hook_defs.h | 3 + > > include/linux/security.h | 9 +++ > > include/uapi/linux/mempolicy.h | 2 +- > > mm/mempolicy.c | 8 +++ > > security/security.c | 13 ++++ > > security/selinux/hooks.c | 8 +++ > > security/selinux/include/classmap.h | 2 +- > > .../selftests/bpf/prog_tests/set_mempolicy.c | 84 ++++++++++++++++++++++ > > .../selftests/bpf/progs/test_set_mempolicy.c | 28 ++++++++ > > 10 files changed, 182 insertions(+), 2 deletions(-) > > create mode 100644 tools/testing/selftests/bpf/prog_tests/set_mempolicy.c > > create mode 100644 tools/testing/selftests/bpf/progs/test_set_mempolicy.c > > In your original patchset there was a lot of good discussion about > ways to solve, or mitigate, this problem using existing mechanisms; > while you disputed many (all?) of those suggestions, I felt that they > still had merit over your objections. JFYI. The initial patchset presents three suggestions: - Disabling CONFIG_NUMA, proposed by Michal: By default, tasks on a server allocate memory from their local memory node initially. Disabling CONFIG_NUMA could potentially lead to a performance hit. - Adjusting NUMA workload configuration, also from Michal: This adjustment has been successfully implemented on some dedicated clusters, as mentioned in the commit log. However, applying this change universally across a large fleet of servers might result in significant wastage of physical memory. - Implementing seccomp, suggested by Ondrej and Casey: As indicated in the commit log, altering the security policy dynamically without interrupting a running container isn't straightforward. Implementing seccomp requires the introduction of an eBPF-based seccomp, which constitutes a substantial change. [ The seccomp maintainer has been added to this mail thread for further discussion. ] > I also don't believe the > SELinux implementation of the set_mempolicy hook fits with the > existing SELinux philosophy of access control via type enforcement; > outside of some checks on executable memory and low memory ranges, > SELinux doesn't currently enforce policy on memory ranges like this, > SELinux focuses more on tasks being able to access data/resources on > the system. > > My current opinion is that you should pursue some of the mitigations > that have already been mentioned, including seccomp and/or a better > NUMA workload configuration. I would also encourage you to pursue the > OOM improvement you briefly described. All of those seem like better > options than this new LSM/SELinux hook. Using the OOM solution should not be our primary approach. Whenever possible, we should prioritize alternative solutions to prevent encountering the OOM situation. -- Regards Yafang