On Sat, Dec 23, 2023 at 10:35 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> On Sat, Dec 23, 2023 at 8:16 AM Paul Moore <paul@xxxxxxxxxxxxxx> wrote:
> > On Thu, Dec 14, 2023 at 7:51 AM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> > >
> > > Background
> > > ==========
> > >
> > > In our containerized environment, we've identified unexpected OOM events
> > > where the OOM-killer terminates tasks despite having ample free memory.
> > > This anomaly is traced back to tasks within a container using mbind(2) to
> > > bind memory to a specific NUMA node. When the allocated memory on this node
> > > is exhausted, the OOM-killer, prioritizing tasks based on oom_score,
> > > indiscriminately kills tasks.
> > >
> > > The Challenge
> > > =============
> > >
> > > In a containerized environment, independent memory binding by a user can
> > > lead to unexpected system issues or disrupt tasks being run by other users
> > > on the same server. If a user genuinely requires memory binding, we will
> > > allocate dedicated servers to them by leveraging kubelet deployment.
> > >
> > > Currently, users possess the ability to autonomously bind their memory to
> > > specific nodes without explicit agreement or authorization from our end.
> > > It's imperative that we establish a method to prevent this behavior.
> > >
> > > Proposed Solution
> > > =================
> > >
> > > - Capability
> > >   Currently, any task can perform MPOL_BIND without specific capabilities.
> > >   Enforcing CAP_SYS_RESOURCE or CAP_SYS_NICE could be an option, but this
> > >   may have unintended consequences. Capabilities, being broad, might grant
> > >   unnecessary privileges. We should explore alternatives to prevent
> > >   unexpected side effects.
> > >
> > > - LSM
> > >   Introduce LSM hooks for syscalls such as mbind(2) and set_mempolicy(2)
> > >   to disable MPOL_BIND. This approach is more flexible and allows for
> > >   fine-grained control without unintended consequences.
> > >   A sample LSM BPF program is included, demonstrating practical
> > >   implementation in a production environment.
> > >
> > > - seccomp
> > >   seccomp is relatively heavyweight, making it less suitable for
> > >   enabling in our production environment:
> > >   - Both kubelet and containers need adaptation to support it.
> > >   - Dynamically altering security policies for individual containers
> > >     without interrupting their operations isn't straightforward.
> > >
> > > Future Considerations
> > > =====================
> > >
> > > In addition, there's room for enhancement in the OOM-killer for cases
> > > involving CONSTRAINT_MEMORY_POLICY. It would be more beneficial to
> > > prioritize selecting a victim that has allocated memory on the same NUMA
> > > node. My exploration on the lore led me to a proposal[0] related to this
> > > matter, although consensus seems elusive at this point. Nevertheless,
> > > delving into this specific topic is beyond the scope of the current
> > > patchset.
> > >
> > > [0]. https://lore.kernel.org/lkml/20220512044634.63586-1-ligang.bdlg@xxxxxxxxxxxxx/
> > >
> > > Changes:
> > > - v4 -> v5:
> > >   - Revise the commit log in patch #5. (KP)
> > > - v3 -> v4: https://lwn.net/Articles/954126/
> > >   - Drop the changes around security_task_movememory (Serge)
> > > - RFC v2 -> v3: https://lwn.net/Articles/953526/
> > >   - Add MPOL_F_NUMA_BALANCING man-page (Ying)
> > >   - Fix bpf selftests error reported by bot+bpf-ci
> > > - RFC v1 -> RFC v2: https://lwn.net/Articles/952339/
> > >   - Refine the commit log to avoid misleading
> > >   - Use one common lsm hook instead and add comment for it
> > >   - Add selinux implementation
> > >   - Other improvements in mempolicy
> > > - RFC v1: https://lwn.net/Articles/951188/
> > >
> > > Yafang Shao (5):
> > >   mm, doc: Add doc for MPOL_F_NUMA_BALANCING
> > >   mm: mempolicy: Revise comment regarding mempolicy mode flags
> > >   mm, security: Add lsm hook for memory policy adjustment
> > >   security: selinux: Implement set_mempolicy hook
> > >   selftests/bpf: Add selftests for set_mempolicy with a lsm prog
> > >
> > >  .../admin-guide/mm/numa_memory_policy.rst          | 27 +++++++
> > >  include/linux/lsm_hook_defs.h                      |  3 +
> > >  include/linux/security.h                           |  9 +++
> > >  include/uapi/linux/mempolicy.h                     |  2 +-
> > >  mm/mempolicy.c                                     |  8 +++
> > >  security/security.c                                | 13 ++++
> > >  security/selinux/hooks.c                           |  8 +++
> > >  security/selinux/include/classmap.h                |  2 +-
> > >  .../selftests/bpf/prog_tests/set_mempolicy.c       | 84 ++++++++++++++++++++++
> > >  .../selftests/bpf/progs/test_set_mempolicy.c       | 28 ++++++++
> > >  10 files changed, 182 insertions(+), 2 deletions(-)
> > >  create mode 100644 tools/testing/selftests/bpf/prog_tests/set_mempolicy.c
> > >  create mode 100644 tools/testing/selftests/bpf/progs/test_set_mempolicy.c
> >
> > In your original patchset there was a lot of good discussion about
> > ways to solve, or mitigate, this problem using existing mechanisms;
> > while you disputed many (all?) of those suggestions, I felt that they
> > still had merit over your objections.
>
> JFYI.
> The initial patchset presents three suggestions:
> - Disabling CONFIG_NUMA, proposed by Michal:
>   By default, tasks on a server allocate memory from their local
>   memory node initially. Disabling CONFIG_NUMA could potentially lead to
>   a performance hit.
>
> - Adjusting NUMA workload configuration, also from Michal:
>   This adjustment has been successfully implemented on some dedicated
>   clusters, as mentioned in the commit log. However, applying this
>   change universally across a large fleet of servers might result in
>   significant wastage of physical memory.
>
> - Implementing seccomp, suggested by Ondrej and Casey:
>   As indicated in the commit log, altering the security policy
>   dynamically without interrupting a running container isn't
>   straightforward. Implementing seccomp requires the introduction of an
>   eBPF-based seccomp, which constitutes a substantial change.
>   [ The seccomp maintainer has been added to this mail thread for
>   further discussion. ]

The seccomp filter runs cBPF (classic BPF) and not eBPF; there are a
number of sandboxing tools designed to make this easier to use,
including systemd, and if you need to augment your existing
application there are libraries available to make this easier.

> > I also don't believe the
> > SELinux implementation of the set_mempolicy hook fits with the
> > existing SELinux philosophy of access control via type enforcement;
> > outside of some checks on executable memory and low memory ranges,
> > SELinux doesn't currently enforce policy on memory ranges like this,
> > SELinux focuses more on tasks being able to access data/resources on
> > the system.
> >
> > My current opinion is that you should pursue some of the mitigations
> > that have already been mentioned, including seccomp and/or a better
> > NUMA workload configuration. I would also encourage you to pursue the
> > OOM improvement you briefly described. All of those seem like better
> > options than this new LSM/SELinux hook.
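
As a rough illustration of the seccomp option mentioned above, here is a
minimal sketch (not a hardened or deployment-ready filter) of a classic
BPF program that returns EPERM for set_mempolicy(2) and mbind(2) when the
requested mode is MPOL_BIND. It assumes a little-endian x86_64 or arm64
userspace with the kernel uapi headers installed. The names
install_mempolicy_bind_filter(), probe_set_mempolicy(), SKETCH_ARCH,
MODE_MASK and ARG_LO are made up for this example, and the 0xff mask
relies on the MPOL_F_* mode flags living in higher bits of the mode
argument.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/mempolicy.h>
#include <linux/seccomp.h>

/* Assumption: only these two little-endian architectures are handled. */
#if defined(__x86_64__)
#define SKETCH_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define SKETCH_ARCH AUDIT_ARCH_AARCH64
#else
#error "this sketch only covers x86_64 and arm64"
#endif

/* Hypothetical mask: keep the low byte (the mode), drop MPOL_F_* flags. */
#define MODE_MASK 0xff
/* Offset of the low 32 bits of syscall argument n (little-endian only). */
#define ARG_LO(n) (offsetof(struct seccomp_data, args) + 8 * (n))

/* Install a filter that fails MPOL_BIND requests with EPERM. */
int install_mempolicy_bind_filter(void)
{
	static struct sock_filter filter[] = {
		/* 0-2: kill the process if the syscall ABI is unexpected. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, arch)),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SKETCH_ARCH, 1, 0),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
		/* 3-6: dispatch on the syscall number, allow everything else. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_set_mempolicy, 2, 0),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mbind, 3, 0),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
		/* 7-8: set_mempolicy(mode, ...): mode is argument 0. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ARG_LO(0)),
		BPF_STMT(BPF_JMP | BPF_JA, 1),
		/* 9: mbind(addr, len, mode, ...): mode is argument 2. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ARG_LO(2)),
		/* 10-13: strip mode flags, then deny only MPOL_BIND. */
		BPF_STMT(BPF_ALU | BPF_AND | BPF_K, MODE_MASK),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, MPOL_BIND, 0, 1),
		BPF_STMT(BPF_RET | BPF_K,
			 SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(filter) / sizeof(filter[0]),
		.filter = filter,
	};

	/* Required for unprivileged tasks before installing a filter. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
		return -1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Hypothetical probe: 0 on success, otherwise the errno that
 * set_mempolicy(mode, NULL, 0) failed with. */
int probe_set_mempolicy(long mode)
{
	if (syscall(__NR_set_mempolicy, mode, NULL, 0L) == 0)
		return 0;
	return errno;
}
```

Once install_mempolicy_bind_filter() has returned 0, a call such as
probe_set_mempolicy(MPOL_BIND) should come back with EPERM while other
modes still reach the kernel. Note that a filter like this is inherited
across fork/exec and cannot be removed from a running task, which is the
dynamic-policy limitation raised earlier in the thread.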
>
> Using the OOM solution should not be our primary approach. Whenever
> possible, we should prioritize alternative solutions to prevent
> encountering the OOM situation.

It's a good thing that there exist other options.

--
paul-moore.com