Re: [PATCH v2 0/5] mm, oom: Introduce per numa node oom for CONSTRAINT_{MEMORY_POLICY,CPUSET}

Abel Wu <wuyun.abel@xxxxxxxxxxxxx> · Tue, 12 Jul 2022 23:00:55 +0800

On 7/12/22 9:35 PM, Michal Hocko wrote:
On Tue 12-07-22 19:12:18, Abel Wu wrote:
[...]
I was just going through the mailing list and happened to see this. We
have another use case involving per-NUMA-node memory usage.

Say we have several important latency-critical services sitting inside
different NUMA nodes without intersection. The need for memory of these
LC services varies, so the free memory of each node is also different.
Then we launch several background containers without cpuset constraints
to consume the leftover resources. The problem is that there does not
seem to be a suitable memory policy for balancing usage across the
nodes, so memory-heavy LC services can suffer from high memory pressure
and fail to meet their SLOs.

I do agree that cpusets would be rather clumsy if usable at all in a
scenario when you are trying to mix NUMA bound workloads with those
that do not have any NUMA preferences. Could you be more specific about
requirements here though?

Yes, these LC services are highly sensitive to memory access latency
and bandwidth, so they are provisioned at NUMA node granularity to meet
their performance requirements. On the other hand, they usually do not
make full use of cpu/mem resources, which increases the TCO of our
IDCs, so we have to co-locate them with background tasks.

Some of these LC services are memory-bound but leave much of the cpu
capacity unused. In this case we want the co-located background tasks
to consume some of the leftover capacity without introducing noticeable
mm overhead for the LC services.

Let's say you run those latency critical services with "simple" memory
policies and mix them with the other workload without any policies in
place so they compete over memory. It is not really clear to me how you
can achieve any reasonable QoS in such an environment. Your latency
critical services will be more constrained than the non-critical ones
yet they are more demanding AFAIU.

Yes, QoS over memory is the biggest obstacle (the other resources are
relatively easier). For now, we have hacked up a new mpol that provides
weighted-interleave behavior to balance memory usage across NUMA nodes,
and we set memcg protections only for the LC services. If memory
pressure is still high, the background tasks are killed.
Ideas? Thanks!
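
To make the weighted-interleave idea concrete, here is a rough userspace
sketch of how page placement could be spread across nodes in proportion
to per-node weights. It is only an illustration: the function name
weighted_interleave and the example weights are hypothetical, and this
is not the in-kernel mpol code mentioned above.

```python
from itertools import islice


def weighted_interleave(weights):
    """Yield NUMA node ids forever, `weight` times per node per round.

    `weights` maps node id -> non-negative integer weight. This is a
    hypothetical model: a node with weight 3 receives three consecutive
    allocations for every one that a node with weight 1 receives.
    """
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    if not any(w > 0 for w in weights.values()):
        raise ValueError("at least one node needs a positive weight")
    while True:
        # Visit nodes in ascending id order so the sequence is stable.
        for node, weight in sorted(weights.items()):
            for _ in range(weight):
                yield node


# Assumed example: node 0 has more free memory, so it gets weight 3.
placement = list(islice(weighted_interleave({0: 3, 1: 1}), 8))
```

With weights chosen from the free memory left on each node, background
allocations would lean towards the nodes whose LC service needs less
memory, which is the balancing effect we are after.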

Abel