Re: [Phishing Risk] [External] Re: [PATCH 2/3] cgroup/cpuset: introduce cpuset.mems.migration

Abel Wu <wuyun.abel@xxxxxxxxxxxxx> · Wed, 5 May 2021 13:06:09 +0800

ping :)

On 4/27/21 10:43 PM, Tejun Heo wrote:
Hello,

On Mon, Apr 26, 2021 at 02:59:45PM +0800, Abel Wu wrote:
When a NUMA node is assigned to numa-service, the workload
on that node needs to be moved away quickly and completely.
The main aspects we care about during the eviction are as follows:

a) it should complete quickly enough that numa-services
    do not wait so long that user experience suffers
b) the workloads to be evicted may use a massive amount of
    memory, and migrating that much memory may cause a
    sudden, severe performance drop lasting tens of seconds,
    which certain workloads cannot afford
c) the impact of the eviction should be limited to the
    source and destination nodes
d) cgroup interface is preferred

So we came to the following idea:

1) fire up numa-services without waiting for memory migration
2) memory migration can be done asynchronously by using spare
    memory bandwidth

AutoNUMA seems to be a solution, but its scope is global, which
violates c) and d). And cpuset.memory_migrate operates in a synchronous

I don't think d) in itself is a valid requirement. How does it violate c)?

fashion, which breaks a) and b). So a mixture of the two, the new
cgroup2 interface cpuset.mems.migration, is introduced.

The new cpuset.mems.migration supports three modes:

  - "none" mode, meaning migration disabled
  - "sync" mode, which is exactly the same as the cgroup v1
    interface cpuset.memory_migrate
  - "lazy" mode, which, unlike cpuset.memory_migrate, only
    marks pages protnone while walking through them; the
    NUMA faults triggered by later accesses then perform
    the actual movement.
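
A toy user-space model (not kernel code, and not taken from the patch) may
help show how the three modes above differ. The names Page, ToyCpuset,
set_mems and touch are made up for illustration; in the kernel the real work
would happen in the page migration and NUMA hinting fault paths.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    node: int               # NUMA node currently holding the page
    protnone: bool = False  # True once marked to migrate on next touch


@dataclass
class ToyCpuset:
    """Hypothetical model of a cpuset; not a kernel data structure."""
    mode: str = "none"      # "none", "sync" or "lazy"
    pages: list = field(default_factory=list)
    target: int = -1        # destination node of the last mems change
    copies: int = 0         # page copies performed so far

    def set_mems(self, old_node, new_node):
        """Model a cpuset.mems change from old_node to new_node."""
        self.target = new_node
        if self.mode == "none":
            return  # migration disabled: pages stay where they are
        for page in self.pages:
            if page.node != old_node:
                continue
            if self.mode == "sync":
                self._move(page)      # copy before returning
            elif self.mode == "lazy":
                page.protnone = True  # defer the copy to first touch

    def touch(self, index):
        """Model a memory access; a protnone page takes a NUMA fault."""
        page = self.pages[index]
        if page.protnone:
            page.protnone = False
            self._move(page)
        return page.node

    def _move(self, page):
        page.node = self.target
        self.copies += 1


if __name__ == "__main__":
    cs = ToyCpuset(mode="lazy", pages=[Page(node=0) for _ in range(4)])
    cs.set_mems(old_node=0, new_node=1)
    print(cs.copies)                       # nothing copied yet
    cs.touch(2)
    print(cs.copies, [p.node for p in cs.pages])
```

In this sketch "lazy" lets the cpuset.mems change return without copying
anything, and only pages touched afterwards pay the migration cost, which is
what 1) and 2) above are after. With "sync" every page on the old node is
copied before set_mems returns.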

cpuset is already involved in NUMA allocation but it always felt like
something bolted on - it's weird to have cpu to NUMA node settings at global
level and then to have possibly conflicting direct NUMA configuration via
cpuset. My preference would be putting as much configuration as possible on
the mm / autonuma side and let cpuset's node confinements further restrict
their operations rather than cpuset having its own set of policy
configurations.

Johannes, what are your thoughts?

Thanks.