Re: [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion

Adding some preliminary results from testing on a *real* system with CXL
memory.

The system has 256GB of local DRAM and 64GB of CXL memory. We used a microbenchmark
that allocates memory and accesses it at a tunable hotness level, and ran 3 such
microbenchmarks in 3 cgroups. The first container has twice the access hotness of
the second and third containers. Each container has memory.low set to 100GB,
meaning that ~82GB of its local DRAM usage is protected.

Case 1:

Container 1: Uses 120GB Container 2: Uses 40GB Container 3: Uses 40GB

Without fairness patch: same behavior as with the fairness patch.

With fairness patch: Container 1 has 120GB in local DRAM. Containers 2 and 3
each have 40GB in local DRAM. As long as local DRAM is not under pressure,
containers can exceed their lower guarantee and keep everything in DRAM.

Case 2:

Container 1: Uses 120GB Container 2: Uses 90GB Container 3: Uses 90GB

Without fairness patch: Container 1 gets 120GB in local DRAM, while Containers 2
and 3 are stuck with ~65GB each in local DRAM since their data is colder.

With fairness patch: Container 1 starts early and gets all 120GB in DRAM.
As Containers 2 and 3 start, they initially get ~65GB each in DRAM and
~25GB in CXL memory. Their promotion attempts trigger local memory reclaim by
kswapd, which trims Container 1's DRAM usage and grows the DRAM usage of
Containers 2 and 3. Eventually, the DRAM usage of all 3 containers converges
at ~82GB, and the excess unprotected usage of the 3 containers resides in CXL
memory.

Case 3:

Container 1: Uses 120GB Container 2: Uses 70GB Container 3: Uses 70GB

Without fairness patch: Container 1 gets 120GB in local DRAM, while Containers 2
and 3 are stuck with ~65GB each in local DRAM since their data is colder.

With fairness patch: Although the total memory demand exceeds DRAM capacity, in
the steady state Container 1 is still able to keep ~105GB in local DRAM, more
than its lower guarantee. Meanwhile, all memory used by Containers 2 and 3 is
protected from the noisy neighbor (Container 1) and resides entirely in DRAM.

We’re working on getting performance data from more benchmarks and also Meta’s
production workloads. Stay tuned for more results!



