Re: [RFC PATCH 0/4] memory tiering fairness by per-cgroup control of promotion and demotion

From: Kaiyang Zhao <kaiyang2@xxxxxxxxxx>

Adding some performance results from testing on a *real* system with CXL memory
to demonstrate the value of the patches.

The system has 256GB of local DRAM + 64GB of CXL memory. We stack two workloads
together in two cgroups. One is a microbenchmark that allocates memory and
accesses it at a tunable hotness level; it allocates 256GB of memory and
accesses it in sequential passes with a very hot access pattern (~1 second per
pass). The other workload is 64 instances of 520.omnetpp_r from SPEC CPU 2017,
which use about 14GB of memory in total. We apply memory bandwidth limits (1
Gbps of memory bandwidth per logical core) and mitigate LLC contention by
setting a cpuset for each cgroup.
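
For reference, a minimal sketch of how this kind of colocation could be set up
through the cgroup v2 filesystem. The cgroup names, mount point and CPU ranges
below are assumptions for illustration only, and the per-core memory bandwidth
limiting is not shown:

    # Sketch: one cgroup per workload, pinned to disjoint CPU sets to
    # mitigate LLC contention. Names and CPU ranges are hypothetical.
    import pathlib

    CGROOT = pathlib.Path("/sys/fs/cgroup")

    # Enable the cpuset and memory controllers for child cgroups.
    (CGROOT / "cgroup.subtree_control").write_text("+cpuset +memory")

    for name, cpus in (("microbench", "0-31"), ("omnetpp", "32-63")):
        cg = CGROOT / name
        cg.mkdir(exist_ok=True)
        (cg / "cpuset.cpus").write_text(cpus)
        # Workload processes would then be attached by writing their
        # PIDs to <cg>/cgroup.procs.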

Case 1: omnetpp running without the microbenchmark.
It is able to use all local memory without resource contention. This is the
optimal case.
Avg rate reported by SPEC = 84.7

Case 2: Running the two workloads stacked without the fairness patches, with
the microbenchmark started first.
Avg = 62.7 (-25.9%)

Case 3: Set memory.low = 19GB for both workloads. This is enough local memory
low protection to cover the entire memory usage of omnetpp.
Avg = 75.3 (-11.1%)
Analysis: omnetpp still uses a significant amount of CXL memory (up to 3GB) by
the time it finishes, because the hint faults for it only trigger for a few
seconds of its ~20 minute runtime. Due to the short runtime of the workload
and how tiering currently works, it finishes before its memory usage converges
to the point where all of its memory is local. However, this still represents
a significant improvement over case 2.
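
Continuing the sketch above, the memory.low setting for this case would look
roughly like this, with the local-memory low protection interpretation being
the one proposed by this patch set:

    # Sketch: Case 3 -- 19GB of low protection for both (hypothetical) cgroups.
    for name in ("microbench", "omnetpp"):
        (CGROOT / name / "memory.low").write_text(str(19 * 1024**3))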

Case 4: Set memory.low = 19GB for both workloads. Set memory.high = 257GB for
the microbenchmark.
Avg = 84.0 (<1% difference from case 1)
Analysis: by setting both memory.low and memory.high, the local memory usage
of the microbenchmark is essentially provisioned. Therefore, even though the
microbenchmark starts first, omnetpp can get all the local memory it needs
from the very beginning and achieves near non-colocated performance.
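
The additional knob for this case, continuing the same sketch (the cgroup name
is the hypothetical one from above; the value is the one used in the test):

    # Sketch: Case 4 -- in addition to the memory.low protection above,
    # cap the microbenchmark with memory.high so that local memory is
    # effectively provisioned between the two cgroups.
    (CGROOT / "microbench" / "memory.high").write_text(str(257 * 1024**3))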

We’re working on getting performance data from Meta’s production workloads.
Stay tuned for more results.
 



