From: Kaiyang Zhao <kaiyang2@xxxxxxxxxx>

Adding some performance results from testing on a *real* system with CXL
memory to demonstrate the value of the patches.

The system has 256GB local DRAM + 64GB CXL memory. We stack two workloads
together in two cgroups. One is a microbenchmark that allocates memory and
accesses it at tunable hotness levels. It allocates 256GB of memory and
accesses it in sequential passes with a very hot access pattern (~1 second
per pass). The other workload is 64 instances of 520.omnetpp_r from SPEC
CPU 2017, which uses about 14GB of memory in total. We apply memory
bandwidth limits (1 Gbps per logical core) and mitigate LLC contention by
setting a cpuset for each cgroup.

Case 1: omnetpp running without the microbenchmark. It can use all local
memory with no resource contention. This is the optimal case.
Avg rate reported by SPEC = 84.7

Case 2: The two workloads stacked without the fairness patches, with the
microbenchmark started first.
Avg = 62.7 (-25.9%)

Case 3: memory.low = 19GB for both workloads. This is enough local memory
low protection to cover the entire memory usage of omnetpp.
Avg = 75.3 (-11.1%)

Analysis: omnetpp still uses a significant amount of CXL memory (up to 3GB)
by the time it finishes, because its hint faults only trigger for a few
seconds of its ~20 minute runtime. Due to the short runtime of the workload
and how tiering currently works, it finishes before its memory usage
converges to the point where all of it is local. However, this is still a
significant improvement over case 2.

Case 4: memory.low = 19GB for both workloads, and memory.high = 257GB for
the microbenchmark.
Avg = 84.0 (<1% difference from case 1)

Analysis: by setting both memory.low and memory.high, the microbenchmark's
local memory usage is essentially provisioned up front. Therefore, even
though the microbenchmark starts first, omnetpp can get all of its memory
in local memory from the very beginning and achieve near non-colocated
performance.

We're working on getting performance data from Meta's production workloads.
Stay tuned for more results.
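For reference, below is a minimal sketch of how the knobs in cases 3 and 4
can be set through the cgroup v2 filesystem. The cgroup names, the
/sys/fs/cgroup mount point, and the helper names are assumptions for
illustration, not taken from our actual test harness; it also assumes the
two cgroups already exist and the caller has permission to write their
control files.

    # Sketch: configure the memory.low / memory.high values used in
    # cases 3 and 4 via cgroup v2 control files (paths/names assumed).
    import os

    CGROOT = "/sys/fs/cgroup"   # assumed cgroup v2 mount point
    GIB = 1024 ** 3

    def write_knob(cgroup, knob, value):
        """Write a single cgroup v2 control file, e.g. memory.low."""
        path = os.path.join(CGROOT, cgroup, knob)
        with open(path, "w") as f:
            f.write(str(value))

    def setup_case4():
        # Hypothetical cgroup names for the two stacked workloads.
        for cg in ("microbench", "omnetpp"):
            write_knob(cg, "memory.low", 19 * GIB)   # case 3 + 4: low protection
        # Case 4 only: cap the microbenchmark so local memory is left
        # available for omnetpp from the start.
        write_knob("microbench", "memory.high", 257 * GIB)

    if __name__ == "__main__":
        setup_case4()

The workload processes are then attached to the respective cgroups (e.g.
via cgroup.procs) before the runs start.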