On Thu, Apr 27, 2023 at 1:26 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> Thanks (for doing this test, and also to Nadav for all his inputs), and
> sorry for a late response.

No need to apologize: anyways, I've got you comfortably beat on being
late at this point :)

> These numbers caught my eye, and I'm very curious why even 2 vcpus can
> scale that bad.
>
> I gave it a shot on a test machine and I got something slightly different:
>
>   Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (20 cores, 40 threads)
>   $ ./demand_paging_test -b 512M -u MINOR -s shmem -v N
>   |-------+----------+--------|
>   | n_thr | per-vcpu |  total |
>   |-------+----------+--------|
>   |     1 |    39.5K |  39.5K |
>   |     2 |    33.8K |  67.6K |
>   |     4 |    31.8K | 127.2K |
>   |     8 |    30.8K | 246.1K |
>   |    16 |    21.9K | 351.0K |
>   |-------+----------+--------|
>
> I used larger ram due to less cores. I didn't try 32+ vcpus to make sure I
> don't have two threads content on a core/thread already since I only got 40
> hardware threads there, but still we can compare with your lower half.
>
> When I was testing I noticed bad numbers and another bug on not using
> NSEC_PER_SEC properly, so I did this before the test:
>
> https://lore.kernel.org/all/20230427201112.2164776-1-peterx@xxxxxxxxxx/
>
> I think it means it still doesn't scale that good, however not so bad
> either - no obvious 1/2 drop on using 2vcpus. There're still a bunch of
> paths triggered in the test so I also don't expect it to fully scale
> linearly. From my numbers I just didn't see as drastic as yours. I'm not
> sure whether it's simply broken test number, parameter differences
> (e.g. you used 64M only per-vcpu), or hardware differences.

Hmm, I suspect we're dealing with hardware differences here. I rebased
my changes onto those two patches you sent up, taking care not to
clobber them, but even with the repro command you provided my results
look very different than yours (at least on 1-4 vcpus) on the machine
I've been testing on (4x AMD EPYC 7B13 64-Core, 2.2GHz).

(n=20)
n_thr   per_vcpu   total
    1       154K    154K
    2        92K    184K
    4        71K    285K
    8        36K    291K
   16        19K    310K

Out of interest I tested on another machine (Intel(R) Xeon(R) Platinum
8273CL CPU @ 2.20GHz) as well, and the results are a bit different again:

(n=20)
n_thr   per_vcpu   total
    1       115K    115K
    2       103K    206K
    4        65K    262K
    8        39K    319K
   16        19K    398K

It is interesting how all three sets of numbers start off different but
seem to converge around 16 vCPUs. I did check that the memory fault
exits sped things up in all cases, and that at least stays true.

By the way, I've got a little helper script that I've been using to
run/average the selftest results (which can vary quite a bit). I've
attached it below - hopefully it doesn't bounce from the mailing list.
Just for reference, the invocation to test the command you provided is

> python dp_runner.py --num_runs 20 --max_cores 16 --percpu_mem 512M
import argparse
import re
import subprocess


def get_command(percpu_mem, cores, single_uffd, use_memfaults, overlap_vcpus):
    if overlap_vcpus and not single_uffd:
        raise RuntimeError("Overlapping vcpus but not using single uffd, very strange")
    return "./demand_paging_test -s shmem -u MINOR " \
           + " -b " + percpu_mem \
           + (" -a " if single_uffd or overlap_vcpus else "") \
           + (" -o " if overlap_vcpus else "") \
           + " -v " + str(cores) \
           + " -r " + (str(cores) if single_uffd or overlap_vcpus else "1") \
           + (" -w" if use_memfaults else "") \
           + "; exit 0"


def run_command(cmd):
    # text=True so the output is a str and the regexes below can match it.
    output = subprocess.check_output(cmd, shell=True, text=True)
    v_paging_rate_re = r"Per-vcpu demand paging rate:\s*(.*) pgs/sec"
    t_paging_rate_re = r"Overall demand paging rate:\s*(.*) pgs/sec"
    v_match = re.search(v_paging_rate_re, output, re.MULTILINE)
    t_match = re.search(t_paging_rate_re, output, re.MULTILINE)
    return float(v_match.group(1)), float(t_match.group(1))


if __name__ == "__main__":
    ap = argparse.ArgumentParser()
    ap.add_argument("--num_runs", type=int, dest='num_runs', required=True)
    ap.add_argument("--max_cores", type=int, dest='max_cores', required=True)
    ap.add_argument("--percpu_mem", type=str, dest='percpu_mem', required=True)
    # Note: with type=bool any non-empty value counts as true, so treat
    # these three as presence flags (omit them to leave them disabled).
    ap.add_argument("--oneuffd", type=bool, dest='oneuffd')
    ap.add_argument("--overlap", type=bool, dest='overlap')
    ap.add_argument("--memfaults", type=bool, dest='memfaults')
    args = ap.parse_args()

    print("Testing configuration: " + str(args))
    print("")

    cores = 1
    cores_arr = []
    results = []
    while cores <= args.max_cores:
        cmd = get_command(args.percpu_mem, cores, args.oneuffd, args.memfaults,
                          args.overlap)
        if cores == 1 or cores == 2:
            print("cmd = " + cmd)
        print("Testing cores = " + str(cores))

        full_results = [run_command(cmd) for _ in range(args.num_runs)]
        v_rates = [f[0] for f in full_results]
        t_rates = [f[1] for f in full_results]

        def print_rates(tag, rates):
            # Print the average rate in thousands of pgs/sec, truncated to
            # two decimal places.
            average = sum(rates) / len(rates)
            print(tag + ":\t\t" + str(int(average / 10) / 100))

        print_rates("Vcpu demand paging rate", v_rates)
        print_rates("Total demand paging rate", t_rates)

        cores_arr.append(cores)
        results.append((cores, v_rates, t_rates))
        cores *= 2

    for c, v_rates, t_rates in results:
        print("Full results on core " + str(c) + " :\n"
              + str(v_rates) + "\n" + str(t_rates))
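In case it's useful: the other configurations the script knows about (single
uffd, overlapping vcpus, memory fault exits, i.e. the selftest's -a/-o/-r/-w
options) can be swept with the same invocation by tacking on the extra flags,
e.g. something along the lines of

> python dp_runner.py --num_runs 20 --max_cores 16 --percpu_mem 512M --oneuffd 1 --overlap 1 --memfaults 1

where any non-empty value enables the corresponding flag.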