On Thu, Apr 27, 2023 at 1:26 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> Thanks (for doing this test, and also to Nadav for all his inputs), and
> sorry for a late response.

No need to apologize: anyways, I've got you comfortably beat on being
late at this point :)

> These numbers caught my eye, and I'm very curious why even 2 vcpus can
> scale that bad.
>
> I gave it a shot on a test machine and I got something slightly different:
>
>   Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (20 cores, 40 threads)
>   $ ./demand_paging_test -b 512M -u MINOR -s shmem -v N
>   |-------+----------+--------|
>   | n_thr | per-vcpu |  total |
>   |-------+----------+--------|
>   |     1 |    39.5K |  39.5K |
>   |     2 |    33.8K |  67.6K |
>   |     4 |    31.8K | 127.2K |
>   |     8 |    30.8K | 246.1K |
>   |    16 |    21.9K | 351.0K |
>   |-------+----------+--------|
>
> I used larger ram due to less cores. I didn't try 32+ vcpus to make sure I
> don't have two threads content on a core/thread already since I only got 40
> hardware threads there, but still we can compare with your lower half.
>
> When I was testing I noticed bad numbers and another bug on not using
> NSEC_PER_SEC properly, so I did this before the test:
>
> https://lore.kernel.org/all/20230427201112.2164776-1-peterx@xxxxxxxxxx/
>
> I think it means it still doesn't scale that good, however not so bad
> either - no obvious 1/2 drop on using 2vcpus. There're still a bunch of
> paths triggered in the test so I also don't expect it to fully scale
> linearly. From my numbers I just didn't see as drastic as yours. I'm not
> sure whether it's simply broken test number, parameter differences
> (e.g. you used 64M only per-vcpu), or hardware differences.

Hmm, I suspect we're dealing with hardware differences here. I rebased
my changes onto those two patches you sent up, taking care not to
clobber them, but even with the repro command you provided my results
look very different than yours (at least on 1-4 vcpus) on the machine
I've been testing on (4x AMD EPYC 7B13 64-Core, 2.2GHz).

(n=20)
n_thr   per_vcpu   total
    1       154K    154K
    2        92K    184K
    4        71K    285K
    8        36K    291K
   16        19K    310K

Out of interest I tested on another machine (Intel(R) Xeon(R) Platinum
8273CL CPU @ 2.20GHz) as well, and the results are a bit different again:

(n=20)
n_thr   per_vcpu   total
    1       115K    115K
    2       103K    206K
    4        65K    262K
    8        39K    319K
   16        19K    398K

It is interesting how all three sets of numbers start off different but
seem to converge around 16 vCPUs. I did check that the memory fault
exits sped things up in all cases, and that at least stays true.

By the way, I've got a little helper script that I've been using to
run/average the selftest results (which can vary quite a bit). I've
attached it below - hopefully it doesn't bounce from the mailing list.
Just for reference, the invocation to test the command you provided is

> python dp_runner.py --num_runs 20 --max_cores 16 --percpu_mem 512M
import argparse
import re
import subprocess


def get_command(percpu_mem, cores, single_uffd, use_memfaults, overlap_vcpus):
    if overlap_vcpus and not single_uffd:
        raise RuntimeError("Overlapping vcpus but not using single uffd, very strange")
    return "./demand_paging_test -s shmem -u MINOR " \
           + " -b " + percpu_mem \
           + (" -a " if single_uffd or overlap_vcpus else "") \
           + (" -o " if overlap_vcpus else "") \
           + " -v " + str(cores) \
           + " -r " + (str(cores) if single_uffd or overlap_vcpus else "1") \
           + (" -w" if use_memfaults else "") \
           + "; exit 0"


def run_command(cmd):
    # text=True so the output is a str and the regexes below can match it.
    output = subprocess.check_output(cmd, shell=True, text=True)
    v_paging_rate_re = r"Per-vcpu demand paging rate:\s*(.*) pgs/sec"
    t_paging_rate_re = r"Overall demand paging rate:\s*(.*) pgs/sec"
    v_match = re.search(v_paging_rate_re, output, re.MULTILINE)
    t_match = re.search(t_paging_rate_re, output, re.MULTILINE)
    return float(v_match.group(1)), float(t_match.group(1))


if __name__ == "__main__":
    ap = argparse.ArgumentParser()
    ap.add_argument("--num_runs", type=int, dest='num_runs', required=True)
    ap.add_argument("--max_cores", type=int, dest='max_cores', required=True)
    ap.add_argument("--percpu_mem", type=str, dest='percpu_mem', required=True)
    # Note: with type=bool any non-empty value counts as true, so treat
    # these three as presence flags (omit them to leave them disabled).
    ap.add_argument("--oneuffd", type=bool, dest='oneuffd')
    ap.add_argument("--overlap", type=bool, dest='overlap')
    ap.add_argument("--memfaults", type=bool, dest='memfaults')
    args = ap.parse_args()

    print("Testing configuration: " + str(args))
    print("")

    cores = 1
    cores_arr = []
    results = []
    while cores <= args.max_cores:
        cmd = get_command(args.percpu_mem, cores, args.oneuffd, args.memfaults,
                          args.overlap)
        if cores == 1 or cores == 2:
            print("cmd = " + cmd)
        print("Testing cores = " + str(cores))

        full_results = [run_command(cmd) for _ in range(args.num_runs)]
        v_rates = [f[0] for f in full_results]
        t_rates = [f[1] for f in full_results]

        def print_rates(tag, rates):
            # Print the average rate in thousands of pgs/sec, truncated to
            # two decimal places.
            average = sum(rates) / len(rates)
            print(tag + ":\t\t" + str(int(average / 10) / 100))

        print_rates("Vcpu demand paging rate", v_rates)
        print_rates("Total demand paging rate", t_rates)

        cores_arr.append(cores)
        results.append((cores, v_rates, t_rates))
        cores *= 2

    for c, v_rates, t_rates in results:
        print("Full results on core " + str(c) + " :\n"
              + str(v_rates) + "\n" + str(t_rates))
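In case it's useful: the other configurations the script knows about (single
uffd, overlapping vcpus, memory fault exits, i.e. the selftest's -a/-o/-r/-w
options) can be swept with the same invocation by tacking on the extra flags,
e.g. something along the lines of

> python dp_runner.py --num_runs 20 --max_cores 16 --percpu_mem 512M --oneuffd 1 --overlap 1 --memfaults 1

where any non-empty value enables the corresponding flag.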