Oops, bounced back from the list.. Forward with no attachment this time - I
assume the information is still enough in the paragraphs even without the
flamegraphs. Sorry for the noise.

On Wed, May 03, 2023 at 05:18:13PM -0400, Peter Xu wrote:
> On Wed, May 03, 2023 at 12:45:07PM -0700, Anish Moorthy wrote:
> > On Thu, Apr 27, 2023 at 1:26 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> > >
> > > Thanks (for doing this test, and also to Nadav for all his inputs), and
> > > sorry for a late response.
> >
> > No need to apologize: anyways, I've got you comfortably beat on being
> > late at this point :)
> >
> > > These numbers caught my eye, and I'm very curious why even 2 vcpus can
> > > scale that badly.
> > >
> > > I gave it a shot on a test machine and I got something slightly
> > > different:
> > >
> > > Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (20 cores, 40 threads)
> > > $ ./demand_paging_test -b 512M -u MINOR -s shmem -v N
> > > |-------+----------+--------|
> > > | n_thr | per-vcpu | total  |
> > > |-------+----------+--------|
> > > |     1 | 39.5K    | 39.5K  |
> > > |     2 | 33.8K    | 67.6K  |
> > > |     4 | 31.8K    | 127.2K |
> > > |     8 | 30.8K    | 246.1K |
> > > |    16 | 21.9K    | 351.0K |
> > > |-------+----------+--------|
> > >
> > > I used larger ram due to fewer cores. I didn't try 32+ vcpus, to make
> > > sure I don't already have two threads contending on a core/thread since
> > > I only got 40 hardware threads there, but we can still compare with your
> > > lower half.
> > >
> > > When I was testing I noticed bad numbers and another bug on not using
> > > NSEC_PER_SEC properly, so I did this before the test:
> > >
> > > https://lore.kernel.org/all/20230427201112.2164776-1-peterx@xxxxxxxxxx/
> > >
> > > I think it means it still doesn't scale that well, however not so badly
> > > either - no obvious 1/2 drop on using 2 vcpus. There are still a bunch
> > > of paths triggered in the test so I also don't expect it to fully scale
> > > linearly. From my numbers I just didn't see anything as drastic as
> > > yours. I'm not sure whether it's simply a broken test number, parameter
> > > differences (e.g. you used 64M only per-vcpu), or hardware differences.
> >
> > Hmm, I suspect we're dealing with hardware differences here. I
> > rebased my changes onto those two patches you sent up, taking care not
> > to clobber them, but even with the repro command you provided my
> > results look very different from yours (at least on 1-4 vcpus) on the
> > machine I've been testing on (4x AMD EPYC 7B13 64-Core, 2.2GHz).
> >
> > (n=20)
> > n_thr  per_vcpu  total
> >     1      154K   154K
> >     2       92K   184K
> >     4       71K   285K
> >     8       36K   291K
> >    16       19K   310K
> >
> > Out of interest I tested on another machine (Intel(R) Xeon(R)
> > Platinum 8273CL CPU @ 2.20GHz) as well, and the results are a bit
> > different again
> >
> > (n=20)
> > n_thr  per_vcpu  total
> >     1      115K   115K
> >     2      103K   206K
> >     4       65K   262K
> >     8       39K   319K
> >    16       19K   398K
>
> Interesting.
>
> >
> > It is interesting how all three sets of numbers start off different
> > but seem to converge around 16 vCPUs. I did check to make sure the
> > memory fault exits sped things up in all cases, and that at least
> > stays true.
> >
> > By the way, I've got a little helper script that I've been using to
> > run/average the selftest results (which can vary quite a bit). I've
> > attached it below - hopefully it doesn't bounce from the mailing list.
> > Just for reference, the invocation to test the command you provided is
> >
> >     python dp_runner.py --num_runs 20 --max_cores 16 --percpu_mem 512M
>
> I found that indeed I shouldn't have stopped at 16 vcpus since that's
> exactly where it starts to bottleneck. :)
>
> So out of my curiosity I tried to profile the 32 vcpus case on my system
> with this test case, meanwhile I tried it both with:
>
>   - 1 uffd + 8 readers
>   - 32 uffds (so 32 readers)
>
> I've got the flamegraphs attached for both.
>
> It seems that when using >1 uffds the bottleneck is not the spinlock
> anymore but something else.
>
> From what I got there, vmx_vcpu_load() gets more highlights than the
> spinlocks. I think that's the tlb flush broadcast.
>
> While OTOH indeed when using 1 uffd we can obviously see the overhead of
> spinlock contention on either the fault() path or read()/poll(), as you
> and James rightfully pointed out.
>
> I'm not sure whether my numbers are caused by a special setup, though.
> After all I only had 40 hardware threads and I started 32 vcpus + 8
> readers, so there'll already be contention between the workloads.
>
> IMHO this means that there's still a chance to provide a more generic
> userfaultfd scaling solution as long as we can remove the single spinlock
> contention on the fault/fault_pending queues. I'll see whether I can still
> explore the possibility of this a bit and keep you guys updated. The
> general idea here to me is still to make multi-queue out of 1 uffd.
>
> I _think_ this might also be a positive result for your work, because if
> the bottleneck is not userfaultfd (as we scale it by creating multiple,
> ignoring the split vma effect), then it cannot be resolved by scaling
> userfaultfd alone anyway. So a general solution, even if it existed, may
> not work here for kvm, because we'll get stuck somewhere else already.
>
> --
> Peter Xu

--
Peter Xu
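
For readers unfamiliar with the "1 uffd + N readers" configuration discussed
above, below is a minimal sketch (not the selftest and not the bounced
dp_runner.py attachment): several threads block in read() on the same
userfaultfd and resolve faults with UFFDIO_COPY, so they all contend on the
single fault/fault_pending queue inside the kernel. The reader count and
region size are illustrative assumptions, and it uses MISSING faults with
UFFDIO_COPY rather than the minor-fault (UFFDIO_CONTINUE) path on shmem that
the test runs above exercise.

/* uffd_readers.c - illustrative only; error handling trimmed. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define NR_READERS 8                    /* "1 uffd + 8 readers" case */
#define REGION_SIZE (512UL << 20)       /* arbitrary 512M test region */

static int uffd;
static long page_size;

static void *reader(void *arg)
{
	char *zero = aligned_alloc(page_size, page_size);

	(void)arg;
	memset(zero, 0, page_size);
	for (;;) {
		struct uffd_msg msg;

		/* All readers block on the same uffd wait queue here. */
		if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
			continue;
		if (msg.event != UFFD_EVENT_PAGEFAULT)
			continue;

		struct uffdio_copy copy = {
			.dst = msg.arg.pagefault.address & ~(page_size - 1),
			.src = (unsigned long)zero,
			.len = page_size,
		};
		/* Resolves the fault and wakes the faulting thread. */
		ioctl(uffd, UFFDIO_COPY, &copy);
	}
	return NULL;
}

int main(void)
{
	pthread_t readers[NR_READERS];

	page_size = sysconf(_SC_PAGE_SIZE);
	uffd = syscall(__NR_userfaultfd, O_CLOEXEC);

	struct uffdio_api api = { .api = UFFD_API };
	ioctl(uffd, UFFDIO_API, &api);

	char *mem = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)mem, .len = REGION_SIZE },
		.mode = UFFDIO_REGISTER_MODE_MISSING,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	for (int i = 0; i < NR_READERS; i++)
		pthread_create(&readers[i], NULL, reader, NULL);

	/*
	 * Touch every page once; each access generates a MISSING fault
	 * handled by one of the readers. In the selftest the faults come
	 * from many vCPU threads in parallel instead.
	 */
	for (unsigned long off = 0; off < REGION_SIZE; off += page_size)
		(void)*(volatile char *)(mem + off);

	printf("touched %lu pages\n", REGION_SIZE / page_size);
	return 0;
}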