Hi, Anish,

On Mon, Apr 24, 2023 at 05:15:49PM -0700, Anish Moorthy wrote:
> On Mon, Apr 24, 2023 at 12:44 PM Nadav Amit <nadav.amit@xxxxxxxxx> wrote:
> >
> >
> > > On Apr 24, 2023, at 10:54 AM, Anish Moorthy <amoorthy@xxxxxxxxxx> wrote:
> > >
> > > On Fri, Apr 21, 2023 at 10:40 AM Nadav Amit <nadav.amit@xxxxxxxxx> wrote:
> > >>
> > >> If I understand the problem correctly, it sounds as if the proper solution
> > >> should be some kind of range-locks. If it is too heavy or the interface can
> > >> be changed/extended to wake a single address (instead of a range),
> > >> simpler hashed-locks can be used.
> > >
> > > Some sort of range-based locking system does seem relevant, although I
> > > don't see how that would necessarily speed up the delivery of faults
> > > to UFFD readers: I'll have to think about it more.
> >
> > Perhaps I misread your issue. Based on the scalability issues you raised,
> > I assumed that the problem you encountered is related to lock contention.
> > I do not know whether you profiled it, but some information would be
> > useful.
>
> No, you had it right: the issue at hand is contention on the uffd wait
> queues. I'm just not sure what the range-based locking would really be
> doing. Events would still have to be delivered to userspace in an
> ordered manner, so it seems to me that each uffd would still need to
> maintain a queue (and the associated contention).
>
> With respect to the "sharding" idea, I collected some more runs of the
> self test (full command in [1]). This time I omitted the "-a" flag, so
> that every vCPU accesses a different range of guest memory with its
> own UFFD, and set the number of reader threads per UFFD to 1.
>
> vCPUs, Average Paging Rate (w/o new caps), Average Paging Rate (w/ new caps)
> 1       180     307
> 2        85     220
> 4        80     206
> 8        39     163
> 16       18     104
> 32        8      73
> 64        4      57
> 128       1      37
> 256       1      16
>
> I'm reporting paging rate on a per-vcpu rather than total basis, which
> is why the numbers look so different than the ones in the cover
> letter. I'm actually not sure why the demand paging rate falls off
> with the number of vCPUs (maybe a prioritization issue on my side?),
> but even when UFFDs aren't being contended for it's clear that demand
> paging via memory fault exits is significantly faster.
>
> I'll try to get some perf traces as well: that will take a little bit
> of time though, as to do it for cycler will involve patching our VMM
> first.
>
> [1] ./demand_paging_test -b 64M -u MINOR -s shmem -v <n> -r 1 [-w]

Thanks (for doing this test, and also to Nadav for all his inputs), and
sorry for the late response.

These numbers caught my eye, and I'm very curious why even 2 vcpus scale
that badly.  I gave it a shot on a test machine and got something slightly
different:

  Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (20 cores, 40 threads)

  $ ./demand_paging_test -b 512M -u MINOR -s shmem -v N

  |-------+----------+--------|
  | n_thr | per-vcpu |  total |
  |-------+----------+--------|
  |     1 |    39.5K |  39.5K |
  |     2 |    33.8K |  67.6K |
  |     4 |    31.8K | 127.2K |
  |     8 |    30.8K | 246.1K |
  |    16 |    21.9K | 351.0K |
  |-------+----------+--------|

I used larger RAM because I have fewer cores.  I didn't try 32+ vcpus, to
make sure two threads don't already contend on the same core/hardware
thread (I only have 40 hardware threads there), but we can still compare
with your lower half.
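
For reference, the "sharded" layout being discussed - one uffd per vCPU
slice, each drained by its own reader thread so vCPUs never share a wait
queue - could be sketched roughly as below.  This is only an illustration,
not the selftest code: it registers anonymous memory in MISSING mode and
resolves faults with UFFDIO_ZEROPAGE, whereas the test above uses shmem +
MINOR faults + UFFDIO_CONTINUE; the shard count and slice size are made
up, and depending on the kernel config it may need root or
vm.unprivileged_userfaultfd=1.

#include <fcntl.h>
#include <poll.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

#define NR_SHARDS   4                /* one per vCPU; count is made up */
#define SLICE_LEN   (64UL << 20)     /* 64M per vCPU, as in [1] */
#define PAGE_SZ     4096UL

struct shard {
	int uffd;
	char *base;
	pthread_t reader;
};

/* Each reader drains one uffd only, so vCPUs never share a wait queue. */
static void *reader_thread(void *arg)
{
	struct shard *s = arg;

	for (;;) {
		struct pollfd pfd = { .fd = s->uffd, .events = POLLIN };
		struct uffd_msg msg;

		if (poll(&pfd, 1, -1) <= 0)
			break;
		if (read(s->uffd, &msg, sizeof(msg)) != sizeof(msg))
			continue;
		if (msg.event != UFFD_EVENT_PAGEFAULT)
			continue;

		/* Resolve the fault within this shard's range. */
		struct uffdio_zeropage zp = {
			.range.start = msg.arg.pagefault.address & ~(PAGE_SZ - 1),
			.range.len   = PAGE_SZ,
		};
		ioctl(s->uffd, UFFDIO_ZEROPAGE, &zp);
	}
	return NULL;
}

static void shard_init(struct shard *s)
{
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg;

	s->base = mmap(NULL, SLICE_LEN, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	s->uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
	if (s->base == MAP_FAILED || s->uffd < 0 ||
	    ioctl(s->uffd, UFFDIO_API, &api))
		exit(1);

	reg.range.start = (unsigned long)s->base;
	reg.range.len   = SLICE_LEN;
	reg.mode        = UFFDIO_REGISTER_MODE_MISSING;
	if (ioctl(s->uffd, UFFDIO_REGISTER, &reg))
		exit(1);

	pthread_create(&s->reader, NULL, reader_thread, s);
}

int main(void)
{
	static struct shard shards[NR_SHARDS];

	for (int i = 0; i < NR_SHARDS; i++)
		shard_init(&shards[i]);

	/* Touch one page in each slice; every fault lands on its own queue. */
	for (int i = 0; i < NR_SHARDS; i++)
		printf("shard %d first byte: %d\n", i, shards[i].base[0]);

	return 0;
}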
When I was testing I noticed bad numbers and another bug in not using
NSEC_PER_SEC properly, so I did this before the test:

https://lore.kernel.org/all/20230427201112.2164776-1-peterx@xxxxxxxxxx/

I think it means it still doesn't scale that well, but it's not so bad
either - there's no obvious 1/2 drop when going to 2 vcpus.  There are
still a bunch of paths triggered in the test, so I also don't expect it to
scale fully linearly.  From my numbers I just didn't see anything as
drastic as yours.  I'm not sure whether it's simply a broken test number,
a parameter difference (e.g. you used only 64M per vCPU), or a hardware
difference.

--
Peter Xu
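
For anyone trying to reproduce the numbers above: the per-vcpu rate is
just pages resolved divided by wall-clock time.  A generic sketch of that
arithmetic using NSEC_PER_SEC follows - purely illustrative, not what the
patch linked above changes; the page count is made up.

#include <stdio.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000ULL

/* Convert a timespec to nanoseconds. */
static unsigned long long timespec_to_ns(struct timespec ts)
{
	return (unsigned long long)ts.tv_sec * NSEC_PER_SEC + ts.tv_nsec;
}

int main(void)
{
	struct timespec start, end;
	unsigned long long pages = 131072;	/* e.g. 512M of 4K pages */

	clock_gettime(CLOCK_MONOTONIC, &start);
	/* ... demand paging work would happen here ... */
	clock_gettime(CLOCK_MONOTONIC, &end);

	unsigned long long ns = timespec_to_ns(end) - timespec_to_ns(start);

	if (ns)
		printf("paging rate: %.1f pages/sec\n",
		       (double)pages * NSEC_PER_SEC / ns);
	return 0;
}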