Hi, Nadav,

On Fri, May 05, 2023 at 01:05:02PM -0700, Nadav Amit wrote:
> > ./demand_paging_test -b 512M -u MINOR -s shmem -v 32 -c 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32
> >
> > It seems to me for some reason the scheduler ate more than I expected..
> > Maybe tomorrow I can try two more things:
> >
> >   - Do cpu isolations, and
> >   - pin reader threads too (or just leave the readers on housekeeping cores)
>
> For the record (and I hope I do not repeat myself): these scheduler overheads
> are something that I have encountered before.
>
> The two main solutions I tried were:
>
> 1. Optional polling on the faulting thread to avoid context switch on the
>    faulting thread.
>
>    (something like https://lore.kernel.org/linux-mm/20201129004548.1619714-6-namit@xxxxxxxxxx/ )
>
>    and
>
> 2. IO-uring to avoid context switch on the handler thread.
>
> In addition, as I mentioned before, the queue locks are something that can be
> simplified.

Right, thanks for double checking on that.  Though, do you think these are two
separate issues to be looked into?

One is reducing the context switch overhead with a static configuration, which
I think can be addressed by what you mentioned above and by the io_uring
series.

The other is the possibility of scaling userfaultfd by splitting the guest
memory into a few chunks (literally the demand paging test with no -a).
Logically that should scale, given pcpu pinning of the vcpu threads to avoid
KVM bottlenecks.

Side note: IIUC, none of the above will resolve the problem right now if we
assume we can only have one uffd registered to the guest memory.

However, I'm curious about testing multi-uffd because I want to make sure
nothing else stops the whole system from scaling with threads, hence I'd
expect a higher overall fault/sec as we increase the number of cores used in
the test.  If it already cannot scale for whatever reason, then a generic
solution may not be possible, at least for the KVM use case.  OTOH, if
multi-uffd scales well, then there's a chance for a general solution as long
as we can remove the single-queue contention over the whole guest memory.

PS: Nadav, I think you've mentioned twice the idea of avoiding taking two
locks for the fault queue, which sounds reasonable.  Do you have a plan to
post a patch?

Thanks,

--
Peter Xu
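
[Editor's illustration, not part of the original message: a minimal sketch of
what the "multi-uffd" registration discussed above could look like, with two
userfaultfd file descriptors each registered over its own chunk of one
mapping, so that each chunk gets an independent fault queue and handler
thread.  For brevity it uses MISSING mode on anonymous memory rather than the
MINOR-on-shmem setup of the test above; the sizes and layout are hypothetical.]

/* multi_uffd_sketch.c - split one mapping into two uffd-registered chunks.
 * Assumption-laden sketch: chunk size, error handling and memory type are
 * illustrative only.  May require CAP_SYS_PTRACE or
 * vm.unprivileged_userfaultfd=1 depending on the kernel configuration. */
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static int new_uffd(void)
{
	int fd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = { .api = UFFD_API, .features = 0 };

	if (fd < 0 || ioctl(fd, UFFDIO_API, &api)) {
		perror("userfaultfd");
		exit(1);
	}
	return fd;
}

int main(void)
{
	size_t chunk = 256UL << 20;	/* two hypothetical 256M chunks */
	char *mem = mmap(NULL, 2 * chunk, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int uffd[2] = { new_uffd(), new_uffd() };

	if (mem == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}

	for (int i = 0; i < 2; i++) {
		struct uffdio_register reg = {
			.range = {
				.start = (unsigned long)mem + i * chunk,
				.len = chunk,
			},
			.mode = UFFDIO_REGISTER_MODE_MISSING,
		};

		if (ioctl(uffd[i], UFFDIO_REGISTER, &reg)) {
			perror("UFFDIO_REGISTER");
			exit(1);
		}
		/* Each uffd[i] can now be drained by its own reader/handler
		 * thread, so faults in different chunks never contend on a
		 * single shared fault queue for the whole range. */
	}
	return 0;
}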