Hi, Nadav,

On Fri, May 05, 2023 at 01:05:02PM -0700, Nadav Amit wrote:
> > ./demand_paging_test -b 512M -u MINOR -s shmem -v 32 -c 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32
> >
> > It seems to me for some reason the scheduler ate more than I expected..
> > Maybe tomorrow I can try two more things:
> >
> >   - Do cpu isolations, and
> >   - pin reader threads too (or just leave the readers on housekeeping cores)
>
> For the record (and I hope I do not repeat myself): these scheduler overheads
> are something that I have encountered before.
>
> The two main solutions I tried were:
>
> 1. Optional polling on the faulting thread to avoid context switch on the
>    faulting thread.
>
>    (something like https://lore.kernel.org/linux-mm/20201129004548.1619714-6-namit@xxxxxxxxxx/ )
>
>    and
>
> 2. IO-uring to avoid context switch on the handler thread.
>
> In addition, as I mentioned before, the queue locks are something that can be
> simplified.

Right, thanks for double checking on that.  Though, do you think these are two
separate issues to be looked into?

One is reducing the context switch overhead with a static configuration, which
I think can be addressed by what you mentioned above and by the io_uring
series.

The other is the possibility of scaling userfaultfd by splitting the guest
memory into a few chunks (literally the demand paging test with no -a).
Logically that should scale, given pcpu pinning of the vcpu threads to avoid
KVM bottlenecks.

Side note: IIUC, none of the above will resolve the problem right now if we
assume we can only have one uffd registered to the guest memory.

However, I'm curious about testing multi-uffd because I want to make sure
nothing else stops the whole system from scaling with threads, hence I'd
expect a higher overall fault/sec as we increase the number of cores used in
the test.  If it already cannot scale for whatever reason, then a generic
solution may not be possible, at least for the KVM use case.  OTOH, if
multi-uffd scales well, then there's a chance for a general solution as long
as we can remove the single-queue contention over the whole guest memory.

PS: Nadav, I think you've mentioned twice the idea of avoiding taking two
locks for the fault queue, which sounds reasonable.  Do you have a plan to
post a patch?

Thanks,

--
Peter Xu
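
[Editor's illustration, not part of the original message: a minimal sketch of
what the "multi-uffd" registration discussed above could look like, with two
userfaultfd file descriptors each registered over its own chunk of one
mapping, so that each chunk gets an independent fault queue and handler
thread.  For brevity it uses MISSING mode on anonymous memory rather than the
MINOR-on-shmem setup of the test above; the sizes and layout are hypothetical.]

/* multi_uffd_sketch.c - split one mapping into two uffd-registered chunks.
 * Assumption-laden sketch: chunk size, error handling and memory type are
 * illustrative only.  May require CAP_SYS_PTRACE or
 * vm.unprivileged_userfaultfd=1 depending on the kernel configuration. */
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static int new_uffd(void)
{
	int fd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = { .api = UFFD_API, .features = 0 };

	if (fd < 0 || ioctl(fd, UFFDIO_API, &api)) {
		perror("userfaultfd");
		exit(1);
	}
	return fd;
}

int main(void)
{
	size_t chunk = 256UL << 20;	/* two hypothetical 256M chunks */
	char *mem = mmap(NULL, 2 * chunk, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int uffd[2] = { new_uffd(), new_uffd() };

	if (mem == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}

	for (int i = 0; i < 2; i++) {
		struct uffdio_register reg = {
			.range = {
				.start = (unsigned long)mem + i * chunk,
				.len = chunk,
			},
			.mode = UFFDIO_REGISTER_MODE_MISSING,
		};

		if (ioctl(uffd[i], UFFDIO_REGISTER, &reg)) {
			perror("UFFDIO_REGISTER");
			exit(1);
		}
		/* Each uffd[i] can now be drained by its own reader/handler
		 * thread, so faults in different chunks never contend on a
		 * single shared fault queue for the whole range. */
	}
	return 0;
}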