> On Apr 24, 2023, at 5:15 PM, Anish Moorthy <amoorthy@xxxxxxxxxx> wrote:
>
> On Mon, Apr 24, 2023 at 12:44 PM Nadav Amit <nadav.amit@xxxxxxxxx> wrote:
>>
>>> On Apr 24, 2023, at 10:54 AM, Anish Moorthy <amoorthy@xxxxxxxxxx> wrote:
>>>
>>> On Fri, Apr 21, 2023 at 10:40 AM Nadav Amit <nadav.amit@xxxxxxxxx> wrote:
>>>>
>>>> If I understand the problem correctly, it sounds as if the proper solution
>>>> should be some kind of range-locks. If it is too heavy, or the interface can
>>>> be changed/extended to wake a single address (instead of a range),
>>>> simpler hashed-locks can be used.
>>>
>>> Some sort of range-based locking system does seem relevant, although I
>>> don't see how that would necessarily speed up the delivery of faults
>>> to UFFD readers: I'll have to think about it more.
>>
>> Perhaps I misread your issue. Based on the scalability issues you raised,
>> I assumed that the problem you encountered is related to lock contention.
>> I do not know whether you profiled it, but some information would be
>> useful.
>
> No, you had it right: the issue at hand is contention on the uffd wait
> queues. I'm just not sure what the range-based locking would really be
> doing. Events would still have to be delivered to userspace in an
> ordered manner, so it seems to me that each uffd would still need to
> maintain a queue (and the associated contention).

There are two queues: one for the pending faults that have not yet been
reported to userspace, and one for the faults that we might need to wake
up. The second one can use range locks.

Perhaps some hybrid approach would be best: do not block on page faults
that KVM runs into, which would remove the need to enqueue on fault_wqh.
But I do not know whether reporting through KVM instead of a
userfaultfd-based mechanism is very clean. I think that an io_uring-based
solution, such as the one I proposed before, would be more generic.
Actually, now that I understand your use case better, you do not need a
core to poll: you could just read the page-fault information from the
io_uring. A flag could then report whether the page fault blocked or not.

>
> With respect to the "sharding" idea, I collected some more runs of the
> self test (full command in [1]). This time I omitted the "-a" flag, so
> that every vCPU accesses a different range of guest memory with its
> own UFFD, and set the number of reader threads per UFFD to 1.

Just wondering, did you run the benchmark with DONTWAKE? It sounds as if
the wake-up is not needed.