Am 23/09/2022 um 15:21 schrieb David Hildenbrand:
> On 23.09.22 15:10, Emanuele Giuseppe Esposito wrote:
>>
>> Am 19/09/2022 um 19:30 schrieb David Hildenbrand:
>>> On 19.09.22 09:53, David Hildenbrand wrote:
>>>> On 18.09.22 18:13, Emanuele Giuseppe Esposito wrote:
>>>>>
>>>>> Am 09/09/2022 um 16:30 schrieb Sean Christopherson:
>>>>>> On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
>>>>>>> KVM is currently capable of receiving a single memslot update
>>>>>>> through the KVM_SET_USER_MEMORY_REGION ioctl.
>>>>>>> The problem arises when we want to atomically perform multiple
>>>>>>> updates, so that readers of the active memslot list avoid seeing
>>>>>>> incomplete states.
>>>>>>>
>>>>>>> For example, in RHBZ
>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1979276
>>>>>>
>>>>>> I don't have access. Can you provide a TL;DR?
>>>>>
>>>>> You should be able to have access to it now.
>>>>>
>>>>>>> we see how non-atomic updates cause a boot failure, because vcpus
>>>>>>> will see a partial update (old memslot deleted, new one not yet
>>>>>>> created) and will crash.
>>>>>>
>>>>>> Why not simply pause vCPUs in this scenario? This is an awful lot
>>>>>> of complexity to take on for something that appears to be solvable
>>>>>> in userspace.
>>>>>
>>>>> I think it is not that easy to solve in userspace: see
>>>>> https://lore.kernel.org/qemu-devel/20200312161217.3590-1-david@xxxxxxxxxx/
>>>>>
>>>>> "Using pause_all_vcpus()/resume_all_vcpus() is not possible, as it
>>>>> will temporarily drop the BQL - something most callers can't handle
>>>>> (esp. when called from vcpu context e.g., in virtio code)."
>>>>
>>>> Can you please comment on the bigger picture? The patch from me works
>>>> around *exactly that*, and for that reason contains that comment.
>>>
>>> FWIW, I hacked up my RFC to perform atomic updates on any memslot
>>> transactions (not just resizes) where ranges to add overlap with
>>> ranges to remove.
>>>
>>> https://github.com/davidhildenbrand/qemu/tree/memslot
>>>
>>> I only performed a simple boot check under x86-64 (where I can see
>>> region resizes) and some make checks -- pretty sure it has some rough
>>> edges, but it should indicate what's possible and what the possible
>>> price might be. [one could wire up a new KVM ioctl and call it
>>> conditionally on support if really required]
>>
>> A benefit of my ioctl implementation is that it could also be used by
>> other hypervisors, which then would not need this kind of userspace
>> "hack". However, after also talking with Maxim and Paolo, we all
>> agreed that the main disadvantage of your approach is that it does not
>> scale with the number of vcpus. Stopping *all* vCPUs just to allow a
>> memslot update is also inferior to the ioctl, where KVM only pauses
>> the vCPUs that actually access the modified memslots.
>>
>> So I took some measurements to see what the performance difference is
>> between my implementation and yours. I used a machine where I could
>> replicate the bug mentioned in the bugzilla, an AMD EPYC 7413 24-Core
>> Processor with kernel 5.19.0 (+ my patches).
>>
>> Then I measured the time QEMU spends in kvm_commit (i.e. in memslot
>> updates) while booting a VM. In other words, if kvm_commit takes 10 ms
>> and QEMU calls it 20 times, "time to boot" is 200 ms. kvm_commit is
>> not called anymore after boot, so this measurement is easy to compare
>> across multiple invocations of QEMU.
>>
>> I ran the tests with different numbers of cores: 1, 2, 4, 8, 16, 24
>> and 32. The QEMU command is the same one used to replicate the bug:
>> ./qemu-system-x86_64 --overcommit cpu-pm=on --smp $v --accel kvm
>> --display none >> ~/smp_$v;
>>
>> Each boot is repeated 100 times, and from the results I compute the
>> average and stddev (in milliseconds).
>>
>> ioctl:
>> -smp 1:  Average: 2.1ms   Stdev: 0.8ms
>> -smp 2:  Average: 2.5ms   Stdev: 1.5ms
>> -smp 4:  Average: 2.2ms   Stdev: 1.1ms
>> -smp 8:  Average: 2.4ms   Stdev: 0.7ms
>> -smp 16: Average: 3.6ms   Stdev: 2.4ms  (1000 repetitions)
>> -smp 24: Average: 12.5ms  Stdev: 0.9ms  (1000 repetitions)
>>
>> pause/resume: (https://github.com/davidhildenbrand/qemu/tree/memslot)
>> -smp 1:  Average: 2.2ms   Stdev: 1.2ms
>> -smp 2:  Average: 3.0ms   Stdev: 1.4ms
>> -smp 4:  Average: 3.1ms   Stdev: 1.3ms
>> -smp 8:  Average: 3.4ms   Stdev: 1.4ms
>> -smp 16: Average: 12ms    Stdev: 7.0ms  (1000 repetitions)
>> -smp 24: Average: 20ms    Stdev: 7.3ms  (1000 repetitions)
>>
>> Above 24 vCPUs performance degrades quickly, but I think it is already
>> quite clear that the ioctl results scale better as the number of vcpus
>> increases, while pausing the vCPUs already becomes slower with 16
>> vcpus.
>
> Right, the question is if this happens sufficiently often that we even
> care, and whether there are ways to mitigate it.
>
> It doesn't necessarily have to scale with the #VCPUs, I think. What
> should dominate the overall time, in theory, is how long it takes for
> one VCPU (the slowest one) to leave the kernel.
>
> I wondered if
>
> 1) it might be easier to have a single KVM mechanism/call to kick all
> VCPUs out of KVM instead of doing it per VCPU. That alone might already
> speed things up considerably.

So if I understand correctly, this implies creating a new ioctl in KVM
anyway? What would then be the difference from what I am doing? We are
touching the kernel either way.
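Just to make the discussion concrete, this is roughly the shape of
interface I have in mind. Note that this is an illustrative sketch
only -- the struct/ioctl names and the layout below are made up for
this mail, not the actual uAPI of the series:

    /* Illustrative sketch, not the series' actual uAPI. */
    #include <linux/types.h>

    struct kvm_memslot_update {
            __u32 slot;
            __u32 flags;
            __u64 guest_phys_addr;
            __u64 memory_size;    /* bytes; 0 means delete the slot */
            __u64 userspace_addr;
    };

    /*
     * A single ioctl carries the whole transaction. KVM applies all
     * entries under slots_lock and publishes one new active memslot
     * list, so vCPUs never observe the intermediate "old slot
     * deleted, new slot not yet created" state.
     */
    struct kvm_memslot_batch {
            __u32 nent;           /* number of entries[] */
            __u32 padding;
            struct kvm_memslot_update entries[];
    };

The key point is that the deletions and the creations travel together
in one transaction, which is exactly what a sequence of
KVM_SET_USER_MEMORY_REGION calls cannot express.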
> 2) One thing I wondered is whether the biggest overhead is actually
> taking the locks in QEMU and not actually waiting for the VCPUs. Maybe
> we could optimize that as well. (for now I use one lock per VCPU
> because it felt like it would reduce the ioctl overhead; maybe there
> is a better alternative to balance between both users)
>
> So treat my patch as a completely unoptimized version.

For what it's worth, my version also performs #invalidate+1 swaps,
which is not optimized either.

Honestly, I don't see how the above is easier or simpler than what is
being proposed here.

Thank you,
Emanuele
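P.S.: For completeness, the userspace side of such a batched interface
would look roughly like the sketch below, assuming the (made-up) struct
definitions from the sketch earlier in this mail.
KVM_SET_USER_MEMORY_REGION_BATCH and its request number are
placeholders, not a real KVM ioctl:

    #include <stdlib.h>
    #include <string.h>
    #include <sys/ioctl.h>

    /* Placeholder request number for the hypothetical ioctl. */
    #define KVM_SET_USER_MEMORY_REGION_BATCH 0

    static int commit_memslots(int vm_fd,
                               const struct kvm_memslot_update *upd,
                               unsigned int n)
    {
            struct kvm_memslot_batch *batch;
            int ret;

            batch = calloc(1, sizeof(*batch) +
                              n * sizeof(batch->entries[0]));
            if (!batch)
                    return -1;

            batch->nent = n;
            memcpy(batch->entries, upd,
                   n * sizeof(batch->entries[0]));

            /*
             * One syscall: all deletions/creations in 'upd' become
             * visible to vCPUs atomically, with no
             * pause_all_vcpus()/BQL juggling in userspace.
             */
            ret = ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION_BATCH, batch);
            free(batch);
            return ret;
    }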