Am 23/09/2022 um 15:21 schrieb David Hildenbrand:
> On 23.09.22 15:10, Emanuele Giuseppe Esposito wrote:
>>
>> Am 19/09/2022 um 19:30 schrieb David Hildenbrand:
>>> On 19.09.22 09:53, David Hildenbrand wrote:
>>>> On 18.09.22 18:13, Emanuele Giuseppe Esposito wrote:
>>>>>
>>>>> Am 09/09/2022 um 16:30 schrieb Sean Christopherson:
>>>>>> On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
>>>>>>> KVM is currently capable of receiving a single memslot update
>>>>>>> through the KVM_SET_USER_MEMORY_REGION ioctl.
>>>>>>> The problem arises when we want to atomically perform multiple
>>>>>>> updates, so that readers of the active memslot list avoid seeing
>>>>>>> incomplete states.
>>>>>>>
>>>>>>> For example, in RHBZ
>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1979276
>>>>>>
>>>>>> I don't have access. Can you provide a TL;DR?
>>>>>
>>>>> You should be able to have access to it now.
>>>>>
>>>>>>> we see how non-atomic updates cause a boot failure, because vcpus
>>>>>>> will see a partial update (old memslot deleted, new one not yet
>>>>>>> created) and will crash.
>>>>>>
>>>>>> Why not simply pause vCPUs in this scenario? This is an awful lot
>>>>>> of complexity to take on for something that appears to be solvable
>>>>>> in userspace.
>>>>>
>>>>> I think it is not that easy to solve in userspace: see
>>>>> https://lore.kernel.org/qemu-devel/20200312161217.3590-1-david@xxxxxxxxxx/
>>>>>
>>>>> "Using pause_all_vcpus()/resume_all_vcpus() is not possible, as it
>>>>> will temporarily drop the BQL - something most callers can't handle
>>>>> (esp. when called from vcpu context e.g., in virtio code)."
>>>>
>>>> Can you please comment on the bigger picture? The patch from me works
>>>> around *exactly that*, and for that reason contains that comment.
>>>
>>> FWIW, I hacked up my RFC to perform atomic updates on any memslot
>>> transactions (not just resizes) where ranges to add overlap with
>>> ranges to remove.
>>>
>>> https://github.com/davidhildenbrand/qemu/tree/memslot
>>>
>>> I only performed a simple boot check under x86-64 (where I can see
>>> region resizes) and some make checks -- pretty sure it has some rough
>>> edges, but it should indicate what's possible and what the possible
>>> price might be. [one could wire up a new KVM ioctl and call it
>>> conditionally on support if really required]
>>
>> A benefit of my ioctl implementation is that it could also be used by
>> other hypervisors, which then would not need this kind of userspace
>> "hack". However, after also talking with Maxim and Paolo, we all
>> agreed that the main disadvantage of your approach is that it does not
>> scale with the number of vcpus. Stopping *all* vCPUs just to allow a
>> memslot update is also inferior to the ioctl, where KVM only pauses
>> the vCPUs that actually access the modified memslots.
>>
>> So I took some measurements to see what the performance difference is
>> between my implementation and yours. I used a machine where I could
>> replicate the bug mentioned in the bugzilla, an AMD EPYC 7413 24-Core
>> Processor with kernel 5.19.0 (+ my patches).
>>
>> Then I measured the time QEMU spends in kvm_commit (i.e. in memslot
>> updates) while booting a VM. In other words, if kvm_commit takes 10 ms
>> and QEMU calls it 20 times, "time to boot" is 200 ms. kvm_commit is
>> not called anymore after boot, so this measurement is easy to compare
>> across multiple invocations of QEMU.
>>
>> I ran the tests with different numbers of cores: 1, 2, 4, 8, 16, 24
>> and 32. The QEMU command is the same one used to replicate the bug:
>> ./qemu-system-x86_64 --overcommit cpu-pm=on --smp $v --accel kvm
>> --display none >> ~/smp_$v;
>>
>> Each boot is repeated 100 times, and from the results I compute the
>> average and stddev (in milliseconds).
>>
>> ioctl:
>> -smp 1:  Average: 2.1ms   Stdev: 0.8ms
>> -smp 2:  Average: 2.5ms   Stdev: 1.5ms
>> -smp 4:  Average: 2.2ms   Stdev: 1.1ms
>> -smp 8:  Average: 2.4ms   Stdev: 0.7ms
>> -smp 16: Average: 3.6ms   Stdev: 2.4ms  (1000 repetitions)
>> -smp 24: Average: 12.5ms  Stdev: 0.9ms  (1000 repetitions)
>>
>> pause/resume: (https://github.com/davidhildenbrand/qemu/tree/memslot)
>> -smp 1:  Average: 2.2ms   Stdev: 1.2ms
>> -smp 2:  Average: 3.0ms   Stdev: 1.4ms
>> -smp 4:  Average: 3.1ms   Stdev: 1.3ms
>> -smp 8:  Average: 3.4ms   Stdev: 1.4ms
>> -smp 16: Average: 12ms    Stdev: 7.0ms  (1000 repetitions)
>> -smp 24: Average: 20ms    Stdev: 7.3ms  (1000 repetitions)
>>
>> Above 24 vCPUs performance degrades quickly, but I think it is already
>> quite clear that the ioctl results scale better as the number of vcpus
>> increases, while pausing the vCPUs already becomes slower with 16
>> vcpus.
>
> Right, the question is if this happens sufficiently often that we even
> care, and whether there are ways to mitigate it.
>
> It doesn't necessarily have to scale with the #VCPUs, I think. What
> should dominate the overall time, in theory, is how long it takes for
> one VCPU (the slowest one) to leave the kernel.
>
> I wondered if
>
> 1) it might be easier to have a single KVM mechanism/call to kick all
> VCPUs out of KVM instead of doing it per VCPU. That alone might already
> speed things up considerably.

So if I understand correctly, this implies creating a new ioctl in KVM
anyway? What would then be the difference from what I am doing? We are
touching the kernel either way.
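Just to make the discussion concrete, this is roughly the shape of
interface I have in mind. Note that this is an illustrative sketch
only -- the struct/ioctl names and the layout below are made up for
this mail, not the actual uAPI of the series:

    /* Illustrative sketch, not the series' actual uAPI. */
    #include <linux/types.h>

    struct kvm_memslot_update {
            __u32 slot;
            __u32 flags;
            __u64 guest_phys_addr;
            __u64 memory_size;    /* bytes; 0 means delete the slot */
            __u64 userspace_addr;
    };

    /*
     * A single ioctl carries the whole transaction. KVM applies all
     * entries under slots_lock and publishes one new active memslot
     * list, so vCPUs never observe the intermediate "old slot
     * deleted, new slot not yet created" state.
     */
    struct kvm_memslot_batch {
            __u32 nent;           /* number of entries[] */
            __u32 padding;
            struct kvm_memslot_update entries[];
    };

The key point is that the deletions and the creations travel together
in one transaction, which is exactly what a sequence of
KVM_SET_USER_MEMORY_REGION calls cannot express.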
> 2) One thing I wondered is whether the biggest overhead is actually
> taking the locks in QEMU and not actually waiting for the VCPUs. Maybe
> we could optimize that as well. (for now I use one lock per VCPU
> because it felt like it would reduce the ioctl overhead; maybe there
> is a better alternative to balance between both users)
>
> So treat my patch as a completely unoptimized version.

For what it's worth, my version also performs #invalidate+1 swaps,
which is not optimized either.

Honestly, I don't see how the above is easier or simpler than what is
being proposed here.

Thank you,
Emanuele
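P.S.: For completeness, the userspace side of such a batched interface
would look roughly like the sketch below, assuming the (made-up) struct
definitions from the sketch earlier in this mail.
KVM_SET_USER_MEMORY_REGION_BATCH and its request number are
placeholders, not a real KVM ioctl:

    #include <stdlib.h>
    #include <string.h>
    #include <sys/ioctl.h>

    /* Placeholder request number for the hypothetical ioctl. */
    #define KVM_SET_USER_MEMORY_REGION_BATCH 0

    static int commit_memslots(int vm_fd,
                               const struct kvm_memslot_update *upd,
                               unsigned int n)
    {
            struct kvm_memslot_batch *batch;
            int ret;

            batch = calloc(1, sizeof(*batch) +
                              n * sizeof(batch->entries[0]));
            if (!batch)
                    return -1;

            batch->nent = n;
            memcpy(batch->entries, upd,
                   n * sizeof(batch->entries[0]));

            /*
             * One syscall: all deletions/creations in 'upd' become
             * visible to vCPUs atomically, with no
             * pause_all_vcpus()/BQL juggling in userspace.
             */
            ret = ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION_BATCH, batch);
            free(batch);
            return ret;
    }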