Re: [PATCH 4/5] KVM: add __kvm_request_needs_mb

Christian Borntraeger <borntraeger@xxxxxxxxxx> · Wed, 22 Feb 2017 20:23:05 +0100

On 02/22/2017 04:17 PM, Radim Krčmář wrote:
> [Oops, the end of this thread got dragged into a mark-as-read spree ...]
> 
> 2017-02-17 11:13+0100, David Hildenbrand:
>>>> This is really complicated stuff, and the basic reason for it (if I
>>>> remember correctly) is that s390x does reenable all interrupts when
>>>> entering the sie (see kvm-s390.c:__vcpu_run()). So the fancy smp-based
>>>> kicks don't work (as it is otherwise just racy), and if I remember
>>>> correctly, SMP reschedule signals (s390x external calls) would be
>>>> slower. (Christian, please correct me if I'm wrong)
>>>
>>> No the reason was that there are some requests that need to be handled
>>> outside run SIE. For example one reason was the guest prefix page.
>>> This must be mapped read/write ALL THE TIME when a guest is running,
>>> otherwise the host might crash. So we have to exit SIE and make sure that
>>> it does not reenter, therefore we use the RELOAD_MMU request from a notifier
>>> that is called from page table functions, whenever memory management decides
>>> to unmap/write protect (dirty pages tracking, reference tracking, page migration
>>> or compaction...)
>>>
>>> SMP-based request wills kick out the guest, but for some thing like the
>>> one above it will be too late.
>>
>> While what you said is 100% correct, I had something else in mind that
>> hindered using vcpu_kick() and especially kvm_make_all_cpus_request().
>> And I remember that being related to how preemption and
>> OUTSIDE_GUEST_MODE is handled. I think this boils down to what would
>> have to be implemented in kvm_arch_vcpu_should_kick().
>>
>> x86 can track the guest state using vcpu->mode, because they can be sure
>> that the guest can't reschedule while in the critical guest entry/exit
>> section. This is not true for s390x, as preemption is enabled. That's
>> why vcpu->mode cannot be used in its current form to track if a VCPU is
>> in/oustide/exiting guest mode. And kvm_make_all_cpus_request() currently
>> relies on this setting.
>>
>> For now, calling vcpu_kick() on s390x will result in a BUG().
>>
>>
>> On s390x, there are 3 use cases I see for requests:
>>
>> 1. Remote requests that need a sync
>>
>> Make a request, wait until SIE has been left and make sure the request
>> will be processed before re-entering the SIE. e.g. KVM_REQ_RELOAD_MMU
>> notifier in mmu notifier you mentioned. Also KVM_REQ_DISABLE_IBS is a
>> candidate.
> 
> Btw. aren't those requests racy?
> 
>     void exit_sie(struct kvm_vcpu *vcpu)
>     {
>     	atomic_or(CPUSTAT_STOP_INT, &vcpu->arch.sie_block->cpuflags);
> 
> If you get stalled here and the target VCPU handles the request and
> reenters SIE in the meantime, then you'll wait until its next exit.
> (And miss an unbounded amount of exits in the worst case.)
> 
>     	while (vcpu->arch.sie_block->prog0c & PROG_IN_SIE)
>     		cpu_relax();
>     }
> 

Its not racy for the purpose it was originally made for (get the vcpu 
out of SIE before we unmap a guest prefix page) as the MMU_RELOAD handler 
will wait for the pte lock which is held by the code that called
kvm_s390_sync_request(KVM_REQ_MMU_RELOAD, vcpu).

We also have the guarantee that after returning from kvm_s390_sync_request
we will have that request be handled before we reenter the guest, which is
all we need for DISABLE_IBS. 

But yes, all non MMU_RELOAD users might wait longer, possibly several guest
exits. We never noticed that as requests are really a seldom event. Basically
unmapping of the guest prefix page due to paging and migration, switching 
between 1 and more guest cpus and some other seldom events.

> And out of curiosity -- how many cycles does this loop usually take?
> 
>> 2. Remote requests that don't need a sync
>>
>> E.g. KVM_REQ_ENABLE_IBS doesn't strictly need it, while
>> KVM_REQ_DISABLE_IBS does.
> 
> A usual KVM request would kick the VCPU out of nested virt as well.
> Shouldn't it be done for these as well?
> 
>> 3. local requests
>>
>> E.g. KVM_REQ_TLB_FLUSH from kvm_s390_set_prefix()
>>
>>
>> Of course, having a unified interface would be better.
>>
>> /* set the request and kick the CPU out of guest mode */
>> kvm_set_request(req, vcpu);
>>
>> /* set the request, kick the CPU out of guest mode, wait until guest
>> mode has been left and make sure the request will be handled before
>> reentering guest mode */
>> kvm_set_sync_request(req, vcpu);
> 
> Sounds good, I'll also add
> 
>   kvm_set_self_request(req, vcpu);
> 
>> Same maybe even for multiple VCPUs (as there are then ways to speed it
>> up, e.g. first kick all, then wait for all)
>>
>> This would require arch specific callbacks to
>> 1. pre announce the request (e.g. set PROG_REQUEST on s390x)
>> 2. kick the cpu (e.g. CPUSTAT_STOP_INT and later
>> kvm_s390_vsie_kick(vcpu) on s390x)
>> 3. check if still executing the guest (e.g. PROG_IN_SIE on s390x)
>>
>> This would only make sense if there are other use cases for sync
>> requests. At least I remember that Power also has a faster way for
>> kicking VCPUs, not involving SMP rescheds. I can't judge if this is a
>> s390x only thing and is better be left as is :)
>>
>> At least vcpu_kick() could be quite easily made to work on s390x.
>>
>> Radim, are there also other users that need something like sync requests?
> 
> I think that ARM has a similar need when updating vgic, but relies on an
> asumption that VCPUs are going to be out after kicking them with
> kvm_make_all_cpus_request().
> (vgic_change_active_prepare in virt/kvm/arm/vgic/vgic-mmio.c)
> 
> Having synchronous requests in a common API should probably wait for the
> completion of the request, not just for the kick, which would make race
> handling simpler.
> 
> I'm not going to worry about them in this pass, though.
> 
> Thanks.
>