Re: [PATCH 1/2] KVM: Optimize kvm_make_vcpus_request_mask() a bit

Vitaly Kuznetsov <vkuznets@xxxxxxxxxx> · Mon, 23 Aug 2021 10:03:46 +0200

Sean Christopherson <seanjc@xxxxxxxxxx> writes:

> On Fri, Aug 20, 2021, Vitaly Kuznetsov wrote:
>> Iterating over set bits in 'vcpu_bitmap' should be faster than going
>> through all vCPUs, especially when just a few bits are set.
>> 
>> Signed-off-by: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
>> ---
>>  virt/kvm/kvm_main.c | 49 +++++++++++++++++++++++++++++----------------
>>  1 file changed, 32 insertions(+), 17 deletions(-)
>> 
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 3e67c93ca403..0f873c5ed538 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -257,34 +257,49 @@ static inline bool kvm_kick_many_cpus(const struct cpumask *cpus, bool wait)
>>  	return true;
>>  }
>>  
>> +static void kvm_make_vcpu_request(struct kvm *kvm, struct kvm_vcpu *vcpu,
>> +				  unsigned int req, cpumask_var_t tmp)
>> +{
>> +	int cpu = vcpu->cpu;
>
> This reminds me, syzbot found a data race a while back[*] in kvm_vcpu_kick()
> related to reading vcpu->cpu.  That race is benign, but legitimate.  I believe
> this code has a similar race, and I'm not as confident that it's benign.
>
> If the target vCPU changes vcpu->cpu after it's read by this code, then the IPI
> can sent to the wrong pCPU, e.g. this pCPU gets waylaid by an IRQ and the target
> vCPU is migrated to a new pCPU.
>
> The TL;DR is that the race is benign because the target vCPU is still guaranteed
> to see the request before entering the guest, even if the IPI goes to the wrong
> pCPU.  I believe the same holds true for KVM_REQUEST_WAIT, e.g. if the lockless
> shadow PTE walk gets migrated to a new pCPU just before setting vcpu->mode to
> READING_SHADOW_PAGE_TABLES, this code can use a stale "cpu" for __cpumask_set_cpu().
> The race is benign because the vCPU would have to enter READING_SHADOW_PAGE_TABLES
> _after_ the SPTE modifications were made, as vcpu->cpu can't change while the vCPU
> is reading SPTEs.  The same logic holds true for the case where the vCPU is migrated
> after the call to __cpumask_set_cpu(); the goal is to wait for the vCPU to return to
> OUTSIDE_GUEST_MODE, which is guaranteed if the vCPU is migrated even if this path
> doesn't wait for an ack from the _new_ pCPU.
>
> I'll send patches to fix the races later today, maybe they can be folded into
> v2?  Even though the races are benign, I think they're worth fixing, if only to
> provide an opportunity to document why it's ok to send IPIs to the
> wrong pCPU.

You're blazingly fast as usual :-) I'll do v2 on top of your patches.

>
> [*] On an upstream kernel, but I don't think the bug report was posted to LKML.
>
>> +
>> +	kvm_make_request(req, vcpu);
>> +
>> +	if (!(req & KVM_REQUEST_NO_WAKEUP) && kvm_vcpu_wake_up(vcpu))
>> +		return;
>> +
>> +	if (tmp != NULL && cpu != -1 && cpu != raw_smp_processor_id() &&
>
> For large VMs, might be worth keeping get_cpu() in the caller in passing in @me?

The only reason against was that I've tried keeping the newly introduced
kvm_make_vcpu_request()'s interface nicer, like it can be reused some
day. Will get back to get_cpu() in v2.

>
>> +	    kvm_request_needs_ipi(vcpu, req))
>> +		__cpumask_set_cpu(cpu, tmp);
>> +}
>> +
>>  bool kvm_make_vcpus_request_mask(struct kvm *kvm, unsigned int req,
>>  				 struct kvm_vcpu *except,
>>  				 unsigned long *vcpu_bitmap, cpumask_var_t tmp)
>>  {
>> -	int i, cpu, me;
>> +	int i;
>>  	struct kvm_vcpu *vcpu;
>>  	bool called;
>>  
>> -	me = get_cpu();
>> -
>> -	kvm_for_each_vcpu(i, vcpu, kvm) {
>> -		if ((vcpu_bitmap && !test_bit(i, vcpu_bitmap)) ||
>> -		    vcpu == except)
>> -			continue;
>> -
>> -		kvm_make_request(req, vcpu);
>> -		cpu = vcpu->cpu;
>> -
>> -		if (!(req & KVM_REQUEST_NO_WAKEUP) && kvm_vcpu_wake_up(vcpu))
>> -			continue;
>> +	preempt_disable();
>>  
>> -		if (tmp != NULL && cpu != -1 && cpu != me &&
>> -		    kvm_request_needs_ipi(vcpu, req))
>> -			__cpumask_set_cpu(cpu, tmp);
>> +	if (likely(vcpu_bitmap)) {
>
> I don't think this is actually "likely".  kvm_make_all_cpus_request() is by far
> the most common caller and does not pass in a vcpu_bitmap.  Practically speaking
> I highly don't the code organization will matter, but from a documentation
> perspective it's wrong.

Right, I was thinking more about two other users: IOAPIC and Hyper-V who
call kvm_make_vcpus_request_mask() directly but I agree that
kvm_make_all_cpus_request() is probably much more common.

>
>> +		for_each_set_bit(i, vcpu_bitmap, KVM_MAX_VCPUS) {
>> +			vcpu = kvm_get_vcpu(kvm, i);
>> +			if (!vcpu || vcpu == except)
>> +				continue;
>> +			kvm_make_vcpu_request(kvm, vcpu, req, tmp);
>> +		}
>> +	} else {
>> +		kvm_for_each_vcpu(i, vcpu, kvm) {
>> +			if (vcpu == except)
>> +				continue;
>> +			kvm_make_vcpu_request(kvm, vcpu, req, tmp);
>> +		}
>>  	}
>>  
>>  	called = kvm_kick_many_cpus(tmp, !!(req & KVM_REQUEST_WAIT));
>> -	put_cpu();
>> +
>> +	preempt_enable();
>>  
>>  	return called;
>>  }
>> -- 
>> 2.31.1
>> 
>

-- 
Vitaly