Re: [PATCH 1/2] KVM: fix cache stale memslot info with correct mmio generation number

Paolo Bonzini <pbonzini@xxxxxxxxxx> · Mon, 18 Aug 2014 20:47:37 +0200

Il 18/08/2014 18:35, Xiao Guangrong ha scritto:
> 
> Hi Paolo,
> 
> Thank you to review the patch!
> 
> On Aug 18, 2014, at 9:57 PM, Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:
> 
>> Il 14/08/2014 09:01, Xiao Guangrong ha scritto:
>>> -	update_memslots(slots, new, kvm->memslots->generation);
>>> +	/* ensure generation number is always increased. */
>>> +	slots->generation = old_memslots->generation;
>>> +	update_memslots(slots, new);
>>> 	rcu_assign_pointer(kvm->memslots, slots);
>>> 	synchronize_srcu_expedited(&kvm->srcu);
>>> +	slots->generation++;
>>
>> I don't trust my brain enough to review this patch.
> 
> Sorry to make you confused. I should expain it more clearly.

Don't worry, it's not your fault. :)

>> kvm_current_mmio_generation seems like a very bad (race-prone) API.  One
>> patch I trust myself reviewing would change a bunch of functions in
>> kvm_main.c to take a memslots struct.  This would make it easy to
>> respect the hard and fast rule of not dereferencing the same pointer
>> twice.  But it would be a tedious change.
> 
> kvm_set_memory_region is the only place updating memslot and
> kvm_current_mmio_generation accesses memslot by rcu-dereference,
> i do not know why other places need to take into account.

The race occurs because gfn_to_pfn_many_atomic or some other function
has already used kvm_memslots().  Calling kvm_memslots() twice is the
root cause the bug.

> I think this patch is auditable, page-fault is always called by holding
> srcu-lock so that a page fault can’t go across synchronize_srcu_expedited.
> Only these cases can happen:
> 
> 1)  page fault occurs before synchronize_srcu_expedited.
>     In this case, vcpu will generate mmio-exit for the memslot being registered
>     by the ioctl. That’s ok since the ioctl have not finished.
> 
> 2) page fault occurs after synchronize_srcu_expedited and during
>    increasing generation-number.
>    In this case, userspace may get wrong mmio-exit (that happen if handing
>    page-fault is slower that the ioctl), that’s ok too since userspace need do
>   the check anyway like i said above.
> 
> 3) page fault occurs after generation-number update
>    that’s definitely correct. :)
> 
>> Another alternative could be to use the low bit to mark an in-progress
>> change, and skip the caching if the low bit is set.  Similar to a
>> seqcount (except if read_seqcount_retry fails, we just punt and not
>> retry anything), you could use it even though the memory barriers
>> provided by write_seqcount_begin/end are not too useful in this case.
> 
> I do not know how the bit works, page fault will cache the memslot before
> the bit set and cache the generation-number after the bit set.
> 
> Maybe i missed your idea, could you please detail it?

Something like this:

-	update_memslots(slots, new, kvm->memslots->generation);
+	/* ensure generation number is always increased. */
+	slots->generation = old_memslots->generation + 1;
+	update_memslots(slots, new);
	rcu_assign_pointer(kvm->memslots, slots);
	synchronize_srcu_expedited(&kvm->srcu);
+	slots->generation++;

Then case 1 and 2 will just have a cache miss.

The "low bit" is really just because each slot update does 2 generation
increases.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe stable" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html