Re: [PATCH RFC] KVM: MMU: Don't use RCU for lockless shadow walking

On 04/24/2012 09:37 AM, Xiao Guangrong wrote:
> On 04/24/2012 12:16 AM, Avi Kivity wrote:
>
> > Using RCU for lockless shadow walking can increase the amount of memory
> > in use by the system, since RCU grace periods are unpredictable.  We also
> > have an unconditional write to a shared variable (reader_counter), which
> > isn't good for scaling.
> > 
> > Replace that with a scheme similar to x86's get_user_pages_fast(): disable
> > interrupts during lockless shadow walk to force the freer
> > (kvm_mmu_commit_zap_page()) to wait for the TLB flush IPI to find the
> > processor with interrupts enabled.
> > 
> > We also add a new vcpu->mode, READING_SHADOW_PAGE_TABLES, to prevent
> > kvm_flush_remote_tlbs() from avoiding the IPI.
> > 
> > Signed-off-by: Avi Kivity <avi@xxxxxxxxxx>
> > ---
> > 
> > Turned out to be simpler than expected.  However, I think there's a problem
> > with make_all_cpus_request() possibly reading an incorrect vcpu->cpu.
>
>
> It seems possible.
>
> Can we fix it by reading vcpu->cpu only when the vcpu is in GUEST_MODE or
> EXITING_GUEST_MODE?  (IIRC, interrupts are disabled in these modes.)
>
> Like:
>
> if (kvm_vcpu_exiting_guest_mode(vcpu) != OUTSIDE_GUEST_MODE)
>                       cpumask_set_cpu(vcpu->cpu, cpus);

I think it is actually okay.  We are only vulnerable if a lockless shadow
walk starts during prepare_zap_page() and extends past
kvm_flush_remote_tlbs(), yes?  But in that case vcpu->cpu is stable,
since local_irq_disable() kills preemption.
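
For reference, a minimal sketch of how the two halves of the scheme
interlock (illustrative only; the bodies here are not the actual patch
hunks, though the names all appear in the patch description):

	/*
	 * Walker side: disabling interrupts pins us to this cpu, so
	 * vcpu->cpu is stable, and blocks delivery of the TLB flush
	 * IPI until the walk is finished.
	 */
	static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
	{
		local_irq_disable();
		vcpu->mode = READING_SHADOW_PAGE_TABLES;
	}

	static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
	{
		vcpu->mode = OUTSIDE_GUEST_MODE;
		local_irq_enable();
	}

	/*
	 * Freer side: kvm_mmu_commit_zap_page() calls
	 * kvm_flush_remote_tlbs(), which IPIs any cpu whose vcpu is in
	 * READING_SHADOW_PAGE_TABLES mode and waits for the IPI to be
	 * serviced; since the walker has interrupts off, the free
	 * cannot complete until the walk has ended.
	 */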

> > 
> >  static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
> >  {
> > -	/* Decrease the counter after walking shadow page table finished */
> > -	smp_mb__before_atomic_dec();
> > -	atomic_dec(&vcpu->kvm->arch.reader_counter);
> > -	rcu_read_unlock();
>
>
> Don't we need a mb here to prevent the setting of vcpu->mode from being
> reordered before the spte reads/writes?  (It is safe on x86, but we need
> a comment at least?)

I don't think so.  Documentation/memory-barriers.txt says:

Any atomic operation that modifies some state in memory and returns
information about the state (old or new) implies an SMP-conditional
general memory barrier (smp_mb()) on each side of the actual operation
(with the exception of explicit lock operations, described later).
These include:

        xchg();
        cmpxchg();
        atomic_cmpxchg();
        atomic_inc_return();
        atomic_dec_return();
        atomic_add_return();
        atomic_sub_return();
        atomic_inc_and_test();
        atomic_dec_and_test();
        atomic_sub_and_test();
        atomic_add_negative();
        atomic_add_unless();    /* when succeeds (returns 1) */
        test_and_set_bit();
        test_and_clear_bit();
        test_and_change_bit();


-- 
error compiling committee.c: too many arguments to function
