On 29/05/2017 17:50, Paul E. McKenney wrote:
> On Mon, May 29, 2017 at 05:05:47PM +0200, Paolo Bonzini wrote:
>> Adding Paul McKenney...
>>
>> On 29/05/2017 16:30, Linu Cherian wrote:
>>> In the KVM irq injection path, irqfd_wakeup in interrupt context
>>> calls srcu_read_lock(&kvm->irq_srcu). A worker thread, irqfd_inject
>>> in process context, also calls srcu_read_lock inside kvm_set_irq.
>>>
>>> This can lead to a race condition while incrementing the
>>> lock_count (using __this_cpu_inc), since atomic instructions are
>>> not used.
>>>
>>> Replacing __this_cpu_inc with this_cpu_inc (the atomic version)
>>> works around the problem, which backs up this analysis.
>>
>> Paul, how would you feel about using this_cpu_inc in
>> __srcu_read_lock and __srcu_read_unlock? Many architectures already
>> provide an efficient this_cpu_inc, and any architecture that has
>> LL/SC could provide it easily.
>
> You actually are not supposed to call __srcu_read_lock() and
> __srcu_read_unlock() from irq context. (Unless you know that you
> interrupted some context that is guaranteed never to use the
> srcu_struct in question or some such.)
>
> Given that this is the KVM irq injection path, I am guessing that it
> is executed quite frequently, so that you do need a low-cost, highly
> scalable solution. However, you only really need to exclude between
> process level and the various interrupts, so there are three ways
> that an architecture could handle this:
>
> 1. Single memory-to-memory increment instructions, as you say.
>
> 2. Atomic read-modify-write instructions, again as you say.
>
> 3. Disable interrupts across __srcu_read_lock() and
>    __srcu_read_unlock().
>
> Different architectures would want different approaches, depending on
> what instructions are available and the relative speed of these
> instructions. It would not be hard to make this an architecture
> choice on the one hand, or to provide an srcu_read_lock_irq() and
> srcu_read_unlock_irq() on the other. But I recently got some serious
> pushback on RCU API proliferation on the one hand and SRCU in
> particular on the other. So it might be good to look at some
> alternatives as well.

Yeah, I liked Linu's suggestion of this_cpu_inc because it's already
used in srcu_read_unlock, and because it can fall back to (3)
automatically. Relaxed atomics weren't much of a proposal, rather a
way to implement this_cpu_inc on some architectures that do not
support it yet (notably 32-bit ARM, MIPS and PPC). The disadvantage
would be slowing down SRCU on machines where neither (1) nor (2) is
possible and local_irq_save/restore is slower than disabling
preemption.

But I can certainly wrap ->irq_srcu lock/unlock into my own helpers
that take care of saving/restoring the interrupt state, like

  #define kvm_irq_srcu_read_lock_irqsave(kvm, flags)
  #define kvm_irq_srcu_read_unlock_irqrestore(kvm, flags)

This would be very simple, confined to KVM and respectful of the SRCU
API requirements.
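Something like this rough, untested sketch (note that the unlock
helper would also need the index returned by srcu_read_lock, so I have
added it to the parameter list here):

/*
 * Untested sketch: disable interrupts across the process-level
 * ->irq_srcu read-side critical section, so that the non-atomic
 * per-CPU counter increments in __srcu_read_lock/__srcu_read_unlock
 * cannot race with irqfd_wakeup running in irq context on this CPU.
 */
#define kvm_irq_srcu_read_lock_irqsave(kvm, flags)		\
({								\
	int __idx;						\
	local_irq_save(flags);					\
	__idx = srcu_read_lock(&(kvm)->irq_srcu);		\
	__idx;							\
})

#define kvm_irq_srcu_read_unlock_irqrestore(kvm, idx, flags)	\
do {								\
	srcu_read_unlock(&(kvm)->irq_srcu, idx);		\
	local_irq_restore(flags);				\
} while (0)

The process-level users (e.g. the srcu_read_lock in kvm_set_irq) would
switch to these helpers, with an "unsigned long flags" local, while
irqfd_wakeup keeps using plain srcu_read_lock/srcu_read_unlock from
its interrupt handler.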
> But first, one question... Does the entire interrupt run in the guest
> in interrupt context in the host?

It's interrupt context in the host.

> Or is there some sort of wakeup sequence
> where the guest's interrupt handler runs in process context from the
> host's perspective? From a quick glance at the code, it looks like
> the guest's interrupt handler runs in process context (from the host's
> perspective) and that the SRCU read-side critical section is confined
> to the host's interrupt handler.

Correct.

> Another (stupid) question... Can the process-level uses of ->irq_srcu
> instead use ->srcu?

I suppose you would synchronize_srcu on both in the write side?
Probably not, because ->srcu has really, really long read-side
critical sections: basically the entire EQS (extended quiescent state)
of RCU is a read-side ->srcu critical section. Unfortunately the write
side of ->irq_srcu is also relatively performance-critical, and since
->irq_srcu read-side critical sections are very short, having a
separate SRCU instance solved the issue quite nicely.

Originally Christian Borntraeger had proposed turning the relevant
synchronize_rcu into synchronize_rcu_expedited, but as you said it's
not good for real-time, so here we are.

Paolo

> As the code is written now, it is OK to use
> srcu_read_lock() and srcu_read_unlock() from interrupt handlers as
> long as they are never used from either process level or from nested
> interrupt handlers. Does that work for you?
>
> If this is the case, another approach would be for updates to wait for
> both an SRCU grace period and an RCU-sched grace period, perhaps with
> the latter expedited, though that won't make the real-time guys happy.
>
> 							Thanx, Paul
>
>> Thanks,
>>
>> Paolo
>>
>>> Possible solution
>>> -----------------
>>> One way is to avoid the srcu_read_lock/unlock usage in irq context.
>>> On the arm/arm64 architecture,
>>>
>>>   if (kvm_arch_set_irq_inatomic(&irq, kvm, KVM_USERSPACE_IRQ_SOURCE_ID, 1,
>>>                                 false) == -EWOULDBLOCK)
>>>
>>> in irqfd_wakeup is always true, and hence schedule_work can be
>>> called directly:
>>>
>>> @@ -195,6 +195,11 @@ int __attribute__((weak)) kvm_arch_set_irq_inatomic(
>>>  	int idx;
>>>
>>>  	if (flags & POLLIN) {
>>> +		if (kvm_arch_set_irq_will_block_always()) {
>>> +			schedule_work(&irqfd->inject);
>>> +			goto skiplock;
>>> +		}
>>> +
>>>  		idx = srcu_read_lock(&kvm->irq_srcu);
>>>  		do {
>>>  			seq = read_seqcount_begin(&irqfd->irq_entry_sc);
>>> @@ -208,6 +213,7 @@ int __attribute__((weak)) kvm_arch_set_irq_inatomic(
>>>  		srcu_read_unlock(&kvm->irq_srcu, idx);
>>>  	}
>>>
>>> +skiplock:
>>>  	if (flags & POLLHUP) {
>>>  		/* The eventfd is closing, detach from KVM */
>>>  		unsigned long flags;
>>>
>>> This works without giving any warnings as well.
>>>
>>> Is a patch welcome in that direction?
>>>
>>> Appreciate your feedback on this.
>>
>