On 29/05/2017 22:14, Paul E. McKenney wrote:
> On Mon, May 29, 2017 at 06:38:19PM +0200, Paolo Bonzini wrote:
>> On 29/05/2017 17:50, Paul E. McKenney wrote:
>>> You actually are not supposed to call __srcu_read_lock() and
>>> __srcu_read_unlock() from irq context.  (Unless you know that you
>>> interrupted some context that is guaranteed never to use the
>>> srcu_struct in question or some such.)

I was going to improve the SRCU doc comments, and then started digging
into srcu.c history to find out why srcu_read_unlock uses this_cpu_inc.

It turns out that back in 2012 __srcu_read_lock did two accesses to
percpu variables, while __srcu_read_unlock did one (a decrement).
Then commit 5a41344a3d83 ("srcu: Simplify __srcu_read_unlock() via
this_cpu_dec()", 2012-11-29) changed __srcu_read_unlock to use
this_cpu_dec and removed the preempt_disable/preempt_enable pair.
Nowadays, __srcu_read_lock only does one access to percpu variables,
so we could do the same as commit 5a41344a3d83 did:

diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index 167ad8831aaf..4c1d5f7e62c4 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -172,9 +172,7 @@ static inline int srcu_read_lock(struct srcu_struct *sp) __acquires(sp)
 {
 	int retval;
 
-	preempt_disable();
 	retval = __srcu_read_lock(sp);
-	preempt_enable();
 	rcu_lock_acquire(&(sp)->dep_map);
 	return retval;
 }
diff --git a/kernel/rcu/srcu.c b/kernel/rcu/srcu.c
index 584d8a983883..dea03614263f 100644
--- a/kernel/rcu/srcu.c
+++ b/kernel/rcu/srcu.c
@@ -263,7 +263,7 @@ void cleanup_srcu_struct(struct srcu_struct *sp)
 
 /*
  * Counts the new reader in the appropriate per-CPU element of the
- * srcu_struct.  Must be called from process context.
+ * srcu_struct.
  * Returns an index that must be passed to the matching srcu_read_unlock().
  */
 int __srcu_read_lock(struct srcu_struct *sp)
@@ -271,7 +271,7 @@ int __srcu_read_lock(struct srcu_struct *sp)
 	int idx;
 
 	idx = READ_ONCE(sp->completed) & 0x1;
-	__this_cpu_inc(sp->per_cpu_ref->lock_count[idx]);
+	this_cpu_inc(sp->per_cpu_ref->lock_count[idx]);
 	smp_mb(); /* B */  /* Avoid leaking the critical section.  */
 	return idx;
 }
@@ -281,7 +281,6 @@ int __srcu_read_lock(struct srcu_struct *sp)
  * Removes the count for the old reader from the appropriate per-CPU
  * element of the srcu_struct.  Note that this may well be a different
  * CPU than that which was incremented by the corresponding srcu_read_lock().
- * Must be called from process context.
  */
 void __srcu_read_unlock(struct srcu_struct *sp, int idx)
 {

As said earlier, this would slow down SRCU on machines where neither a
fast this_cpu_inc nor atomic_inc_relaxed is available, and where
local_irq_save/restore is slower than disabling preemption.  The main
one is s390, which, however, is already paying that price in
__srcu_read_unlock.
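For reference, here is roughly what the pre-5a41344a3d83 reader side
looked like, paraphrased from memory using the old ->c[]/->seq[]
per-CPU fields rather than copied verbatim from the 2012 tree:

/*
 * Sketch of the 2012-era __srcu_read_lock, not the exact code.  The
 * two per-CPU increments must hit the same CPU's counters, hence the
 * preempt_disable/preempt_enable pair around them.
 */
int __srcu_read_lock(struct srcu_struct *sp)
{
	int idx;

	preempt_disable();
	idx = ACCESS_ONCE(sp->completed) & 0x1;
	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += 1;	/* access #1 */
	smp_mb(); /* B */  /* Avoid leaking the critical section. */
	ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->seq[idx]) += 1;	/* access #2 */
	preempt_enable();
	return idx;
}

A preemption between the two increments could migrate the task and
split the ->c[]/->seq[] updates across CPUs, which is what the
preempt_disable/preempt_enable pair was protecting against.  With
today's single increment that hazard is gone, so one this_cpu_inc
suffices.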
Any objections to the above patch for 4.12, with a commit message
explaining both the history of __srcu_read_unlock and what this fixes
in KVM?

Thanks,

Paolo

>>> Given that this is the KVM irq injection path, I am guessing that it
>>> is executed quite frequently, so that you do need a low-cost, highly
>>> scalable solution.  However, you only really need to exclude between
>>> process level and the various interrupts, so there are three ways
>>> that an architecture could handle this:
>>>
>>> 1. Single memory-to-memory increment instructions, as you say.
>>>
>>> 2. Atomic read-modify-write instructions, again as you say.
>>>
>>> 3. Disable interrupts across __srcu_read_lock() and
>>>    __srcu_read_unlock().
>>>
>>> Different architectures would want different approaches, depending on
>>> what instructions are available and the relative speed of these
>>> instructions.  It would not be hard to make this an architecture
>>> choice on the one hand or to provide an srcu_read_lock_irq() and
>>> srcu_read_unlock_irq() on the other.  But I recently got some serious
>>> pushback on RCU API proliferation on the one hand and SRCU in
>>> particular on the other.  So it might be good to look at some
>>> alternatives as well.
>>
>> Yeah, I liked Linus's suggestion of this_cpu_inc because it's already
>> used in srcu_read_unlock, and because it can fall back to (3)
>> automatically.  Relaxed atomics weren't much of a proposal, rather a
>> way to implement this_cpu_inc on some architectures that do not
>> support it yet (notably 32-bit ARM, MIPS and PPC).
>>
>> The disadvantage would be slowing down SRCU on machines where neither
>> (1) nor (2) is possible, and local_irq_save/restore is slower than
>> disabling preemption.  But I can certainly wrap ->irq_srcu lock/unlock
>> into my own helpers that take care of saving/restoring the interrupt
>> state, like
>>
>> #define kvm_irq_srcu_read_lock_irqsave(kvm, flags)
>> #define kvm_irq_srcu_read_unlock_irqrestore(kvm, flags)
>>
>> This would be very simple, confined to KVM and respectful of the SRCU
>> API requirements.
>
> I do like this approach!  If it turns out to be needed in (say) four
> other places, then it might make sense for me to pull this into SRCU
> as an additional API.

Good.  On the other hand, I was