* Alan Stern (stern@xxxxxxxxxxxxxxxxxxx) wrote:
> On Tue, 26 Oct 2010, Mathieu Desnoyers wrote:
> 
> > * Peter Zijlstra (peterz@xxxxxxxxxxxxx) wrote:
> > > On Tue, 2010-10-26 at 11:56 -0500, Pierre Tardy wrote:
> > > > 
> > > > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> > > >         atomic_inc(&dev->power.usage_count);
> > > 
> > > That's terribly racy..
> > 
> > Looking at the original code, it looks racy even without considering the
> > tracepoint:
> > 
> > int __pm_runtime_get(struct device *dev, bool sync)
> > {
> >         int retval;
> > 
> > +       trace_runtime_pm_usage(dev, atomic_read(&dev->power.usage_count)+1);
> >         atomic_inc(&dev->power.usage_count);
> >         retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);
> > 
> > There is no implied memory barrier after "atomic_inc". So either all these
> > inc/dec are protected with mutexes or spinlocks, in which case one might
> > wonder why atomic operations are used at all, or it's a racy mess. (I vote
> > for the second option)
> 
> I don't understand.  What's the problem?  The inc/dec are atomic
> because they are not protected by spinlocks, but everything else is
> (aside from the tracepoint, which is new).
> 
> > kref should certainly be used there.
> 
> What for?

kref has the following "get":

        atomic_inc(&kref->refcount);
        smp_mb__after_atomic_inc();

What seems to be missing in __pm_runtime_get() and pm_runtime_get_noresume()
is the memory barrier after the atomic increment. The atomic increment is
free to be reordered into the following spinlock (within pm_runtime_resume()
or pm_request_resume() execution), because taking a spinlock only acts as a
memory barrier with acquire semantics, not as a full memory barrier.

So AFAIU, the failure scenario would be as follows (sorry for the 80+
columns):

initial conditions: usage_count = 1

CPU A                                           CPU B

1)  __pm_runtime_get() (sync = true)
2)    atomic_inc(&usage_count)
      (not committed to memory yet)
3)    pm_runtime_resume()
4)      spin_lock_irqsave(&dev->power.lock, flags);
5)      retval = __pm_request_resume(dev);
6)        (execute the body of
           __pm_request_resume and return)
                                                7)  __pm_runtime_put() (sync = true)
                                                8)    if (atomic_dec_and_test(
                                                              &dev->power.usage_count))
                                                      (still sees usage_count == 1 before
                                                       the decrement, thus decrements to 0)
                                                9)    pm_runtime_idle()
10)     spin_unlock_irqrestore(&dev->power.lock, flags)
                                                11)     spin_lock_irq(&dev->power.lock);
                                                12)     retval = __pm_runtime_idle(dev);
                                                13)     spin_unlock_irq(&dev->power.lock);

So we end up in a situation where CPU A expects the device to be resumed, but
the last action performed has been to bring it to idle.

A smp_mb__after_atomic_inc() between lines 2 and 3 would fix this.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
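
For concreteness, a minimal sketch of the proposed fix, written against the
2.6.36-era __pm_runtime_get() quoted in the thread (the trailing return
statement is assumed, since the quote cuts off before it;
pm_runtime_get_noresume() would need the same one-line addition):

        int __pm_runtime_get(struct device *dev, bool sync)
        {
                int retval;

                atomic_inc(&dev->power.usage_count);
                /*
                 * Full barrier after the increment, mirroring kref_get().
                 * Without it, the increment can be reordered into the
                 * critical section taken inside the resume path, because
                 * a spinlock acquisition orders only what follows it.
                 */
                smp_mb__after_atomic_inc();
                retval = sync ? pm_runtime_resume(dev) : pm_request_resume(dev);

                return retval;
        }

With the barrier in place, CPU A's increment at step 2 is globally visible
before the spin_lock_irqsave() at step 4, so CPU B's atomic_dec_and_test()
at step 8 sees usage_count == 2 and does not bring the device to idle.
(Later kernels renamed this barrier smp_mb__after_atomic().)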