On 09/10/2013 12:22 PM, Viresh Kumar wrote: > Back finally and I see lots of mails over cpufreq stuff.. :) > > On 3 September 2013 18:50, Srivatsa S. Bhat > <srivatsa.bhat@xxxxxxxxxxxxxxxxxx> wrote: >> This doesn't solve the problem completely: it prevents the store_*() task >> from continuing *only* when it concurrently executes the __cpufreq_governor() >> function along with the CPU offline task. But if the two calls don't overlap, >> we will still have the possibility where the store_*() task tries to acquire >> the timer mutex after the CPU offline task has just finished destroying it. > > How exactly? My brain is still on vacations :) > The race window is not limited to __cpufreq_governor() alone. It just starts there, but doesn't end there; it extends beyond that point. That's the main problem. And that's why serializing __cpufreq_governor() won't completely solve the issue. CPU_DOWN: __cpufreq_governor(STOP) ----+ ... | ... | RACE ... | WINDOW ... | sysfs teardown ----+ In the STOP call, we destroy the timer mutex and set cur_policy = NULL. So the incorrect access to timer mutex can occur from that point until we finally unlink or teardown the sysfs files and prevent userspace from writing to those files. Thus, this window involves stuff other than the call to __cpufreq_governor() as well. So serializing __cpufreq_governor() alone is not enough. > Anyway, this was one of the problem that I tried to solve with my patch. > But there can be other race conditions where things can go wrong and so > that patch may still be useful. > I agree, but see below (its the "may be useful" part that I'm not convinced about). > Call to __cpufreq_governor() must be serialized I believe. > Sure, its a good idea to serialize __cpufreq_governor(). However, we need specific justification for that. Just a hunch that something will go wrong is not enough, IMHO. We have to *prove* that something is indeed wrong, and only then add appropriate synchronization to fix it. Besides, even the commit message appeared a bit odd. It mentions something about problems with recursive calls. What cases are those? When do we end up recursively calling __cpufreq_governor()? And hence, why can't we use plain locks and must instead use yet another quick-and-dirty variable to protect that function? [Using variables like that is bad from a lockdep perspective as well (apart from other things) : lockdep can't catch any resulting locking issues.] Other related questions that are worth answering would be: why should we add the synchronization *inside* __cpufreq_governor()? Why not add it at its call-sites? IOW, I'm all in for fixing problems, just that I'd be comfortable with the changes only when we completely understand what problem we are fixing and why that patch is the right fix for that problem. Regards, Srivatsa S. Bhat -- To unsubscribe from this list: send the line "unsubscribe cpufreq" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html