On Friday, May 04, 2012, Colin Cross wrote:
> On Fri, May 4, 2012 at 4:51 AM, Rafael J. Wysocki <rjw@xxxxxxx> wrote:
> > On Friday, May 04, 2012, Colin Cross wrote:
> >> On Thu, May 3, 2012 at 3:14 PM, Rafael J. Wysocki <rjw@xxxxxxx> wrote:
> > [...]
> >>
> >> >> +/**
> >> >> + * cpuidle_coupled_cpus_waiting - check if all cpus in a coupled set are waiting
> >> >> + * @coupled: the struct coupled that contains the current cpu
> >> >> + *
> >> >> + * Returns true if all cpus coupled to this target state are in the wait loop
> >> >> + */
> >> >> +static inline bool cpuidle_coupled_cpus_waiting(struct cpuidle_coupled *coupled)
> >> >> +{
> >> >> +        int alive;
> >> >> +        int waiting;
> >> >> +
> >> >> +        /*
> >> >> +         * Read alive before reading waiting so a booting cpu is not treated as
> >> >> +         * idle
> >> >> +         */
> >> >
> >> > Well, the comment doesn't really explain much. In particular, why the boot CPU
> >> > could be treated as idle if the reads were in a different order.
> >>
> >> Hm, I think the race condition is on a cpu going down. What about:
> >> Read alive before reading waiting. If waiting is read before alive,
> >> this cpu could see another cpu as waiting just before it goes offline,
> >> between when it the other cpu decrements waiting and when it
> >> decrements alive, which could cause alive == waiting when one cpu is
> >> not waiting.
> >
> > Reading them in this particular order doesn't stop the race, though. I mean,
> > if the hotplug happens just right after you've read alive_count, you still have
> > a wrong value. waiting_count is set independently, it seems, so there's no
> > ordering between the two on the "store" side and the "load" side ordering
> > doesn't matter.
> As commented in the hotplug path, hotplug relies on the fact that one
> of the cpus in the cluster is involved in the hotplug of the cpu that
> is changing (this may not be true for multiple clusters, but it is
> easy to fix by IPI-ing to a cpu that is in the same cluster when that
> happens).

That's very fragile and potentially sets a trap for people trying to make
the kernel work on systems with multiple clusters.

> That means that waiting count is always guaranteed to be at
> least 1 less than alive count when alive count changes. All this read
> ordering needs to do is make sure that this cpu doesn't see
> waiting_count == alive_count by reading them in the wrong order.

So, the concern seems to be that if the local CPU reorders the reads from
waiting_count and alive_count and enough time elapses between one read and
the other, the decrementation of waiting_count may happen between them and
then the CPU may use the outdated value for comparison, right?

Still, though, even if the barrier is there, the modification of alive_count
in the hotplug notifier routine may not happen before the read from
alive_count in cpuidle_coupled_cpus_waiting() is completed. Isn't that a
problem?

> > I would just make the CPU hotplug notifier routine block until
> > cpuidle_enter_state_coupled() is done and the latter return immediately
> > if the CPU hotplug notifier routine is in progress, perhaps falling back
> > to the safe state. Or I would make the CPU hotplug notifier routine
> > disable the "coupled cpuidle" entirely on DOWN_PREPARE and UP_PREPARE
> > and only re-enable it after the hotplug has been completed.
>
> I'll take a look at disabling coupled idle completely during hotplug.

Great, thanks!
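In case it helps, below is a completely untested sketch of the second variant
I was describing (the cpuidle_coupled_hotplug_in_progress flag and the
function names are made up for illustration).  The notifier sets the flag on
UP_PREPARE/DOWN_PREPARE and clears it once the hotplug operation has either
completed or been canceled:

#include <linux/cpu.h>
#include <linux/notifier.h>

static bool cpuidle_coupled_hotplug_in_progress;

static int cpuidle_coupled_cpu_notify(struct notifier_block *nb,
                                      unsigned long action, void *hcpu)
{
        switch (action & ~CPU_TASKS_FROZEN) {
        case CPU_UP_PREPARE:
        case CPU_DOWN_PREPARE:
                /* make new idle entries fall back to the safe state */
                cpuidle_coupled_hotplug_in_progress = true;
                break;
        case CPU_ONLINE:
        case CPU_DEAD:
        case CPU_UP_CANCELED:
        case CPU_DOWN_FAILED:
                /* alive_count has been updated, coupled idle may resume */
                cpuidle_coupled_hotplug_in_progress = false;
                break;
        }

        return NOTIFY_OK;
}

static struct notifier_block cpuidle_coupled_cpu_notifier = {
        .notifier_call = cpuidle_coupled_cpu_notify,
};

cpuidle_enter_state_coupled() would then check the flag before joining the
coupled set and go straight to dev->safe_state_index while it is set.  Of
course the notifier still has to be registered with register_cpu_notifier(),
and, more importantly, the cpus already waiting in the coupled path would
have to be poked out of it before the hotplug operation proceeds, which the
sketch above doesn't show.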
> >> >> +        alive = atomic_read(&coupled->alive_count);
> >> >> +        smp_rmb();
> >> >> +        waiting = atomic_read(&coupled->waiting_count);
> >> >
> >> > Have you considered using one atomic variable to accommodate both counters
> >> > such that the upper half contains one counter and the lower half contains
> >> > the other?
> >>
> >> There are 3 counters (alive, waiting, and ready). Do you want me to
> >> squish all of them into a single atomic_t, which would limit to 1023
> >> cpus?
> >
> > No. I'd make sure that cpuidle_enter_state_coupled() did't race with CPU
> > hotplug, so as to make alive_count stable from its standpoint, and I'd
> > put the two remaining counters into one atomic_t variable.
>
> I'll take a look at using a single atomic_t. My initial worry was
> that the increased contention on the shared variable would cause more
> cmpxchg retries, but since waiting_count and ready_count are designed
> to be modified in sequential phases that shouldn't be an issue.
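Just for illustration, the kind of packing I mean is something like the
following (an untested sketch with made-up names, assuming alive_count is
kept stable over the coupled idle path as discussed above).  With
waiting_count in the low half and ready_count in the high half of one
atomic_t, a single atomic_read() gives a consistent snapshot of both:

#include <linux/atomic.h>

#define WAITING_BITS            16
#define MAX_WAITING_CPUS        (1 << WAITING_BITS)
#define WAITING_MASK            (MAX_WAITING_CPUS - 1)
#define READY_MASK              (~WAITING_MASK)

static inline int coupled_count_waiting(int count)
{
        return count & WAITING_MASK;
}

static inline int coupled_count_ready(int count)
{
        return (count & READY_MASK) >> WAITING_BITS;
}

/* waiting_count++: new number of waiting cpus (while ready_count is 0) */
static inline int coupled_inc_waiting(atomic_t *ready_waiting_counts)
{
        return atomic_inc_return(ready_waiting_counts) & WAITING_MASK;
}

/* ready_count++ without disturbing the waiting half */
static inline void coupled_inc_ready(atomic_t *ready_waiting_counts)
{
        atomic_add(MAX_WAITING_CPUS, ready_waiting_counts);
}

/* both halves come from one read, so they cannot be torn apart */
static inline bool coupled_all_waiting(atomic_t *ready_waiting_counts,
                                       int alive)
{
        return coupled_count_waiting(atomic_read(ready_waiting_counts)) == alive;
}

With that, most of the read barriers in the waiting/ready handshake should
become unnecessary, since there is no ordering to enforce between two
separate variables any more.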
> [...]
> >> >> +        while (!need_resched() && !cpuidle_coupled_cpus_waiting(coupled)) {
> >> >> +                entered_state = cpuidle_enter_state(dev, drv,
> >> >> +                        dev->safe_state_index);
> >> >> +
> >> >> +                local_irq_enable();
> >> >> +                while (cpumask_test_cpu(dev->cpu, &cpuidle_coupled_poked_mask))
> >> >> +                        cpu_relax();
> >> >
> >> > Hmm. What exactly is this loop supposed to achieve?
> >> This is to ensure that the outstanding wakeups have been processed so
> >> we don't go to idle with an interrupt pending an immediately wake up.
> >
> > I see. Is it actually safe to reenable interrupts at this point, though?
>
> I think so. The normal idle loop will enable interrupts in a similar
> fashion to what happens here. There are two things to worry about: a
> processed interrupt causing work to be scheduled that should bring
> this cpu out of idle, or changing the next timer which would
> invalidate the current requested state. The first is handled by
> checking need_resched() after interrupts are disabled again, the
> second is currently unhandled but does not affect correct operation,
> it just races into a less-than-optimal idle state.

I see.

> >> >> +                local_irq_disable();
> >> >
> >> > Anyway, you seem to be calling it twice along with this enabling/disabling of
> >> > interrupts. I'd put that into a separate function and explain its role in a
> >> > kerneldoc comment.
> >>
> >> I left it here to be obvious that I was enabling interrupts in the
> >> idle path, but I can refactor it out if you prefer.
> >
> > Well, you can call the function to make it obvious. :-)
> >
> > Anyway, I think that code duplication is a worse thing than a reasonable
> > amount of non-obviousness, so to speak.
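To be specific, the helper I was thinking of is simply the loop you already
have, wrapped and documented, along these lines (untested, and the name is
made up):

/**
 * cpuidle_coupled_handle_pokes - let pending pokes be processed (sketch)
 * @dev: the cpuidle device for the current cpu
 *
 * Briefly reenable interrupts so that an outstanding poke IPI can be taken,
 * then spin until this cpu's bit is cleared from the poked mask, so that we
 * don't enter the coupled state with the wakeup still pending.
 *
 * Must be called with interrupts disabled, returns with interrupts disabled.
 */
static void cpuidle_coupled_handle_pokes(struct cpuidle_device *dev)
{
        local_irq_enable();
        while (cpumask_test_cpu(dev->cpu, &cpuidle_coupled_poked_mask))
                cpu_relax();
        local_irq_disable();
}

Both places that currently open-code this sequence could then just call it,
which would address the duplication without hiding the fact that interrupts
get reenabled in the idle path.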
> >> >> +        }
> >> >> +
> >> >> +        /* give a chance to process any remaining pokes */
> >> >> +        local_irq_enable();
> >> >> +        while (cpumask_test_cpu(dev->cpu, &cpuidle_coupled_poked_mask))
> >> >> +                cpu_relax();
> >> >> +        local_irq_disable();
> >> >> +
> >> >> +        if (need_resched()) {
> >> >> +                cpuidle_coupled_set_not_waiting(dev, coupled);
> >> >> +                goto out;
> >> >> +        }
> >> >> +
> >> >> +        /*
> >> >> +         * All coupled cpus are probably idle. There is a small chance that
> >> >> +         * one of the other cpus just became active. Increment a counter when
> >> >> +         * ready, and spin until all coupled cpus have incremented the counter.
> >> >> +         * Once a cpu has incremented the counter, it cannot abort idle and must
> >> >> +         * spin until either the count has hit alive_count, or another cpu
> >> >> +         * leaves idle.
> >> >> +         */
> >> >> +
> >> >> +        smp_mb__before_atomic_inc();
> >> >> +        atomic_inc(&coupled->ready_count);
> >> >> +        smp_mb__after_atomic_inc();
> >> >
> >> > It seems that at least one of these barriers is unnecessary ...
> >> The first is to ensure ordering between ready_count and waiting count,
> >
> > Are you afraid that the test against waiting_count from
> > cpuidle_coupled_cpus_waiting() may get reordered after the incrementation
> > of ready_count or is it something else?
>
> Yes, ready_count must not be incremented before waiting_count == alive_count.

Well, control doesn't reach the atomic_inc() statement if this condition is
not satisfied, so I don't see how it can be possibly reordered before the
while () loop without breaking the control flow guarantees.

> >> the second is for ready_count vs. alive_count and requested_state.
> >
> > This one I can understand, but ...
> >
> >> >> +        /* alive_count can't change while ready_count > 0 */
> >> >> +        alive = atomic_read(&coupled->alive_count);
> >
> > What happens if CPU hotplug happens right here?
>
> According to the comment above that line that can't happen -
> alive_count can't change while ready_count > 0, because that implies
> that all cpus are waiting and none can be in the hotplug path where
> alive_count is changed. Looking at it again that is not entirely
> true, alive_count could change on systems with >2 cpus, but I think it
> can't cause an issue because alive_count would be 2 greater than
> waiting_count before alive_count was changed. Either way, it will be
> fixed by disabling coupled idle during hotplug.

Yup.

> >> >> +        while (atomic_read(&coupled->ready_count) != alive) {
> >> >> +                /* Check if any other cpus bailed out of idle. */
> >> >> +                if (!cpuidle_coupled_cpus_waiting(coupled)) {
> >> >> +                        atomic_dec(&coupled->ready_count);
> >> >> +                        smp_mb__after_atomic_dec();
> >
> > And the barrier here? Even if the old value of ready_count leaks into
> > the while () loop after retry, that doesn't seem to matter.
>
> All of these will be academic if ready_count and waiting_count share
> an atomic_t.
> waiting_count must not be decremented by exiting the while loop after
> the retry label until ready_count is decremented here, but that is
> also protected by the barrier in set_not_waiting. One of them could
> be dropped.
>
> >> >> +                        goto retry;
> >> >> +                }
> >> >> +
> >> >> +                cpu_relax();
> >> >> +        }
> >> >> +
> >> >> +        /* all cpus have acked the coupled state */
> >> >> +        smp_rmb();
> >> >
> >> > What is the barrier here for?
> >> This protects ready_count vs. requested_state. It is already
> >> implicitly protected by the atomic_inc_return in set_waiting, but I
> >> thought it would be better to protect it explicitly here. I think I
> >> added the smp_mb__after_atomic_inc above later, which makes this one
> >> superflous, so I'll drop it.
> >
> > OK
> >
> >> >> +
> >> >> +        next_state = cpuidle_coupled_get_state(dev, coupled);
> >> >> +
> >> >> +        entered_state = cpuidle_enter_state(dev, drv, next_state);
> >> >> +
> >> >> +        cpuidle_coupled_set_not_waiting(dev, coupled);
> >> >> +        atomic_dec(&coupled->ready_count);
> >> >> +        smp_mb__after_atomic_dec();
> >> >> +
> >> >> +out:
> >> >> +        /*
> >> >> +         * Normal cpuidle states are expected to return with irqs enabled.
> >> >> +         * That leads to an inefficiency where a cpu receiving an interrupt
> >> >> +         * that brings it out of idle will process that interrupt before
> >> >> +         * exiting the idle enter function and decrementing ready_count. All
> >> >> +         * other cpus will need to spin waiting for the cpu that is processing
> >> >> +         * the interrupt. If the driver returns with interrupts disabled,
> >> >> +         * all other cpus will loop back into the safe idle state instead of
> >> >> +         * spinning, saving power.
> >> >> +         *
> >> >> +         * Calling local_irq_enable here allows coupled states to return with
> >> >> +         * interrupts disabled, but won't cause problems for drivers that
> >> >> +         * exit with interrupts enabled.
> >> >> +         */
> >> >> +        local_irq_enable();
> >> >> +
> >> >> +        /*
> >> >> +         * Wait until all coupled cpus have exited idle. There is no risk that
> >> >> +         * a cpu exits and re-enters the ready state because this cpu has
> >> >> +         * already decremented its waiting_count.
> >> >> +         */
> >> >> +        while (atomic_read(&coupled->ready_count) != 0)
> >> >> +                cpu_relax();
> >> >> +
> >> >> +        smp_rmb();
> >> >
> >> > And here?
> >>
> >> This was to protect ready_count vs. looping back in and reading
> >> alive_count.
> >
> > Well, I'm lost. :-)
> >
> > You've not modified anything after the previous smp_mb__after_atomic_dec(),
> > so what exactly is the reordering this is supposed to work against?
> >
> > And while we're at it, I'm not quite sure what the things that the previous
> > smp_mb__after_atomic_dec() separates from each other are.
>
> Instead of justifying all of these, let me try the combined atomic_t
> trick and justify the (many fewer) remaining barriers.

OK, cool! :-)

Thanks,
Rafael

_______________________________________________
linux-pm mailing list
linux-pm@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linuxfoundation.org/mailman/listinfo/linux-pm