On Wed, Jan 16, 2019 at 8:58 AM Ulf Hansson <ulf.hansson@xxxxxxxxxx> wrote:
>
> On Fri, 11 Jan 2019 at 12:07, Rafael J. Wysocki <rjw@xxxxxxxxxxxxx> wrote:
> >
> > On Thursday, November 29, 2018 6:46:36 PM CET Ulf Hansson wrote:
> > > From: Lina Iyer <lina.iyer@xxxxxxxxxx>
> > >
> > > Knowing the sleep duration of CPUs is needed when selecting the most
> > > energy efficient idle state for a CPU or a group of CPUs.
> > >
> > > However, to be able to compute the sleep duration, we need to know at
> > > what time the next expected wakeup is for the CPU. Therefore, let's
> > > export this information via a new function,
> > > tick_nohz_get_next_wakeup(). Subsequent changes make use of it.
> > >
> > > Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> > > Cc: Daniel Lezcano <daniel.lezcano@xxxxxxxxxx>
> > > Cc: Lina Iyer <ilina@xxxxxxxxxxxxxx>
> > > Cc: Frederic Weisbecker <fweisbec@xxxxxxxxx>
> > > Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> > > Signed-off-by: Lina Iyer <lina.iyer@xxxxxxxxxx>
> > > Co-developed-by: Ulf Hansson <ulf.hansson@xxxxxxxxxx>
> > > Signed-off-by: Ulf Hansson <ulf.hansson@xxxxxxxxxx>
> > > ---
> > >
> > > Changes in v10:
> > >         - Updated function header of tick_nohz_get_next_wakeup().
> > >
> > > ---
> > >  include/linux/tick.h     |  8 ++++++++
> > >  kernel/time/tick-sched.c | 13 +++++++++++++
> > >  2 files changed, 21 insertions(+)
> > >
> > > diff --git a/include/linux/tick.h b/include/linux/tick.h
> > > index 55388ab45fd4..e48f6b26b425 100644
> > > --- a/include/linux/tick.h
> > > +++ b/include/linux/tick.h
> > > @@ -125,6 +125,7 @@ extern bool tick_nohz_idle_got_tick(void);
> > >  extern ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next);
> > >  extern unsigned long tick_nohz_get_idle_calls(void);
> > >  extern unsigned long tick_nohz_get_idle_calls_cpu(int cpu);
> > > +extern ktime_t tick_nohz_get_next_wakeup(int cpu);
> > >  extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
> > >  extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
> > >
> > > @@ -151,6 +152,13 @@ static inline ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next)
> > >          *delta_next = TICK_NSEC;
> > >          return *delta_next;
> > >  }
> > > +
> > > +static inline ktime_t tick_nohz_get_next_wakeup(int cpu)
> > > +{
> > > +        /* Next wake up is the tick period, assume it starts now */
> > > +        return ktime_add(ktime_get(), TICK_NSEC);
> > > +}
> > > +
> > >  static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
> > >  static inline u64 get_cpu_iowait_time_us(int cpu, u64 *unused) { return -1; }
> > >
> > > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > > index 69e673b88474..7a9166506503 100644
> > > --- a/kernel/time/tick-sched.c
> > > +++ b/kernel/time/tick-sched.c
> > > @@ -1089,6 +1089,19 @@ unsigned long tick_nohz_get_idle_calls(void)
> > >          return ts->idle_calls;
> > >  }
> > >
> > > +/**
> > > + * tick_nohz_get_next_wakeup - return the next wake up of the CPU
> > > + * @cpu: the particular CPU to get next wake up for
> > > + *
> > > + * Called for idle CPUs only.
> > > + */
> > > +ktime_t tick_nohz_get_next_wakeup(int cpu)
> > > +{
> > > +        struct clock_event_device *dev = per_cpu(tick_cpu_device.evtdev, cpu);
> > > +
> > > +        return dev->next_event;
> > > +}
> > > +
> > >  static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
> > >  {
> > >  #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
> > >
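For reference, the consumer side would look roughly like this in a
governor. This is only a sketch: domain_state_fits() and the
min_residency_ns parameter are made-up names for illustration, not
code from the series:

#include <linux/cpumask.h>
#include <linux/kernel.h>
#include <linux/ktime.h>
#include <linux/tick.h>

static bool domain_state_fits(const struct cpumask *cpus, s64 min_residency_ns)
{
        ktime_t now = ktime_get();
        ktime_t next = KTIME_MAX;
        int cpu;

        /* Find the earliest expected wakeup among the CPUs in the domain. */
        for_each_cpu(cpu, cpus)
                next = min(next, tick_nohz_get_next_wakeup(cpu));

        /* Use the domain idle state only if the sleep is long enough. */
        return ktime_to_ns(ktime_sub(next, now)) > min_residency_ns;
}

That also shows where the concern below bites: the value read for a
remote CPU is only meaningful for as long as that CPU stays idle.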
> >
> > Well, I have concerns regarding this one.
> >
> > I don't believe it is valid to call this new function for non-idle
> > CPUs and the kerneldoc kind of says so, but the next patch doesn't
> > actually prevent it from being called for a non-idle CPU (at the time
> > it is called in there, the target CPU may not be idle any more AFAICS).
>
> You are correct, but let me clarify things.
>
> We are calling this new API from the new genpd governor, which may
> have a cpumask indicating that there is more than one CPU attached to
> its PM domain+sub-PM domains. In other words, we may call the API for
> another CPU than the one we are executing on.
>
> When the new genpd governor is called, all CPUs in the cpumask of the
> genpd in question are already runtime suspended and will remain so
> throughout the decisions made by the governor.
>
> However, because of the race condition, which needs to be managed by
> the genpd backend driver and its corresponding FW, one of the CPUs in
> the genpd cpumask could potentially wake up from idle while the genpd
> governor runs. However, as a part of exiting from idle, that CPU needs
> to wait for the call to pm_runtime_get_sync() to return before
> completing the exit path of idle. This also means waiting for the
> genpd governor to finish.

OK, so the CPU spins on a spinlock inside of the idle loop with
interrupts off.

> The point is, no matter what decision the governor takes under these
> circumstances, the genpd backend driver and its FW must manage this
> race condition during the last man standing. For PSCI OSI mode, it
> means that if a cluster idle state is suggested by Linux under these
> circumstances, it must be prevented and aborted.

I would suggest putting a comment somewhere to explain that, as it is
not really obvious.

> > In principle, the cpuidle core can store this value, say in struct
> > cpuidle_device of the given CPU, and expose a helper to access it
> > from genpd, but that would be extra overhead totally unnecessary on
> > everything that doesn't use genpd for cpuidle.
> >
> > So maybe the driver could store it in its ->enter callback? After
> > all, the driver knows that genpd is going to be used later.
>
> This would work, but it wouldn't really change much when it comes to
> the race condition described above.

No, it wouldn't make the race go away.

> Of course, it would turn the code into being more cpuidle specific,
> which seems reasonable to me.
>
> Anyway, if I understand your suggestion correctly, in principle it
> means changing $subject patch in such a way that the API no longer
> takes "int cpu" as an in-parameter, but instead only uses __this_cpu()
> to read out the next event for the current idle CPU.

Yes.

> Additionally, we need another new cpuidle API, which genpd can call to
> retrieve the per-CPU "next event data" stored by the cpuidle driver
> from its ->enter() callback. Is this a correct interpretation of your
> suggestion?

Yes, it is. Generally, something like "cpuidle, give me the wakeup time
of this CPU". And it may very well give you 0 if the CPU has woken up
already. :-)
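To make that concrete, here is a minimal sketch of the shape this could
take. All names below are hypothetical, and reusing the series'
tick_nohz_get_next_wakeup() for the local CPU is just one way to source
the value:

#include <linux/ktime.h>
#include <linux/percpu.h>
#include <linux/smp.h>
#include <linux/tick.h>

static DEFINE_PER_CPU(ktime_t, cpu_next_wakeup);

/* Called from the cpuidle driver's ->enter() callback, on the idling CPU. */
static void record_next_wakeup(void)
{
        __this_cpu_write(cpu_next_wakeup,
                         tick_nohz_get_next_wakeup(smp_processor_id()));
}

/* Called on the idle-exit path, once the CPU is awake again. */
static void clear_next_wakeup(void)
{
        __this_cpu_write(cpu_next_wakeup, 0);
}

/* What genpd would call: 0 means the CPU has woken up already. */
static ktime_t cpuidle_get_next_wakeup(int cpu)
{
        return per_cpu(cpu_next_wakeup, cpu);
}

Clearing the value on the idle-exit path is what gives the getter the
"0 if the CPU has woken up already" semantics described above.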