Re: Extreme time jitter with suspend/resume cycles

Gabriel Beddingfield <gabe@xxxxxxxxxxxx> · Thu, 5 Oct 2017 13:51:03 -0700

Hi Thomas,

On Thu, Oct 5, 2017 at 11:01 AM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>> > Which SoC/clocksource driver are you talking about?
>>
>> NXP i.MX 6SoloX
>> drivers/clocksource/timer-imx-gpt.c
>
> So that clocksource driver looks correct. Do you have an idea in which
> context this time jump happens? Does it happen when you exercise your high
> frequency suspend/resume dance or is that happening just when you let the
> machine run forever as well?

We couldn't devise any reproduction steps. We observed it happening at
unexpected times in a fleet of devices -- and we couldn't find any
patterns to clue us in.

>
> The timekeeping_resume() path definitely has an issue:
>
>         cycle_now = tk_clock_read(&tk->tkr_mono);
>         if ((clock->flags & CLOCK_SOURCE_SUSPEND_NONSTOP) &&
>                 cycle_now > tk->tkr_mono.cycle_last) {
>
> This works nice for clocksources which wont wrap across suspend/resume but
> not for those which can. That cycle_now -> cycle_last check should take
> cs-mask into account ...
>
> Of course for clocksources which can wrap within realistic suspend times,
> which 36 hours might be accounted for, this would need an extra sanity
> check against a RTC whether wrap time has been exceeded.
>
> I haven't thought it through whether that buggered check fully explains
> what you are observing, but it's wrong nevertheless. John?

Nah. It looks like the consequence is that you'll either fail to inject
the sleep time or you'll fall back to having the RTC inject the sleep
time. In our case, we never sleep for more than a couple of minutes, so
the error would be seconds rather than hours.
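
To illustrate the mask-aware comparison described in the quoted text, here
is a rough userspace sketch in C. It is not the kernel code and not a
proposed patch: masked_delta(), wrap_period_sec() and
counter_delta_usable() are hypothetical names, and the 32-bit mask and
32768 Hz rate mentioned below are assumptions picked so that the wrap
period comes out at roughly 36 hours, in line with the figure quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical userspace model, not kernel code.
 *
 * Cycles elapsed between two reads of a free-running counter that is
 * only (mask + 1) wide. Unsigned subtraction followed by the mask gives
 * the right answer even if the counter wrapped once in between.
 */
static uint64_t masked_delta(uint64_t now, uint64_t last, uint64_t mask)
{
	return (now - last) & mask;
}

/*
 * Seconds until the counter wraps, given its mask and its rate in Hz.
 * Returns UINT64_MAX for a full 64-bit mask, where (mask + 1) would
 * overflow; such a counter is treated as never wrapping here.
 */
static uint64_t wrap_period_sec(uint64_t mask, uint64_t freq_hz)
{
	if (mask == UINT64_MAX)
		return UINT64_MAX;
	return (mask + 1) / freq_hz;
}

/*
 * Sanity check against an independent time source (assumed here to be
 * an RTC with one-second resolution): the masked delta is only
 * meaningful if the sleep was shorter than one wrap period.
 */
static bool counter_delta_usable(uint64_t rtc_sleep_sec, uint64_t mask,
				 uint64_t freq_hz)
{
	return rtc_sleep_sec < wrap_period_sec(mask, freq_hz);
}
```

With a 32-bit mask, last = 0xfffffff0 and now = 0x10, the plain
"now > last" comparison is false even though only 32 cycles have elapsed,
while masked_delta() returns 32. Under the assumed 32768 Hz rate,
wrap_period_sec() gives 131072 seconds, so a sleep of a couple of minutes
passes counter_delta_usable() and a sleep longer than the wrap period does
not.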

-gabe