Re: [PATCH] ACPI: Remove msleep() bloat from acpi_os_sleep()

Len Brown <lenb@xxxxxxxxxx> · Wed, 28 Aug 2024 00:06:27 -0400

On Tue, Aug 27, 2024 at 7:29 AM Rafael J. Wysocki <rafael@xxxxxxxxxx> wrote:
>
> First, let me add a few people who know more about timers than I do.
>
> On Tue, Aug 27, 2024 at 5:42 AM Len Brown <lenb@xxxxxxxxxx> wrote:
> >
> > From: Len Brown <len.brown@xxxxxxxxx>
> >
> > Optimize acpi_os_sleep(ms) using usleep_range(floor, ceiling).
> > The floor of the range is the exact requested ms,
> > with an additional 1ms of slack for sleeps above 20ms.
> >
> > This reduces  the kernel resume time of the Dell 9300
> > to 1,124 ms from 2,471 ms.
> >
> > The ACPI AML Sleep(ms) method calls acpi_os_sleep(ms),
> > which has invoked msleep(ms) since 2013.
> >
> > But msleep(ms) is based on jiffies, and the rounding-up
> > logic to convert to jiffies on a HZ=250 system causes
> > msleep(5) to bloat to a minimum of a 12ms delay.
> > msleep(5) typically takes over 15ms!
> >
> > As a result, AML delay loops with small Sleep() inside
> > magnify the entire loop.  A particularly painful example
> > is ACPI support for powering-on ICL and TGL
> > thunderbolt/pcie_ports during system resume.
> >
> > Regarding jiffy-based msleep() being inexpensive
> > and hrtimer-based usleep_range() being expensive.
> > ACPI AML timer invocations are rare, and so it
> > is unlikely the hrtimer cost will be noticible,
> > or even measurable.  At the same time, the msleep()
> > timer duration bloat is significant enough to
> > be noticed by end users.
>
> I'm not sure why you are refusing to follow the implementation of
> fsleep() and Documentation/timers/timers-howto.rst and still use
> msleep() for sleep durations longer than 20 ms.

timers_howto.rst could use an update to reflect reality.

It doesn't disclose how toxic msleep actually is for small values.
msleep(1) takes 11ms on a HZ=250 system.

Linux/ACPI has to support any random AML Sleep(ms) call,
and sometimes we see timeout loops implemented around
an inner Sleep(1ms).  If we use msleep those loops explode
by 11x and aggregate durations that are noticed by end users.

fsleep does three things -- and none of them are a good fit
for acpi_os_sleep:

static inline void fsleep(unsigned long usecs)
{
        if (usecs <= 10)
                udelay(usecs);
        else if (usecs <= 20000)
                usleep_range(usecs, 2 * usecs);
        else
                msleep(DIV_ROUND_UP(usecs, 1000));
}

> udelay(usecs);
will never execute in the ACPI case, as the minimum delay is 1000 usec.

> usleep_range(usecs, 2 * usecs);

timers-howto.rst says this:

                        "With the introduction of a range, the scheduler is
                        free to coalesce your wakeup with any other wakeup
                        that may have happened for other reasons, or at the
                        worst case, fire an interrupt for your upper bound."

But the way usleep_range works is it sets the timer for the
upper bound, and opportunistically wakes/cancels if another timer
fires after the lower bound and before the upper bound.

It calls it a "worst case" that the timer fires at the upper bound.

But when ACPI suspend/resume flows are running the only other
timer is the tick, and so the "worst case" happens effectively ALWAYS.

So when fsleep does a usleep_range(usecs, 2 * usecs), it is effectively
DOUBLING the duration of all timers 20 ms and smaller.
There may be scenarios where doing this makes sense,
but acpi_os_sleep() is not one of them.

> msleep(DIV_ROUND_UP(usecs, 1000));

msleep(50) takes 59.8ms -- a 20% penalty.
We have loops with AML Sleep(50) in the middle,
and this  code would bloat them by 20%, while
the user is waiting -- for no benefit. o
Again, there may be scenarios where doing this makes sense,
but acpi_os_sleep() is not one of them.

> Why should usleep_range() be used for 100 ms sleeps, for instance?
> This goes against the recommendation in the above document, so is
> there a particular reason?

The document doesn't say *why* msleep is recommended.
One would assume that when there are many timers, msleep
is efficient because it consolidates them to jiffy boundaries,
and if they are long duration timers, perhaps the assumption is
that they don't care so much about the additional delays?

Again, there are certainly scenarios where that makes sense,
but at the end of the day, msleep(100) takes 106 msec.

ACPI is not a heavy timer user, so the msleep efficiency justification
for making the user wait longer holds no weight.

Now that we realize that end-users notice timer bloat in acp_os_sleep,
it is clear that msleep is simply a poor choice for acpi_os_sleep.

> > Regarding usleep_range() timer coalescing.
> > It virtually never works during ACPI flows, which
> > commonly run when there are few coalescing
> > opportunities. As a result, the timers almost
> > always expire at the maximum end of their specified range.
>
> I don't think that's the main point of using a nonzero range in
> usleep_range().  AFAICS, this is about letting the timekeeping
> subsystem know how much you care about timer precision so it can
> arrange things to meet everyone's needs.

The range in usleep_range is to enable timer coalescing,
which is to reduce the number of timers firing on the system.
If it has another purpose, neither the code nor the API documentation
are forthcoming.

> > It was tempting to use usleep_range(us, us)
> > for all values of us.  But 1 ms is added to the
> > range for timers over 20ms on the reasoning that
> > the AML Sleep interface has a granularity of 1ms,
> > most costly loops use duration under 20ms inside,
> > and singular long sleeps are unlitly to notice an
> > additiona 1ms, so why not allow some coalescing...
>
> So again, why not use msleep() for sleeps longer than 20 ms?

per above.  Too slow.

> > Closes: https://bugzilla.kernel.org/show_bug.cgi?id=216263
> > Signed-off-by: Len Brown <len.brown@xxxxxxxxx>
> > Suggested-by: Arjan van de Ven <arjan@xxxxxxxxxxxxxxx>
> > Tested-by: Todd Brandt <todd.e.brandt@xxxxxxxxx>
> > ---
> >  drivers/acpi/osl.c | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
> > index 70af3fbbebe5..c4c76f86cd7a 100644
> > --- a/drivers/acpi/osl.c
> > +++ b/drivers/acpi/osl.c
> > @@ -607,7 +607,13 @@ acpi_status acpi_os_remove_interrupt_handler(u32 gsi, acpi_osd_handler handler)
> >
> >  void acpi_os_sleep(u64 ms)
> >  {
> > -       msleep(ms);
> > +       u64 us = ms * 1000;
> > +
> > +       if (us <= 20000)
> > +               usleep_range(us, us);
> > +       else
> > +               usleep_range(us, us + 1000);
> > +
> >  }
> >
> >  void acpi_os_stall(u32 us)
> > --
>
> While I agree with using usleep_range() for sleeps up to 20 ms in
> acpi_os_sleep(), I disagree with the patch as is.

The measurement results do not support any form of acpi_os_sleep that
invokes any form of msleep().

Honestly, I think because of when and how acpi_os_sleep is called, we should
consider making it yet simpler:

void acpi_os_sleep(u64 ms)
{
        u64 us = ms * 1000;

        usleep_range(us, us);
}

thanks,
-Len