On 21/06/2024 10:07, Geert Uytterhoeven wrote:
Hi Saravana,
On Fri, Jun 21, 2024 at 3:08 AM Saravana Kannan <saravanak@xxxxxxxxxx> wrote:
On Wed, Jun 5, 2024 at 4:16 AM Tomi Valkeinen
<tomi.valkeinen@xxxxxxxxxxxxxxxx> wrote:
On 05/06/2024 13:53, Ulf Hansson wrote:
On Wed, 5 Jun 2024 at 12:41, Tomi Valkeinen
<tomi.valkeinen@xxxxxxxxxxxxxxxx> wrote:
On 05/06/2024 12:34, Ulf Hansson wrote:
On Mon, 27 May 2024 at 14:41, Geert Uytterhoeven
<geert+renesas@xxxxxxxxx> wrote:
Since commit a47cf07f60dcb02d ("serial: core: Call
device_set_awake_path() for console port"), the serial driver properly
handles the case where the serial console is part of the awake path, and
it looked like we could start removing special serial console handling
from PM Domain drivers like the R-Mobile SYSC PM Domain driver.
Unfortunately the devil is in the details, as usual...
Earlycon relies on the serial port to be initialized by the firmware
and/or bootloader. Linux is not aware of any hardware dependencies that
must be met to keep the port working, and thus cannot guarantee they
stay met, until the full serial driver takes over.
E.g. all unused clocks and unused PM Domains are disabled in a late
initcall. As this happens after the full serial driver has taken over,
the serial port's clock and/or PM Domain are no longer deemed unused,
and this is typically not a problem.
Let's call this "Case A".
However, if the serial port's clock or PM Domain is shared with another
device, and that other device is runtime-suspended before the full
serial driver has probed, the serial port's clock and/or PM Domain will
be disabled inadvertently. Any subsequent serial console output will
cause a crash or system lock-up. E.g. on R/SH-Mobile SoCs, the serial
ports share their PM Domain with several other I/O devices. After the
use of pwm (Armadillo-800-EVA) or i2c (KZM-A9-GT) during early boot,
before the full serial driver takes over, the PM Domain containing the
early serial port is powered down, causing a lock-up when booted with
"earlycon".
Let's call this "Case B".
Thanks for the detailed description of the problem! As pointed out in
regards to another similar recent patch [1], this is indeed a generic
problem, not limited to the serial console handling.
At Linaro Connect a few weeks ago I followed up with Saravana from the
earlier discussions at LPC last fall. We now have a generic solution
for genpd drafted on plain paper, based on fw_devlink and the
->sync_state() callback. I am currently working on the genpd series,
while Saravana will re-spin the series (can't find the link to the
last version) for the clock framework. Ideally, we want these things
to work in a very similar way.
That said, allow me to post the series for genpd in a week or two to
see if it can solve your problem too, for the serial console.
Both the genpd and the clock solutions will make suppliers depend on all
their consumers to be probed, right?
I think it is a solution, and should be worked on, but it has the
drawback that suppliers that have consumers that will possibly never be
probed, will also never be able to turn off unused resources.
This was specifically the case with the TI ti-sci pmdomain case I was
looking at: the genpd driver (ti_sci_pm_domains.c) provides a lot of
genpds for totally unrelated devices, and so if, e.g., you don't have or
don't want to load a driver for the GPU, all PDs are affected.
Even here the solutions you mention will help: instead of things getting
broken because genpds get turned off while they are actually in use, the
genpds will be kept enabled, thus fixing the breakage. Unfortunately,
they'll be kept enabled forever.
I've been ill for quite a while so I haven't had the chance to look at
this more, but before that I was hacking around a bit with something I
named .partial_sync_state(). .sync_state() gets called when all the
consumers have probed, but .partial_sync_state() gets called when _a_
consumer has been probed.
For the .sync_state() things are easy for the driver, as it knows
everything related has been probed, but for .partial_sync_state() the
driver needs to track resources internally. .partial_sync_state() will
tell the driver that a consumer device has probed, the driver can then
find out which specific resources (genpds in my case) that consumer
refers to, and then... Well, that's how far I got with my hacks =).
So, I don't know if this .partial_sync_state() can even work, but I
think we do need something more on top of the .sync_state().
Thanks for the update!
You certainly have a point, but rather than implementing some platform
specific method, I think we should be able enforce the call to
->sync_state(), based upon some condition/timeout - and even if all
consumers haven't been probed.
Hmm, I think that was already implemented in some of the serieses out
there (or even in mainline already?), as I remember doing some
experiments with it. I don't like it much, though.
With a simple timeout, it'll always be just a bit too early for some
user (nfs mount took a bit more time than expected -> board frozen).
The only condition I can see that would somewhat work is a manual
trigger from the userspace. The boot scripts could then signal the
kernel when all the modules have been loaded and probably a suitable,
platform/use case specific amount of time has passed to allow the
drivers to probe.
This is also already supported in mainline.
Devices with sync_state() implementations (once Ulf adds it) will have
a state_synced file in sysfs. It shows where it has been called yet or
not. But you can also echo 1 into it to force the sync_state()
callback (only if it hasn't been called already). So, yeah, all
methods of handling this are available if you implement the
sync_state() callback.
By default it's all strict (wait till all consumers probe
successfully). But you can set it to timeout (fw_devlink.sync_state).
And you also have the option I mentioned above that you can use with
both cases.
So the idea is to disable unused genpds and clocks from the genpd
resp. clock's driver .sync_state() callback, instead of from a late
initcall? That would indeed solve issues related to "Case A".
However, how to solve "Case B"? Ignore disabling genpds or clocks
before .sync_state() callback() has been called?
That would cause issues for cases where the clock must be disabled,
cfr.
"[PATCH RFC 0/3] Add clk_disable_unprepare_sync()"
https://lore.kernel.org/all/20240131160947.96171-1-biju.das.jz@xxxxxxxxxxxxxx/
"[PATCH v3 0/3] Add clk_poll_disable_unprepare()"
https://lore.kernel.org/linux-renesas-soc/20240318110842.41956-1-biju.das.jz@xxxxxxxxxxxxxx/
It just feels a bit too much of a "let's hope this work" approach.
That said, the timeout/condition is probably acceptable for many cases,
where turning off a resource forcefully will just result in, say, a
temporarily blanked display, or something else that gets fixed if and
when the proper driver is probed.
Unfortunately, here with the case I have, the whole board gets halted if
the display subsystem genpd is turned off and the display driver is
loaded after that.
Tomi: Do you have more details? The genpd must be controlling something
critical that must never be turned off, or perhaps the display driver
lacks some initialization?
I don't know the exact HW level details. It may be a HW bug or possibly
a firmware issue. But what I see is simple:
If the display subsystem is powered and a video output is enabled,
turning off the PD causes the display subsystem to go to a bad state,
after which the next register access will halt the cpu.
This happens quite easily if the bootloader has enabled a display: when
the kernel's tidss driver probes, the device framework will make sure
the PD is enabled (which is fine, it's basically a no-op). But if the
tidss returns an error, like EPROBE_DEFER, the device framework will
disable the PD. When the tidss driver probes again later, it will cause
a halt at a register access.
Tomi