Re: [PATCH] clk: qcom: Park shared RCGs upon registration

Stephen Boyd <swboyd@xxxxxxxxxxxx> · Mon, 5 Aug 2024 14:37:16 -0700

Quoting Amit Pundir (2024-08-05 03:43:14)
> On Sat, 3 Aug 2024 at 06:29, Stephen Boyd <swboyd@xxxxxxxxxxxx> wrote:
> >
> > Also please send back the dmesg so we can see what clks are configured
> > for at boot time. If they're using TCXO source at boot then they're not
> > going to be broken. In which case those clks can keep using the old clk
> > ops and we can focus on the ones that aren't sourcing from TCXO.
>
> Thank your for this debug patch. I thought I narrowed down the
> breakage to the clks in drivers/clk/qcom/gcc-sm8550.c, until I ran
> into the following kernel panic in ucsi_glink driver in later test
> runs.

Thanks for the info. These are the clks that aren't sourcing from XO
at registration time:

  gcc_qupv3_wrap1_s7_clk_src with cfg 0x102601 -> parent is gpll0_out_even
  gcc_ufs_phy_axi_clk_src with cfg 0x103 -> parent is gpll0_out_main
  gcc_ufs_phy_ice_core_clk_src with cfg 0x503 -> parent is gpll4_out_main
  gcc_ufs_phy_unipro_core_clk_src with cfg 0x103 -> parent is gpll0_out_main
  gcc_usb30_prim_master_clk_src with cfg 0x105 -> parent is gpll0_out_main

The original patch is going to inform the clk framework that the parent
of these clks aren't XO but something like gpll0_out_even, whatever the
hardware is configured for. That may cause these PLLs to be turned off
earlier than before if, for example, gcc_ufs_phy_axi_clk_src is turned
off by a consumer and gcc_usb30_prim_master_clk_src is left enabled at
boot. That's why we force park clks at registration time, so that they
can't have their parent clk get turned off by some other clk consumer
enabling and then disabling a clk that's also parented to the same
parent.

This same problem exists for RCGs that aren't shared too, but it's
particularly bad for shared RCGs because the parent PLLs aren't turned
on automatically by the hardware when things like the GDSC do their
housekeeping. At least when software is in control we can enable the
parent PLL and unstick the RCG that was previously cut off.

Can you narrow down the list above to the clk that matters? I guess if
USB isn't working then gcc_usb30_prim_master_clk_src is the one that
should be changed and nothing else. Although, I noticed that in the
first dmesg log you sent the serial console had garbage, and that's
likely because the rate changed while the clk was registered. I don't
know why the gcc_qupv3_wrap1_s7_clk_src is marked with the shared clk
ops. That's confusing to me as I don't expect that to need to be parked
for any reasons. Maybe qcom folks can comment there but I'd expect plain
rcg2_ops to be used for those clks. Anyway, if you can narrow down to
which clk needs to be left untouched it would be helpful.

>
> [    7.882923][    T1] init: Loading module /lib/modules/ucsi_glink.ko
> with args ''
> [    7.892929][   T92] Unable to handle kernel NULL pointer
> dereference at virtual address 0000000000000010
> [    7.894935][    T1] init: Loaded kernel module /lib/modules/ucsi_glink.ko
> [    7.902670][   T92] user pgtable: 4k pages, 39-bit VAs, pgdp=0000000886218000
> [    7.902674][   T92] Internal error: Oops: 0000000096000006 [#1] PREEMPT SMP
> [    7.993995][   T64] qcom_pmic_glink pmic-glink: Failed to create
> device link (0x180) with a600000.usb
> [    8.078673][   T92] CPU: 7 UID: 0 PID: 92 Comm: kworker/7:2
> Tainted: G S          E      6.11.0-rc2-mainline-00001-g4153d980358d
> #6
> [    8.078676][   T92] Tainted: [S]=CPU_OUT_OF_SPEC, [E]=UNSIGNED_MODULE
> [    8.078677][   T92] Hardware name: Qualcomm Technologies, Inc.
> SM8550 HDK (DT)
> [    8.078679][   T92] Workqueue: events pmic_glink_ucsi_register [ucsi_glink]
> [    8.078682][   T92] pstate: 63400005 (nZCv daif +PAN -UAO +TCO +DIT
> -SSBS BTYPE=--)
> [    8.078684][   T92] pc : pmic_glink_send+0x10/0x2c [pmic_glink]
> [    8.078685][   T92] lr : pmic_glink_ucsi_read+0x84/0x14c [ucsi_glink]
> [    8.078704][   T92] Call trace:
> [    8.078705][   T92]  pmic_glink_send+0x10/0x2c [pmic_glink]
> [    8.078706][   T92]  pmic_glink_ucsi_read+0x84/0x14c [ucsi_glink]
> [    8.078707][   T92]  pmic_glink_ucsi_read_version+0x20/0x30 [ucsi_glink]
> [    8.078708][   T92]  ucsi_register+0x28/0x70
> [    8.078717][   T92]  pmic_glink_ucsi_register+0x18/0x28 [ucsi_glink]
> [    8.078718][   T92]  process_one_work+0x184/0x2e8
> [    8.078723][   T92]  worker_thread+0x2f0/0x404
> [    8.078725][   T92]  kthread+0x114/0x118
> [    8.078728][   T92]  ret_from_fork+0x10/0x20
> [    8.078732][   T92] ---[ end trace 0000000000000000 ]---
> [    8.078734][   T92] Kernel panic - not syncing: Oops: Fatal exception
> [    8.078735][   T92] SMP: stopping secondary CPUs
> [    8.279136][   T92] Kernel Offset: 0x14d9480000 from 0xffffffc080000000
> [    8.279141][   T92] PHYS_OFFSET: 0x80000000
> [    8.279143][   T92] CPU features: 0x18,004e0003,80113128,564676af
> [    8.279148][   T92] Memory Limit: none

That looks like 'client' is NULL in pmic_glink_send(). The VA of 0x10 is
the offset of 'pg' in struct pmic_glink_client. I don't know much about
that driver but I'd guess that ucsi_glink has some race condition
assigning the client pointer?

Oh actually, I see the problem. devm_pmic_glink_register_client()
returns a struct pmic_glink_client pointer that's assigned to
'ucsi->client'. And pmic_glink_ucsi_read() uses 'ucsi->client' to call
pmic_glink_send(). That pointer is NULL because the workqueue that runs
pmic_glink_ucsi_register() must run before
devm_pmic_glink_register_client() returns and assigns the client pointer
to 'ucsi->client'. This is simply a race.

 CPU0                                        CPU1
 ----                                        ----
 ucsi->client = NULL;
 devm_pmic_glink_register_client()
  client->pdr_notify(client->priv, pg->client_state)
   pmic_glink_ucsi_pdr_notify()
    schedule_work(&ucsi->register_work)
    <schedule away>
                                             pmic_glink_ucsi_register()
                                              ucsi_register()
                                               pmic_glink_ucsi_read_version()
                                                pmic_glink_ucsi_read()
                                                 pmic_glink_ucsi_read()
                                                  pmic_glink_send(ucsi->client)
                                                  <client is NULL BAD>
 ucsi->client = client // Too late!

>
> I couldn't reproduce this kernel panic on vanilla v6.11-rc2 in 50+
> test runs after that. So I'm assuming that this debug patch may have
> triggered it.
> Attaching the crashing and working dmesg logs with the debug patch applied.
>

Sounds like you just need to reboot a bunch! Or add an msleep() in
devm_pmic_glink_register_client() after the notify call to open the race
window and let the workqueue run.