Re: [PATCH] clk: qcom: Park shared RCGs upon registration

Amit Pundir <amit.pundir@xxxxxxxxxx> · Mon, 12 Aug 2024 15:09:47 +0530

On Tue, 6 Aug 2024 at 03:07, Stephen Boyd <swboyd@xxxxxxxxxxxx> wrote:
>
> Quoting Amit Pundir (2024-08-05 03:43:14)
> > On Sat, 3 Aug 2024 at 06:29, Stephen Boyd <swboyd@xxxxxxxxxxxx> wrote:
> > >
> > > Also please send back the dmesg so we can see what clks are configured
> > > for at boot time. If they're using TCXO source at boot then they're not
> > > going to be broken. In which case those clks can keep using the old clk
> > > ops and we can focus on the ones that aren't sourcing from TCXO.
> >
> > Thank your for this debug patch. I thought I narrowed down the
> > breakage to the clks in drivers/clk/qcom/gcc-sm8550.c, until I ran
> > into the following kernel panic in ucsi_glink driver in later test
> > runs.
>
> Thanks for the info. These are the clks that aren't sourcing from XO
> at registration time:
>
>   gcc_qupv3_wrap1_s7_clk_src with cfg 0x102601 -> parent is gpll0_out_even
>   gcc_ufs_phy_axi_clk_src with cfg 0x103 -> parent is gpll0_out_main
>   gcc_ufs_phy_ice_core_clk_src with cfg 0x503 -> parent is gpll4_out_main
>   gcc_ufs_phy_unipro_core_clk_src with cfg 0x103 -> parent is gpll0_out_main
>   gcc_usb30_prim_master_clk_src with cfg 0x105 -> parent is gpll0_out_main
>
> The original patch is going to inform the clk framework that the parent
> of these clks aren't XO but something like gpll0_out_even, whatever the
> hardware is configured for. That may cause these PLLs to be turned off
> earlier than before if, for example, gcc_ufs_phy_axi_clk_src is turned
> off by a consumer and gcc_usb30_prim_master_clk_src is left enabled at
> boot. That's why we force park clks at registration time, so that they
> can't have their parent clk get turned off by some other clk consumer
> enabling and then disabling a clk that's also parented to the same
> parent.
>
> This same problem exists for RCGs that aren't shared too, but it's
> particularly bad for shared RCGs because the parent PLLs aren't turned
> on automatically by the hardware when things like the GDSC do their
> housekeeping. At least when software is in control we can enable the
> parent PLL and unstick the RCG that was previously cut off.

Thank you Stephen for the detailed explanation. Really appreciate this
knowledge dump.

>
> Can you narrow down the list above to the clk that matters? I guess if
> USB isn't working then gcc_usb30_prim_master_clk_src is the one that
> should be changed and nothing else. Although, I noticed that in the
> first dmesg log you sent the serial console had garbage, and that's
> likely because the rate changed while the clk was registered. I don't
> know why the gcc_qupv3_wrap1_s7_clk_src is marked with the shared clk
> ops. That's confusing to me as I don't expect that to need to be parked
> for any reasons. Maybe qcom folks can comment there but I'd expect plain
> rcg2_ops to be used for those clks. Anyway, if you can narrow down to
> which clk needs to be left untouched it would be helpful.

gcc_qupv3_wrap1_s7_clk_src and gcc_usb30_prim_master_clk_src need to
be left untouched to fix the Audio codec and USB-C host mode breakages
respectively. It seem to have fixed the serial console garbage dump
issue as well.

UFS on SM8550-HDK has it's own issues when booting AOSP from it, so I
couldn't test/verify if the reported gcc_ufs_phy_*_clk_src clk changes
make any difference. But I'll bookmark this change for future
reference if I run into any relevant UFS probe deferrals once we fix
the existing/on-going issues.

>
> >
> > [    7.882923][    T1] init: Loading module /lib/modules/ucsi_glink.ko
> > with args ''
> > [    7.892929][   T92] Unable to handle kernel NULL pointer
> > dereference at virtual address 0000000000000010
> > [    7.894935][    T1] init: Loaded kernel module /lib/modules/ucsi_glink.ko
> > [    7.902670][   T92] user pgtable: 4k pages, 39-bit VAs, pgdp=0000000886218000
> > [    7.902674][   T92] Internal error: Oops: 0000000096000006 [#1] PREEMPT SMP
> > [    7.993995][   T64] qcom_pmic_glink pmic-glink: Failed to create
> > device link (0x180) with a600000.usb
> > [    8.078673][   T92] CPU: 7 UID: 0 PID: 92 Comm: kworker/7:2
> > Tainted: G S          E      6.11.0-rc2-mainline-00001-g4153d980358d
> > #6
> > [    8.078676][   T92] Tainted: [S]=CPU_OUT_OF_SPEC, [E]=UNSIGNED_MODULE
> > [    8.078677][   T92] Hardware name: Qualcomm Technologies, Inc.
> > SM8550 HDK (DT)
> > [    8.078679][   T92] Workqueue: events pmic_glink_ucsi_register [ucsi_glink]
> > [    8.078682][   T92] pstate: 63400005 (nZCv daif +PAN -UAO +TCO +DIT
> > -SSBS BTYPE=--)
> > [    8.078684][   T92] pc : pmic_glink_send+0x10/0x2c [pmic_glink]
> > [    8.078685][   T92] lr : pmic_glink_ucsi_read+0x84/0x14c [ucsi_glink]
> > [    8.078704][   T92] Call trace:
> > [    8.078705][   T92]  pmic_glink_send+0x10/0x2c [pmic_glink]
> > [    8.078706][   T92]  pmic_glink_ucsi_read+0x84/0x14c [ucsi_glink]
> > [    8.078707][   T92]  pmic_glink_ucsi_read_version+0x20/0x30 [ucsi_glink]
> > [    8.078708][   T92]  ucsi_register+0x28/0x70
> > [    8.078717][   T92]  pmic_glink_ucsi_register+0x18/0x28 [ucsi_glink]
> > [    8.078718][   T92]  process_one_work+0x184/0x2e8
> > [    8.078723][   T92]  worker_thread+0x2f0/0x404
> > [    8.078725][   T92]  kthread+0x114/0x118
> > [    8.078728][   T92]  ret_from_fork+0x10/0x20
> > [    8.078732][   T92] ---[ end trace 0000000000000000 ]---
> > [    8.078734][   T92] Kernel panic - not syncing: Oops: Fatal exception
> > [    8.078735][   T92] SMP: stopping secondary CPUs
> > [    8.279136][   T92] Kernel Offset: 0x14d9480000 from 0xffffffc080000000
> > [    8.279141][   T92] PHYS_OFFSET: 0x80000000
> > [    8.279143][   T92] CPU features: 0x18,004e0003,80113128,564676af
> > [    8.279148][   T92] Memory Limit: none
>
> That looks like 'client' is NULL in pmic_glink_send(). The VA of 0x10 is
> the offset of 'pg' in struct pmic_glink_client. I don't know much about
> that driver but I'd guess that ucsi_glink has some race condition
> assigning the client pointer?
>
> Oh actually, I see the problem. devm_pmic_glink_register_client()
> returns a struct pmic_glink_client pointer that's assigned to
> 'ucsi->client'. And pmic_glink_ucsi_read() uses 'ucsi->client' to call
> pmic_glink_send(). That pointer is NULL because the workqueue that runs
> pmic_glink_ucsi_register() must run before
> devm_pmic_glink_register_client() returns and assigns the client pointer
> to 'ucsi->client'. This is simply a race.
>
>  CPU0                                        CPU1
>  ----                                        ----
>  ucsi->client = NULL;
>  devm_pmic_glink_register_client()
>   client->pdr_notify(client->priv, pg->client_state)
>    pmic_glink_ucsi_pdr_notify()
>     schedule_work(&ucsi->register_work)
>     <schedule away>
>                                              pmic_glink_ucsi_register()
>                                               ucsi_register()
>                                                pmic_glink_ucsi_read_version()
>                                                 pmic_glink_ucsi_read()
>                                                  pmic_glink_ucsi_read()
>                                                   pmic_glink_send(ucsi->client)
>                                                   <client is NULL BAD>
>  ucsi->client = client // Too late!
>
> >
> > I couldn't reproduce this kernel panic on vanilla v6.11-rc2 in 50+
> > test runs after that. So I'm assuming that this debug patch may have
> > triggered it.
> > Attaching the crashing and working dmesg logs with the debug patch applied.
> >
>
> Sounds like you just need to reboot a bunch! Or add an msleep() in
> devm_pmic_glink_register_client() after the notify call to open the race
> window and let the workqueue run.

Thank you for diagnosing this race in ucsi_glink. I needed to run an
overnight reboot test to reproduce this crash, and could reproduce it
on ~380th reboot. I'll check if it has already been reported or fixed
on linux-next.

Regards,
Amit Pundir