Quoting Amit Pundir (2024-08-05 03:43:14) > On Sat, 3 Aug 2024 at 06:29, Stephen Boyd <swboyd@xxxxxxxxxxxx> wrote: > > > > Also please send back the dmesg so we can see what clks are configured > > for at boot time. If they're using TCXO source at boot then they're not > > going to be broken. In which case those clks can keep using the old clk > > ops and we can focus on the ones that aren't sourcing from TCXO. > > Thank your for this debug patch. I thought I narrowed down the > breakage to the clks in drivers/clk/qcom/gcc-sm8550.c, until I ran > into the following kernel panic in ucsi_glink driver in later test > runs. Thanks for the info. These are the clks that aren't sourcing from XO at registration time: gcc_qupv3_wrap1_s7_clk_src with cfg 0x102601 -> parent is gpll0_out_even gcc_ufs_phy_axi_clk_src with cfg 0x103 -> parent is gpll0_out_main gcc_ufs_phy_ice_core_clk_src with cfg 0x503 -> parent is gpll4_out_main gcc_ufs_phy_unipro_core_clk_src with cfg 0x103 -> parent is gpll0_out_main gcc_usb30_prim_master_clk_src with cfg 0x105 -> parent is gpll0_out_main The original patch is going to inform the clk framework that the parent of these clks aren't XO but something like gpll0_out_even, whatever the hardware is configured for. That may cause these PLLs to be turned off earlier than before if, for example, gcc_ufs_phy_axi_clk_src is turned off by a consumer and gcc_usb30_prim_master_clk_src is left enabled at boot. That's why we force park clks at registration time, so that they can't have their parent clk get turned off by some other clk consumer enabling and then disabling a clk that's also parented to the same parent. This same problem exists for RCGs that aren't shared too, but it's particularly bad for shared RCGs because the parent PLLs aren't turned on automatically by the hardware when things like the GDSC do their housekeeping. At least when software is in control we can enable the parent PLL and unstick the RCG that was previously cut off. Can you narrow down the list above to the clk that matters? I guess if USB isn't working then gcc_usb30_prim_master_clk_src is the one that should be changed and nothing else. Although, I noticed that in the first dmesg log you sent the serial console had garbage, and that's likely because the rate changed while the clk was registered. I don't know why the gcc_qupv3_wrap1_s7_clk_src is marked with the shared clk ops. That's confusing to me as I don't expect that to need to be parked for any reasons. Maybe qcom folks can comment there but I'd expect plain rcg2_ops to be used for those clks. Anyway, if you can narrow down to which clk needs to be left untouched it would be helpful. > > [ 7.882923][ T1] init: Loading module /lib/modules/ucsi_glink.ko > with args '' > [ 7.892929][ T92] Unable to handle kernel NULL pointer > dereference at virtual address 0000000000000010 > [ 7.894935][ T1] init: Loaded kernel module /lib/modules/ucsi_glink.ko > [ 7.902670][ T92] user pgtable: 4k pages, 39-bit VAs, pgdp=0000000886218000 > [ 7.902674][ T92] Internal error: Oops: 0000000096000006 [#1] PREEMPT SMP > [ 7.993995][ T64] qcom_pmic_glink pmic-glink: Failed to create > device link (0x180) with a600000.usb > [ 8.078673][ T92] CPU: 7 UID: 0 PID: 92 Comm: kworker/7:2 > Tainted: G S E 6.11.0-rc2-mainline-00001-g4153d980358d > #6 > [ 8.078676][ T92] Tainted: [S]=CPU_OUT_OF_SPEC, [E]=UNSIGNED_MODULE > [ 8.078677][ T92] Hardware name: Qualcomm Technologies, Inc. > SM8550 HDK (DT) > [ 8.078679][ T92] Workqueue: events pmic_glink_ucsi_register [ucsi_glink] > [ 8.078682][ T92] pstate: 63400005 (nZCv daif +PAN -UAO +TCO +DIT > -SSBS BTYPE=--) > [ 8.078684][ T92] pc : pmic_glink_send+0x10/0x2c [pmic_glink] > [ 8.078685][ T92] lr : pmic_glink_ucsi_read+0x84/0x14c [ucsi_glink] > [ 8.078704][ T92] Call trace: > [ 8.078705][ T92] pmic_glink_send+0x10/0x2c [pmic_glink] > [ 8.078706][ T92] pmic_glink_ucsi_read+0x84/0x14c [ucsi_glink] > [ 8.078707][ T92] pmic_glink_ucsi_read_version+0x20/0x30 [ucsi_glink] > [ 8.078708][ T92] ucsi_register+0x28/0x70 > [ 8.078717][ T92] pmic_glink_ucsi_register+0x18/0x28 [ucsi_glink] > [ 8.078718][ T92] process_one_work+0x184/0x2e8 > [ 8.078723][ T92] worker_thread+0x2f0/0x404 > [ 8.078725][ T92] kthread+0x114/0x118 > [ 8.078728][ T92] ret_from_fork+0x10/0x20 > [ 8.078732][ T92] ---[ end trace 0000000000000000 ]--- > [ 8.078734][ T92] Kernel panic - not syncing: Oops: Fatal exception > [ 8.078735][ T92] SMP: stopping secondary CPUs > [ 8.279136][ T92] Kernel Offset: 0x14d9480000 from 0xffffffc080000000 > [ 8.279141][ T92] PHYS_OFFSET: 0x80000000 > [ 8.279143][ T92] CPU features: 0x18,004e0003,80113128,564676af > [ 8.279148][ T92] Memory Limit: none That looks like 'client' is NULL in pmic_glink_send(). The VA of 0x10 is the offset of 'pg' in struct pmic_glink_client. I don't know much about that driver but I'd guess that ucsi_glink has some race condition assigning the client pointer? Oh actually, I see the problem. devm_pmic_glink_register_client() returns a struct pmic_glink_client pointer that's assigned to 'ucsi->client'. And pmic_glink_ucsi_read() uses 'ucsi->client' to call pmic_glink_send(). That pointer is NULL because the workqueue that runs pmic_glink_ucsi_register() must run before devm_pmic_glink_register_client() returns and assigns the client pointer to 'ucsi->client'. This is simply a race. CPU0 CPU1 ---- ---- ucsi->client = NULL; devm_pmic_glink_register_client() client->pdr_notify(client->priv, pg->client_state) pmic_glink_ucsi_pdr_notify() schedule_work(&ucsi->register_work) <schedule away> pmic_glink_ucsi_register() ucsi_register() pmic_glink_ucsi_read_version() pmic_glink_ucsi_read() pmic_glink_ucsi_read() pmic_glink_send(ucsi->client) <client is NULL BAD> ucsi->client = client // Too late! > > I couldn't reproduce this kernel panic on vanilla v6.11-rc2 in 50+ > test runs after that. So I'm assuming that this debug patch may have > triggered it. > Attaching the crashing and working dmesg logs with the debug patch applied. > Sounds like you just need to reboot a bunch! Or add an msleep() in devm_pmic_glink_register_client() after the notify call to open the race window and let the workqueue run.