Re: [PATCH v2] PCI: qcom-ep: Enable controller resources like PHY only after refclk is available

Bjorn Helgaas <helgaas@xxxxxxxxxx> · Thu, 29 Aug 2024 07:38:08 -0500

On Thu, Aug 29, 2024 at 11:07:20AM +0530, Manivannan Sadhasivam wrote:
> On Wed, Aug 28, 2024 at 03:59:45PM -0500, Bjorn Helgaas wrote:
> > On Wed, Aug 28, 2024 at 07:31:08PM +0530, Manivannan Sadhasivam wrote:
> > > qcom_pcie_enable_resources() is called by qcom_pcie_ep_probe() and it
> > > enables the controller resources like clocks, regulator, PHY. On one of the
> > > new unreleased Qcom SoCs, PHY enablement depends on an active refclk. And
> > > on all of the supported Qcom endpoint SoCs, refclk comes from the host
> > > (RC). So calling qcom_pcie_enable_resources() without refclk causes the
> > > whole SoC to crash on the new SoC.
> > > 
> > > qcom_pcie_enable_resources() is already called by
> > > qcom_pcie_perst_deassert() when PERST# is deasserted, and refclk is
> > > available at that time.
> > > 
> > > Hence, remove the unnecessary call to qcom_pcie_enable_resources() from
> > > qcom_pcie_ep_probe() to prevent the crash.
> > > 
> > > Fixes: 869bc5253406 ("PCI: dwc: ep: Fix DBI access failure for drivers requiring refclk from host")
> > > Tested-by: Dmitry Baryshkov <dmitry.baryshkov@xxxxxxxxxx>
> > > Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@xxxxxxxxxx>
> > > ---
> > > 
> > > Changes in v2:
> > > 
> > > - Changed the patch description to mention the crash clearly as suggested by
> > >   Bjorn
> > 
> > Clearly mentioning the crash as rationale for the change is *part* of
> > what I was looking for.
> > 
> > The rest, just as important, is information about what sort of crash
> > this is, because I hope and suspect the crash is recoverable, and we
> > *should* recover from it because PERST# may occur at arbitrary times,
> > so trying to avoid it is never going to be reliable.
> 
> I did mention 'whole SoC crash', which typically means an unrecoverable
> state, as the SoC would crash (not just the driver). On Qcom SoCs,
> this will also cause the SoC to boot into EDL (Emergency Download)
> mode so that users can collect dumps of the crash.

IIUC we're talking about an access to a PHY register, and the access
requires Refclk from the host.  I assume the SoC accesses the register
by doing an MMIO load.  If nothing responds, I assume the SoC would
take a machine check or similar because there's no data to complete
the load instruction.  So I assume, again, that Linux on the SoC
doesn't know how to recover from such a machine check?  If that's the
scenario, is the machine check unrecoverable in principle, or is it
potentially recoverable but nobody has done the work yet?  My
guess would be the latter, because the former would mean that it's
impossible to build a robust endpoint around this SoC.  But obviously
this is all complete speculation on my part.
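
To make the ordering in the patch description concrete, here is a
minimal user-space sketch of it.  This is only a model under stated
assumptions, not the qcom-ep driver code: struct ep_model and the
model_*() helpers are hypothetical names, and the "crashed" flag merely
stands in for the whole-SoC crash described above.

```c
#include <stdbool.h>

/* Hypothetical user-space model; NOT the real qcom-ep driver code. */
struct ep_model {
	bool refclk_present;    /* refclk driven by the host (RC) */
	bool resources_enabled; /* clocks, regulator, PHY are up */
	bool crashed;           /* stands in for the whole-SoC crash */
};

/*
 * Models qcom_pcie_enable_resources(): on the new SoC, enabling the
 * PHY without an active refclk is assumed to be fatal.
 */
static int model_enable_resources(struct ep_model *ep)
{
	if (!ep->refclk_present) {
		ep->crashed = true;
		return -1;
	}
	ep->resources_enabled = true;
	return 0;
}

/* Models probe after the patch: no resource enablement happens here. */
static int model_probe(struct ep_model *ep)
{
	ep->resources_enabled = false;
	ep->crashed = false;
	return 0;
}

/*
 * Models PERST# deassert: per the patch description, refclk is
 * available at this point, so enabling resources is safe.
 */
static int model_perst_deassert(struct ep_model *ep)
{
	ep->refclk_present = true;
	return model_enable_resources(ep);
}
```

In this model, probe followed by PERST# deassert never touches the PHY
without refclk.  It says nothing about PERST# or refclk changing state
later, which is the case I'm asking about.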

Bjorn