Re: [PATCH] PCI: qcom-ep: Do not enable resources during probe()

Manivannan Sadhasivam <manivannan.sadhasivam@xxxxxxxxxx> · Sat, 24 Aug 2024 22:04:14 +0530

On Sat, Aug 24, 2024 at 11:12:34AM -0500, Bjorn Helgaas wrote:
> On Sat, Aug 24, 2024 at 07:49:46AM +0530, Manivannan Sadhasivam wrote:
> > On Fri, Aug 23, 2024 at 05:04:36PM -0500, Bjorn Helgaas wrote:
> > > On Fri, Aug 23, 2024 at 10:11:33AM +0530, Manivannan Sadhasivam wrote:
> > > > On Thu, Aug 22, 2024 at 12:31:33PM -0500, Bjorn Helgaas wrote:
> > > > > On Thu, Aug 22, 2024 at 09:10:25PM +0530, Manivannan Sadhasivam wrote:
> > > > > > On Thu, Aug 22, 2024 at 10:16:58AM -0500, Bjorn Helgaas wrote:
> > > > > > > On Thu, Aug 22, 2024 at 12:18:23PM +0530, Manivannan Sadhasivam wrote:
> > > > > > > > On Wed, Aug 21, 2024 at 05:56:18PM -0500, Bjorn Helgaas wrote:
> > > > > > > > ...
> > > > > > > 
> > > > > > > > > Although I do have the question of what happens if the RC deasserts
> > > > > > > > > PERST# before qcom-ep is loaded.  We probably don't execute
> > > > > > > > > qcom_pcie_perst_deassert() in that case, so how does the init happen?
> > > > > > > > 
> > > > > > > > PERST# is a level trigger signal. So even if the host has asserted
> > > > > > > > it before EP booted, the level will stay low and ep will detect it
> > > > > > > > while booting.
> > > > > > > 
> > > > > > > The PERST# signal itself is definitely level oriented.
> > > > > > > 
> > > > > > > I'm still skeptical about the *interrupt* from the PCIe controller
> > > > > > > being level-triggered, as I mentioned here:
> > > > > > > https://lore.kernel.org/r/20240815224735.GA57931@bhelgaas
> > > > > > 
> > > > > > Sorry, that comment got buried into my inbox. So didn't get a chance
> > > > > > to respond.
> > > > > > 
> > > > > > > tegra194 is also dwc-based and has a similar PERST# interrupt but
> > > > > > > it's edge-triggered (tegra_pcie_ep_pex_rst_irq()), which I think
> > > > > > > is a cleaner implementation.  Then you don't have to remember the
> > > > > > > current state, switch between high and low trigger, worry about
> > > > > > > races and missing a pulse, etc.
> > > > > > 
> > > > > > I did try to mimic what tegra194 did when I wrote the qcom-ep
> > > > > > driver, but it didn't work. If we use the level triggered interrupt
> > > > > > as edge, the interrupt will be missed if we do not listen at the
> > > > > > right time (when PERST# goes from high to low and vice versa).
> > > > > > 
> > > > > > I don't know how tegra194 interrupt controller is wired up, but IIUC
> > > > > > they will need to boot the endpoint first and then host to catch the
> > > > > > PERST# interrupt.  Otherwise, the endpoint will never see the
> > > > > > interrupt until host toggles it again.
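
To make the level-vs-edge point quoted above concrete, here is a minimal
userspace sketch. It is not driver code, and every name in it is hypothetical.
It assumes a listener that only reacts to edges loses any transition that
happened before it started listening, which is the behaviour described above.
It says nothing about how the tegra194 interrupt controller is actually wired.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model: line[t] is the sampled PERST# level at time t,
 * true = high (deasserted), false = low (asserted).
 * 'start' is the first sample the endpoint is listening for.
 */

/* Edge-style listener: reacts only to a low->high transition it witnesses. */
static bool model_edge_sees_deassert(const bool *line, size_t n, size_t start)
{
	for (size_t t = start + 1; t < n; t++)
		if (!line[t - 1] && line[t])
			return true;
	return false;
}

/* Level-style listener: reacts whenever the line is high while listening. */
static bool model_level_sees_deassert(const bool *line, size_t n, size_t start)
{
	for (size_t t = start; t < n; t++)
		if (line[t])
			return true;
	return false;
}
```

With a line that goes high at t = 1, a listener starting at t = 2 sees the
deassert in the level model but not in the edge model. A listener starting at
t = 0 sees it in both.
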
> > > > > 
> > > > > Having to control the boot ordering of endpoint and host is definitely
> > > > > problematic.
> > > > > 
> > > > > What is the nature of the crash when we try to enable the PHY when
> > > > > Refclk is not available?  The endpoint has no control over when the
> > > > > host asserts/deasserts PERST#.  If PERST# happens to be asserted while
> > > > > the endpoint is enabling the PHY, and this causes some kind of crash
> > > > > that the endpoint driver can't easily recover from, that's a serious
> > > > > robustness problem.
> > > > 
> > > > The whole endpoint SoC crashes if the refclk is not available during
> > > > phy_power_on() as the PHY driver tries to access some register on Dmitry's
> > > > platform (I did not see this crash on SM8450 SoC though).
> 
> I don't think the nature of this crash has been explained, so I don't
> know whether it's a recoverable situation or not.
> 

I will add this info in the commit message.

> > > > If we keep the enable_resources() during probe() then the race
> > > > condition you observed above could apply. So removing that from
> > > > probe() will also make the race condition go away,
> > > 
> > > Example:
> > > 
> > >   1) host deasserts PERST#
> > >   2) qcom-ep handles PERST# IRQ
> > >   3) qcom_pcie_ep_perst_irq_thread() calls qcom_pcie_perst_deassert()
> > >   4) host asserts PERST#, Refclk no longer valid
> > >   5) qcom_pcie_perst_deassert() calls qcom_pcie_enable_resources()
> > >   6) qcom_pcie_enable_resources() enables PHY
> > > 
> > > I don't see what prevents the PERST# assertion at 4.  It sounds like
> > > the endpoint SoC crashes at 6.
> > 
> > IDK why host would quickly assert the PERST# after deasserting
> > during probe() unless someone intentionally does that from host
> > side.
> 
> I think the host is allowed to assert PERST# at any arbitrary time, so
> an endpoint should be able to handle it no matter when it happens.
> 
> > If that happens then there is a possibility of the endpoint SoC
> > crash, but I'm not sure how we can avoid that.
> > 
> > > But what this patch fixes is a crash occurring in a sane scenario:
> > 
> > 1) Endpoint boots first (no refclk from host)
> > 2) Probe() calls qcom_pcie_enable_resources() --> Crash
> 
> I agree with this, although I think it's more of a band-aid than a
> complete solution.  I don't have access to any SoC or PCIe controller
> docs, so maybe this is a hardware design problem and this is the best
> we can do.
> 

I agree. But AFAIK there is no way the endpoint can avoid this crash unless it
generates its own clock. I did some investigation into SRIS support and was
able to get it working in my local branch.

I will try to upstream that feature for the currently supported Qcom SoCs in
endpoint mode. But Qcom told me that non-SRIS mode is also required by some
customers, so unfortunately we cannot make it the default operating mode.
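
For reference, the "endpoint boots first" scenario quoted above can be modelled
in a few lines of plain C. This is only a sketch: the struct, the model_*
names and the "crashed" flag are all hypothetical and merely stand in for the
driver and the SoC crash. It assumes that enabling the PHY without refclk is
fatal, as described earlier in the thread.

```c
#include <stdbool.h>

/* Hypothetical userspace model, not qcom-ep driver code. */
struct ep_model {
	bool refclk_present;     /* refclk is supplied by the host */
	bool resources_enabled;  /* endpoint clocks/PHY are on */
	bool crashed;            /* stands in for the SoC crash */
};

/* Assumption: touching the PHY without refclk is fatal. */
static void model_enable_resources(struct ep_model *ep)
{
	if (!ep->refclk_present) {
		ep->crashed = true;
		return;
	}
	ep->resources_enabled = true;
}

/* enable_in_probe = true models the old behaviour, false the patched one. */
static void model_probe(struct ep_model *ep, bool enable_in_probe)
{
	if (enable_in_probe)
		model_enable_resources(ep);
}

/* PERST# deassert handler: enable resources if not already done. */
static void model_perst_deassert(struct ep_model *ep)
{
	if (!ep->crashed && !ep->resources_enabled)
		model_enable_resources(ep);
}

/* Endpoint boots first; the host comes up and deasserts PERST# later. */
static struct ep_model model_ep_boots_first(bool enable_in_probe)
{
	struct ep_model ep = { false, false, false };

	model_probe(&ep, enable_in_probe);  /* no refclk from host yet */
	ep.refclk_present = true;           /* host boots and drives refclk */
	model_perst_deassert(&ep);          /* host deasserts PERST# */
	return ep;
}
```

With enable_in_probe = true the model ends up in the crashed state. With false
it ends up with the resources enabled. The model does not cover the case where
the host asserts PERST# again while the deassert handler is still running.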

- Mani

-- 
மணிவண்ணன் சதாசிவம்