On Fri, Jun 02, 2023 at 03:33:21PM -0400, Lucas Karpinski wrote:
> This reverts commit 2eb4cdcd5aba2db83f2111de1242721eeb659f71.
>
> The patch introduced a sporadic error where the Qdrive3 will fail to
> boot occasionally due to an rcu preempt stall.
> Qualcomm has disabled pcie2a downstream:
> https://git.codelinaro.org/clo/la/platform/vendor/qcom-opensource/rh-patch/-/commit/447f2135909683d1385af36f95fae5e1d63a7e2f
>
> rcu: INFO: rcu_preempt self-detected stall on CPU
> rcu:     0-....: (1 GPs behind) idle=77fc/1/0x4000000000000004 softirq=841/841 fqs=2476
> rcu:     (t=5253 jiffies g=-175 q=2552 ncpus=8)
> Call trace:
>  __do_softirq
>  ____do_softirq
>  call_on_irq_stack
>  do_softirq_own_stack
>  __irq_exit_rcu
>  irq_exit_rcu
>
> The issue occurs normally once every 3-4 boot cycles.
> There is likely a race condition caused when setting up the two pcie
> domains concurrently (pcie2a and pcie3a).
>
> The issue is not present when only pcie2a is enabled or when only pcie3a
> is enabled.
> A workaround was found that allowed the Qdrive3 to boot with both pcie2a
> and pcie3a enabled.
> Set the .probe_type to PROBE_FORCE_SYNCHRONOUS and add an msleep() to
> the probing function.
> This is not a solution, so this patch is disabling pcie2a as it seems
> Red Hat are the only ones working on the board,
> we're fine with disabling the node until a root cause is found. If
> anyone has further suggestions for debugging, let me know.
>
> Signed-off-by: Lucas Karpinski <lkarpins@xxxxxxxxxx>
> ---
> During debugging:
> - Added additional time for clock/regulator stabilization.
> - Reduced the bandwidth across pcie2a and pcie3a.
> - Replaced the interconnect setup from another driver.
> - The 32-bit/64-bit/config-io space for both pcie2a and pcie3a look to be mapped correctly.
> - Verified interconnects were started successfully.

I was looking at another issue downstream triggering a soft lock on
CPU0, but it turns out this could be the same thing except the symptoms
are less noticeable (the 3-4 boot cycles you mention).

Using next-20230609, if I add a return kprobe on dw_handle_msi_irq:

  echo 'r:dwmsi_probe dw_handle_msi_irq $retval' > /sys/kernel/debug/tracing/kprobe_events
  echo 1 > /sys/kernel/debug/tracing/events/kprobes/dwmsi_probe/enable
  cat /sys/kernel/debug/tracing/trace_pipe
    <idle>-0 [000] d.h1. 690.417268: dwmsi_probe: (dw_chained_msi_isr+0x38/0xb8 <- dw_handle_msi_irq) arg1=0x0
    <idle>-0 [000] d.h1. 690.417272: dwmsi_probe: (dw_chained_msi_isr+0x38/0xb8 <- dw_handle_msi_irq) arg1=0x0
    <idle>-0 [000] d.h1. 690.417276: dwmsi_probe: (dw_chained_msi_isr+0x38/0xb8 <- dw_handle_msi_irq) arg1=0x0
    <idle>-0 [000] d.h1. 690.417281: dwmsi_probe: (dw_chained_msi_isr+0x38/0xb8 <- dw_handle_msi_irq) arg1=0x0
    <idle>-0 [000] d.h1. 690.417284: dwmsi_probe: (dw_chained_msi_isr+0x38/0xb8 <- dw_handle_msi_irq) arg1=0x0
    <idle>-0 [000] d.h1. 690.417288: dwmsi_probe: (dw_chained_msi_isr+0x38/0xb8 <- dw_handle_msi_irq) arg1=0x0
  [...]

dw_handle_msi_irq constantly fires and never returns IRQ_HANDLED. It
happens consistently for pcie2a or pcie3a, after I disable one or the
other. I presume having both might be enough to overwhelm the system
and trigger the stall?

Looking at the handler, the status is always 0 after:

  status = dw_pcie_readl_dbi(pci, PCIE_MSI_INTR0_STATUS +
                             (i * MSI_REG_CTRL_BLOCK_SIZE));

Unfortunately I do not know why that is yet.
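
To put a number on "never returns IRQ_HANDLED", the kretprobe output can
be tallied by return value. The following is only a minimal sketch, not
something that was run as part of this report: the function name
summarize_dwmsi_retvals is made up for illustration, and it assumes the
trace lines have the same layout as the ones pasted above (event name
dwmsi_probe, return value in an arg1= field).

```shell
# Tally the $retval values captured by the dwmsi_probe kretprobe.
# Reads trace lines on stdin and prints one "<retval> <count>" line per
# distinct value. In enum irqreturn, 0x0 is IRQ_NONE and 0x1 is IRQ_HANDLED.
# (summarize_dwmsi_retvals is a hypothetical helper name.)
summarize_dwmsi_retvals() {
    awk '/dwmsi_probe:/ {
             for (i = 1; i <= NF; i++) {
                 if ($i ~ /^arg1=/) {
                     v = $i
                     sub(/^arg1=/, "", v)   # keep only the value
                     n[v]++
                 }
             }
         }
         END { for (v in n) print v, n[v] }' | sort
}

# Possible usage, as root with the probe from above already enabled:
#   timeout 5 cat /sys/kernel/debug/tracing/trace_pipe | summarize_dwmsi_retvals
```

If every sampled call falls in the 0x0 bucket, that matches the
observation that the status register always reads back as 0; any 0x1
entries would mean at least some MSIs are actually being handled.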

>
>  arch/arm64/boot/dts/qcom/sa8540p-ride.dts | 44 -----------------------
>  1 file changed, 44 deletions(-)
>
> diff --git a/arch/arm64/boot/dts/qcom/sa8540p-ride.dts b/arch/arm64/boot/dts/qcom/sa8540p-ride.dts
> index 24fa449d48a6..d492723ccf7c 100644
> --- a/arch/arm64/boot/dts/qcom/sa8540p-ride.dts
> +++ b/arch/arm64/boot/dts/qcom/sa8540p-ride.dts
> @@ -186,27 +186,6 @@ &i2c18 {
>  	status = "okay";
>  };
>
> -&pcie2a {
> -	ranges = <0x01000000 0x0 0x3c200000 0x0 0x3c200000 0x0 0x100000>,
> -		 <0x02000000 0x0 0x3c300000 0x0 0x3c300000 0x0 0x1d00000>,
> -		 <0x03000000 0x5 0x00000000 0x5 0x00000000 0x1 0x00000000>;
> -
> -	perst-gpios = <&tlmm 143 GPIO_ACTIVE_LOW>;
> -	wake-gpios = <&tlmm 145 GPIO_ACTIVE_HIGH>;
> -
> -	pinctrl-names = "default";
> -	pinctrl-0 = <&pcie2a_default>;
> -
> -	status = "okay";
> -};
> -
> -&pcie2a_phy {
> -	vdda-phy-supply = <&vreg_l11a>;
> -	vdda-pll-supply = <&vreg_l3a>;
> -
> -	status = "okay";
> -};
> -
>  &pcie3a {
>  	ranges = <0x01000000 0x0 0x40200000 0x0 0x40200000 0x0 0x100000>,
>  		 <0x02000000 0x0 0x40300000 0x0 0x40300000 0x0 0x20000000>,
> @@ -356,29 +335,6 @@ i2c18_default: i2c18-default-state {
>  		bias-pull-up;
>  	};
>
> -	pcie2a_default: pcie2a-default-state {
> -		perst-pins {
> -			pins = "gpio143";
> -			function = "gpio";
> -			drive-strength = <2>;
> -			bias-pull-down;
> -		};
> -
> -		clkreq-pins {
> -			pins = "gpio142";
> -			function = "pcie2a_clkreq";
> -			drive-strength = <2>;
> -			bias-pull-up;
> -		};
> -
> -		wake-pins {
> -			pins = "gpio145";
> -			function = "gpio";
> -			drive-strength = <2>;
> -			bias-pull-up;
> -		};
> -	};
> -
>  	pcie3a_default: pcie3a-default-state {
>  		perst-pins {
>  			pins = "gpio151";
> --
> 2.40.1
>

--
Eric Chanudet