Hi, On Tue, Oct 10, 2017 at 8:33 AM, Shawn N <shawnn@xxxxxxxxxxxx> wrote: >> diff --git a/arch/arm/boot/dts/tegra124-nyan.dtsi b/arch/arm/boot/dts/tegra124-nyan.dtsi >> index 5cf987b5401e..0baa6bfc0f36 100644 >> --- a/arch/arm/boot/dts/tegra124-nyan.dtsi >> +++ b/arch/arm/boot/dts/tegra124-nyan.dtsi >> @@ -317,6 +317,7 @@ >> interrupts = <TEGRA_GPIO(C, 7) IRQ_TYPE_LEVEL_LOW>; >> reg = <0>; >> >> + google,cros-ec-spi-pre-delay = <10>; >> google,cros-ec-spi-msg-delay = <2000>; >> >> i2c-tunnel { >> >> I have tried 50 boots with the above and I have seen no SPI failures >> on boot. >> >> I did look to see if it is possible to probe the SPI signals with a >> scope but from the schematics I am not sure if they are accessible or >> buried in the PCB. >> >> Is it possible that Tegra is sending the SPI message too soon for the >> EC? > > I have not worked much with the cros_ec stm32 SPI host interface, but > I think it's possible. Also note that we define > "google,cros-ec-spi-pre-delay = <30>;" for the veyron family of > devices, which also use an stm32-family embedded controller. Some of > the folks on cc worked more on veyron and may have more insight. Alex is the expert here, but basically we enabled hibernation on the EC for veyron so it needed a bunch of extra time to wakeup. Before hibernation there was no known case where the delay was needed as far as I understand. I don't think we had hibernation on tegra. It's _possible_ that we sometimes need a delay even on tegra, but I'm not sure. It's also possible that the extra delay just shifts things around a little bit and changes the timing. Possibly at boot the system is always extra busy and adding the 10 us delay is enough to "reliably" get your transfer to happen at a non-busy time. It would be interesting to see of a 10us delay _before_ asserting chip select fixed things just as well. If a delay before asserting chip select fixes things as well as a delay after asserting chip select then you're probably just pushing things around and not doing a real fix. One issue I _know_ we have is that we sometimes drop EC packets on the floor because we get interrupted midway through the transfer and the EC gives up on us before we get context switched back in. I think this is more common at suspend/resume and at boot when the system is typically very busy. That case can actually be made _worse_ by the "cros-ec-spi-pre-delay", as described in <https://chromium-review.googlesource.com/439713>. Even when we removed delay on rk3399-gru-* we could still sometimes see transfer errors and the agreed upon "fix" for this is to do the transfer in a higher priority context. Some details are in the (unfortunately private) <https://issuetracker.google.com/35579351>, but to copy the important bits from the bug:: >From Mark Brown (upstream SPI maintainer) as long as there is no contention on this SPI bus (there isn't for us) the transfer will happen in the context of the calling process. That means "bumping" the priority would work. ...but as per Brian bumping the priority is ugly and we need to make sure we don't conflict with userspace. One solution (inspired by Guenter) is to just _always_ use a high priority worker to do the transfer and just block waiting for the worker to finish. This should be easy, reliable, and clean even if it is slightly awkward. This also shouldn't be too hard to do, we think. NOTE: a really truly long term fix for this is to change the protocol to have reliably retries. That's planned but (as far as I can tell) stalled. Details in the bug <https://bugs.chromium.org/p/chromium/issues/detail?id=678675> > I'm still not clear on why we see an error only on the first > transaction after boot. In this case, the embedded controller > previously handled host commands from firmware just fine, and the > handoff between firmware and the kernel shouldn't be significant, from > the EC point of view. If host command timing is consistent from the > master, I would expect to see some constant error rate, eg. some > chance any host command will fail, rather than the first host command > always failing. The AP itself is often quite busy at boot and so the timings for everything change. That could easily explain the problems. -Doug -- To unsubscribe from this list: send the line "unsubscribe linux-tegra" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html