On 26/09/17 00:15, Shawn N wrote: > On Wed, Sep 20, 2017 at 1:22 PM, Shawn N <shawnn@xxxxxxxxxx> wrote: >> On Tue, Sep 19, 2017 at 11:13 PM, Brian Norris <briannorris@xxxxxxxxxxxx> wrote: >>> Hi, >>> >>> On Tue, Sep 19, 2017 at 11:05:38PM -0700, Shawn N wrote: >>>> This is failing because our EC_CMD_GET_PROTOCOL_INFO host command is >>>> getting messed up, or the reply buffer is getting corrupted somehow. >>>> >>>> ec_dev->proto_version = >>>> min(EC_HOST_REQUEST_VERSION, >>>> fls(proto_info->protocol_versions) - 1); >>>> >> >> Checking this closer, the first host command we send after we boot the >> kernel (EC_CMD_GET_PROTOCOL_INFO) is failing due to protocol error >> (see 'SPI rx bad data' / 'SPI not ready' on the EC console). Since >> this doesn't seem to happen on the Chromium OS nyan_big release >> kernel, I suggest to hook up a logic analyzer and see if the SPI >> master is doing something bad. >> >> The error handling in cros_ec_cmd_xfer_spi() is completely wrong and >> we return -EAGAIN / EC_RES_IN_PROGRESS, which the caller interprets >> "the host command was received by the EC and is currently being >> handled, poll status until completion". So the caller polls status >> with EC_CMD_GET_COMMS_STATUS, sees no host command is in progress >> (which is interpreted to mean "the host command I sent previously has >> now successfully completed"), and returns success. The problem here is >> that the initial host command was never received at all, and no reply >> was ever received, so our reply data is all zero. >> >> Two things need to be fixed here: >> >> 1) Find out why the first host command after boot is failing. Probe >> SPI pins and see what's going on. >> 2) Fix error handling so we properly return an error (or properly >> retry the entire command) when a protocol error occurs (I made some >> attempt in https://chromium-review.googlesource.com/385080/, probably >> I should revisit that). > > The below patch will fix error handling and will make things mostly > work on nyan_big, because we'll fall back to V2 protocol after the > initial failure. But we should still investigate why we're getting > errors on the first host command. We aren't seeing these errors when > we send commands from firmware, so I suspect something is wrong in > kernel SPI HW initialization that causes the first command to fail. I have been looking into this a bit more and it appears to be timing related. I found that enabling some debug in the Tegra SPI driver the problem would go away and seems that adding a delay before sending the SPI message would also workaround the problem. Looking back at the Tegra Linux test history [0], it appears that after v4.10-rc5 [1] I start to see the following message which is the first indication of some SPI issues ... cros-ec-spi spi32766.0: packet too long (249 bytes, expected 4) I attempted to bisect this, but I was not successful because each time I would ended up somewhere different. I found that even with v4.9 I may see the issue 1 in 20 times and so I realised that I am not even sure when the problem really started or if has always been there. It seems that for older kernels it is harder to reproduce. I am wondering now if some timing has changed somewhere causing us to see the problem more frequently with newer kernels? I see the the cros-ec binding defines the following ... "google,cros-ec-spi-pre-delay: Some implementations of the EC need a little time to wake up from sleep before they can receive SPI transfers at a high clock rate. This property specifies the delay, in usecs, between the assertion of the CS to the start of the first clock pulse." I found that adding the following also worked around the problem ... diff --git a/arch/arm/boot/dts/tegra124-nyan.dtsi b/arch/arm/boot/dts/tegra124-nyan.dtsi index 5cf987b5401e..0baa6bfc0f36 100644 --- a/arch/arm/boot/dts/tegra124-nyan.dtsi +++ b/arch/arm/boot/dts/tegra124-nyan.dtsi @@ -317,6 +317,7 @@ interrupts = <TEGRA_GPIO(C, 7) IRQ_TYPE_LEVEL_LOW>; reg = <0>; + google,cros-ec-spi-pre-delay = <10>; google,cros-ec-spi-msg-delay = <2000>; i2c-tunnel { I have tried 50 boots with the above and I have seen no SPI failures on boot. I did look to see if it is possible to probe the SPI signals with a scope but from the schematics I am not sure if they are accessible or buried in the PCB. Is it possible that Tegra is sending the SPI message too soon for the EC? Cheers Jon [0] https://nvtb.github.io/linux/ [1] https://nvtb.github.io/linux/test_v4.10-rc5/20170123023102/boot/tegra124-nyan-big/tegra124-nyan-big/tegra_defconfig_log.txt -- nvpublic -- To unsubscribe from this list: send the line "unsubscribe linux-tegra" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html