(adding Nicholas) On 26/08/20 6:38 pm, Heiner Kallweit wrote: > On 26.08.2020 08:07, Chris Packham wrote: >> On 26/08/20 1:48 pm, Chris Packham wrote: >>> On 26/08/20 10:22 am, Chris Packham wrote: >>>> On 25/08/20 7:22 pm, Heiner Kallweit wrote: >>>> >>>> <snip> >>>>> I've been staring at spi-fsl-espi.c for while now and I think I've >>>>>> identified a couple of deficiencies that may or may not be related >>>>>> to my >>>>>> issue. >>>>>> >>>>>> First I think the 'Transfer done but SPIE_DON isn't set' message >>>>>> can be >>>>>> generated spuriously. In fsl_espi_irq() we read the ESPI_SPIE >>>>>> register. >>>>>> We also write back to it to clear the current events. We re-read it in >>>>>> fsl_espi_cpu_irq() and complain when SPIE_DON is not set. But we can >>>>>> naturally end up in that situation if we're doing a large read. >>>>>> Consider >>>>>> the messages for reading a block of data from a spi-nor chip >>>>>> >>>>>> tx = READ_OP + ADDR >>>>>> rx = data >>>>>> >>>>>> We setup the transfer and pump out the tx_buf. The first interrupt >>>>>> goes >>>>>> off and ESPI_SPIE has SPIM_DON and SPIM_RXT set. We empty the rx fifo, >>>>>> clear ESPI_SPIE and wait for the next interrupt. The next interrupt >>>>>> fires and this time we have ESPI_SPIE with just SPIM_RXT set. This >>>>>> continues until we've received all the data and we finish with >>>>>> ESPI_SPIE >>>>>> having only SPIM_RXT set. When we re-read it we complain that SPIE_DON >>>>>> isn't set. >>>>>> >>>>>> The other deficiency is that we only get an interrupt when the >>>>>> amount of >>>>>> data in the rx fifo is above FSL_ESPI_RXTHR. If there are fewer than >>>>>> FSL_ESPI_RXTHR left to be received we will never pull them out of >>>>>> the fifo. >>>>>> >>>>> SPIM_DON will trigger an interrupt once the last characters have been >>>>> transferred, and read the remaining characters from the FIFO. >>>> The T2080RM that I have says the following about the DON bit >>>> >>>> "Last character was transmitted. The last character was transmitted >>>> and a new command can be written for the next frame." >>>> >>>> That does at least seem to fit with my assertion that it's all about >>>> the TX direction. But the fact that it doesn't happen all the time >>>> throws some doubt on it. >>>> >>>>> I think the reason I'm seeing some variability is because of how fast >>>>>> (or slow) the interrupts get processed and how fast the spi-nor >>>>>> chip can >>>>>> fill the CPUs rx fifo. >>>>>> >>>>> To rule out timing issues at high bus frequencies I initially asked >>>>> for re-testing at lower frequencies. If you e.g. limit the bus to 1 MHz >>>>> or even less, then timing shouldn't be an issue. >>>> Yes I've currently got spi-max-frequency = <1000000>; in my dts. I >>>> would also expect a slower frequency would fit my "DON is for TX" >>>> narrative. >>>>> Last relevant functional changes have been done almost 4 years ago. >>>>> And yours is the first such report I see. So question is what could >>>>> be so >>>>> special with your setup that it seems you're the only one being >>>>> affected. >>>>> The scenarios you describe are standard, therefore much more people >>>>> should be affected in case of a driver bug. >>>> Agreed. But even on my hardware (which may have a latent issue >>>> despite being in the field for going on 5 years) the issue only >>>> triggers under some fairly specific circumstances. >>>>> You said that kernel config impacts how frequently the issue happens. >>>>> Therefore question is what's the diff in kernel config, and how could >>>>> the differences be related to SPI. >>>> It did seem to be somewhat random. Things like CONFIG_PREEMPT have an >>>> impact but every time I found something that seemed to be having an >>>> impact I've been able to disprove it. I actually think its about how >>>> busy the system is which may or may not affect when we get round to >>>> processing the interrupts. >>>> >>>> I have managed to get the 'Transfer done but SPIE_DON isn't set!' to >>>> occur on the T2080RDB. >>>> >>>> I've had to add the following to expose the environment as a mtd >>>> partition >>>> >>>> diff --git a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi >>>> b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi >>>> index ff87e67c70da..fbf95fc1fd68 100644 >>>> --- a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi >>>> +++ b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi >>>> @@ -116,6 +116,15 @@ flash@0 { >>>> compatible = "micron,n25q512ax3", >>>> "jedec,spi-nor"; >>>> reg = <0>; >>>> spi-max-frequency = <10000000>; /* >>>> input clock */ >>>> + >>>> + partition@u-boot { >>>> + reg = <0x00000000 0x00100000>; >>>> + label = "u-boot"; >>>> + }; >>>> + partition@u-boot-env { >>>> + reg = <0x00100000 0x00010000>; >>>> + label = "u-boot-env"; >>>> + }; >>>> }; >>>> }; >>>> >>>> And I'm using the following script to poke at the environment >>>> (warning if anyone does try this and the bug hits it can render your >>>> u-boot environment invalid). >>>> >>>> cat flash/fw_env_test.sh >>>> #!/bin/sh >>>> >>>> generate_fw_env_config() >>>> { >>>> cat /proc/mtd | sed 's/[:"]//g' | while read dev size erasesize >>>> name ; do >>>> echo "$dev $size $erasesize $name" >>>> [ "$name" = "u-boot-env" ] && echo "/dev/$dev 0x0000 0x2000 >>>> $erasesize" >/flash/fw_env.config >>>> done >>>> } >>>> >>>> cycles=10 >>>> [ $# -ge 1 ] && cycles=$1 >>>> >>>> generate_fw_env_config >>>> >>>> fw_printenv -c /flash/fw_env.config >>>> >>>> dmesg -c >/dev/null >>>> x=0 >>>> while [ $x -lt $cycles ]; do >>>> fw_printenv -c /flash/fw_env.config >/dev/null || break >>>> fw_setenv -c /flash/fw_env.config foo $RANDOM || break; >>>> dmesg -c | grep -q fsl_espi && break; >>>> let x=x+1 >>>> done >>>> >>>> echo "Ran $x cycles" >>> I've also now seen the RX FIFO not empty error on the T2080RDB >>> >>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! >>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! >>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! >>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! >>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty! >>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32 >>> >>> With my current workaround of emptying the RX FIFO. It seems >>> survivable. Interestingly it only ever seems to be 1 extra byte in the >>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR. >>> >>> fsl_espi ffe110000.spi: tx 70 >>> fsl_espi ffe110000.spi: rx 03 >>> fsl_espi ffe110000.spi: Extra RX 00 >>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! >>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty! >>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32 >>> fsl_espi ffe110000.spi: tx 05 >>> fsl_espi ffe110000.spi: rx 00 >>> fsl_espi ffe110000.spi: Extra RX 03 >>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! >>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty! >>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32 >>> fsl_espi ffe110000.spi: tx 05 >>> fsl_espi ffe110000.spi: rx 00 >>> fsl_espi ffe110000.spi: Extra RX 03 >>> >>> From all the Micron SPI-NOR datasheets I've got access to it is >>> possible to continually read the SR/FSR. But I've no idea why it >>> happens some times and not others. >> So I think I've got a reproduction and I think I've bisected the problem >> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in >> C"). My day is just finishing now so I haven't applied too much scrutiny >> to this result. Given the various rabbit holes I've been down on this >> issue already I'd take this information with a good degree of skepticism. >> > OK, so an easy test should be to re-test with a 5.4 kernel. > It doesn't have yet the change you're referring to, and the fsl-espi driver > is basically the same as in 5.7 (just two small changes in 5.7). Cherry-picking 3282a3da25bd onto 5.4.x I'm able to reproduce the problem on the T2080RDB (if anyone want's it I've got a slightly different version of my test above that doesn't require touching the dts). I'm not really sure how 3282a3da25bd affects this. It could be exposing a latent issue, the fsl-espi interface does seem a bit sensitive to the timing of events.