On 26/08/20 1:48 pm, Chris Packham wrote: > > On 26/08/20 10:22 am, Chris Packham wrote: >> On 25/08/20 7:22 pm, Heiner Kallweit wrote: >> >> <snip> >>> I've been staring at spi-fsl-espi.c for while now and I think I've >>>> identified a couple of deficiencies that may or may not be related >>>> to my >>>> issue. >>>> >>>> First I think the 'Transfer done but SPIE_DON isn't set' message >>>> can be >>>> generated spuriously. In fsl_espi_irq() we read the ESPI_SPIE >>>> register. >>>> We also write back to it to clear the current events. We re-read it in >>>> fsl_espi_cpu_irq() and complain when SPIE_DON is not set. But we can >>>> naturally end up in that situation if we're doing a large read. >>>> Consider >>>> the messages for reading a block of data from a spi-nor chip >>>> >>>> tx = READ_OP + ADDR >>>> rx = data >>>> >>>> We setup the transfer and pump out the tx_buf. The first interrupt >>>> goes >>>> off and ESPI_SPIE has SPIM_DON and SPIM_RXT set. We empty the rx fifo, >>>> clear ESPI_SPIE and wait for the next interrupt. The next interrupt >>>> fires and this time we have ESPI_SPIE with just SPIM_RXT set. This >>>> continues until we've received all the data and we finish with >>>> ESPI_SPIE >>>> having only SPIM_RXT set. When we re-read it we complain that SPIE_DON >>>> isn't set. >>>> >>>> The other deficiency is that we only get an interrupt when the >>>> amount of >>>> data in the rx fifo is above FSL_ESPI_RXTHR. If there are fewer than >>>> FSL_ESPI_RXTHR left to be received we will never pull them out of >>>> the fifo. >>>> >>> SPIM_DON will trigger an interrupt once the last characters have been >>> transferred, and read the remaining characters from the FIFO. >> >> The T2080RM that I have says the following about the DON bit >> >> "Last character was transmitted. The last character was transmitted >> and a new command can be written for the next frame." >> >> That does at least seem to fit with my assertion that it's all about >> the TX direction. But the fact that it doesn't happen all the time >> throws some doubt on it. >> >>> I think the reason I'm seeing some variability is because of how fast >>>> (or slow) the interrupts get processed and how fast the spi-nor >>>> chip can >>>> fill the CPUs rx fifo. >>>> >>> To rule out timing issues at high bus frequencies I initially asked >>> for re-testing at lower frequencies. If you e.g. limit the bus to 1 MHz >>> or even less, then timing shouldn't be an issue. >> Yes I've currently got spi-max-frequency = <1000000>; in my dts. I >> would also expect a slower frequency would fit my "DON is for TX" >> narrative. >>> Last relevant functional changes have been done almost 4 years ago. >>> And yours is the first such report I see. So question is what could >>> be so >>> special with your setup that it seems you're the only one being >>> affected. >>> The scenarios you describe are standard, therefore much more people >>> should be affected in case of a driver bug. >> Agreed. But even on my hardware (which may have a latent issue >> despite being in the field for going on 5 years) the issue only >> triggers under some fairly specific circumstances. >>> You said that kernel config impacts how frequently the issue happens. >>> Therefore question is what's the diff in kernel config, and how could >>> the differences be related to SPI. >> >> It did seem to be somewhat random. Things like CONFIG_PREEMPT have an >> impact but every time I found something that seemed to be having an >> impact I've been able to disprove it. I actually think its about how >> busy the system is which may or may not affect when we get round to >> processing the interrupts. >> >> I have managed to get the 'Transfer done but SPIE_DON isn't set!' to >> occur on the T2080RDB. >> >> I've had to add the following to expose the environment as a mtd >> partition >> >> diff --git a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi >> b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi >> index ff87e67c70da..fbf95fc1fd68 100644 >> --- a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi >> +++ b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi >> @@ -116,6 +116,15 @@ flash@0 { >> compatible = "micron,n25q512ax3", >> "jedec,spi-nor"; >> reg = <0>; >> spi-max-frequency = <10000000>; /* >> input clock */ >> + >> + partition@u-boot { >> + reg = <0x00000000 0x00100000>; >> + label = "u-boot"; >> + }; >> + partition@u-boot-env { >> + reg = <0x00100000 0x00010000>; >> + label = "u-boot-env"; >> + }; >> }; >> }; >> >> And I'm using the following script to poke at the environment >> (warning if anyone does try this and the bug hits it can render your >> u-boot environment invalid). >> >> cat flash/fw_env_test.sh >> #!/bin/sh >> >> generate_fw_env_config() >> { >> cat /proc/mtd | sed 's/[:"]//g' | while read dev size erasesize >> name ; do >> echo "$dev $size $erasesize $name" >> [ "$name" = "u-boot-env" ] && echo "/dev/$dev 0x0000 0x2000 >> $erasesize" >/flash/fw_env.config >> done >> } >> >> cycles=10 >> [ $# -ge 1 ] && cycles=$1 >> >> generate_fw_env_config >> >> fw_printenv -c /flash/fw_env.config >> >> dmesg -c >/dev/null >> x=0 >> while [ $x -lt $cycles ]; do >> fw_printenv -c /flash/fw_env.config >/dev/null || break >> fw_setenv -c /flash/fw_env.config foo $RANDOM || break; >> dmesg -c | grep -q fsl_espi && break; >> let x=x+1 >> done >> >> echo "Ran $x cycles" > > I've also now seen the RX FIFO not empty error on the T2080RDB > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! > fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty! > fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32 > > With my current workaround of emptying the RX FIFO. It seems > survivable. Interestingly it only ever seems to be 1 extra byte in the > RX FIFO and it seems to be after either a READ_SR or a READ_FSR. > > fsl_espi ffe110000.spi: tx 70 > fsl_espi ffe110000.spi: rx 03 > fsl_espi ffe110000.spi: Extra RX 00 > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! > fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty! > fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32 > fsl_espi ffe110000.spi: tx 05 > fsl_espi ffe110000.spi: rx 00 > fsl_espi ffe110000.spi: Extra RX 03 > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! > fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty! > fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32 > fsl_espi ffe110000.spi: tx 05 > fsl_espi ffe110000.spi: rx 00 > fsl_espi ffe110000.spi: Extra RX 03 > > From all the Micron SPI-NOR datasheets I've got access to it is > possible to continually read the SR/FSR. But I've no idea why it > happens some times and not others. So I think I've got a reproduction and I think I've bisected the problem to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in C"). My day is just finishing now so I haven't applied too much scrutiny to this result. Given the various rabbit holes I've been down on this issue already I'd take this information with a good degree of skepticism. Thanks, Chris