Excerpts from Chris Packham's message of September 1, 2020 11:25 am: > > On 1/09/20 12:33 am, Heiner Kallweit wrote: >> On 30.08.2020 23:59, Chris Packham wrote: >>> On 31/08/20 9:41 am, Heiner Kallweit wrote: >>>> On 30.08.2020 23:00, Chris Packham wrote: >>>>> On 31/08/20 12:30 am, Nicholas Piggin wrote: >>>>>> Excerpts from Chris Packham's message of August 28, 2020 8:07 am: >>>>> <snip> >>>>> >>>>>>>>>>> I've also now seen the RX FIFO not empty error on the T2080RDB >>>>>>>>>>> >>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! >>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! >>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! >>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! >>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty! >>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32 >>>>>>>>>>> >>>>>>>>>>> With my current workaround of emptying the RX FIFO. It seems >>>>>>>>>>> survivable. Interestingly it only ever seems to be 1 extra byte in the >>>>>>>>>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR. >>>>>>>>>>> >>>>>>>>>>> fsl_espi ffe110000.spi: tx 70 >>>>>>>>>>> fsl_espi ffe110000.spi: rx 03 >>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 00 >>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! >>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty! >>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32 >>>>>>>>>>> fsl_espi ffe110000.spi: tx 05 >>>>>>>>>>> fsl_espi ffe110000.spi: rx 00 >>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03 >>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! >>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty! >>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32 >>>>>>>>>>> fsl_espi ffe110000.spi: tx 05 >>>>>>>>>>> fsl_espi ffe110000.spi: rx 00 >>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03 >>>>>>>>>>> >>>>>>>>>>> From all the Micron SPI-NOR datasheets I've got access to it is >>>>>>>>>>> possible to continually read the SR/FSR. But I've no idea why it >>>>>>>>>>> happens some times and not others. >>>>>>>>>> So I think I've got a reproduction and I think I've bisected the problem >>>>>>>>>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in >>>>>>>>>> C"). My day is just finishing now so I haven't applied too much scrutiny >>>>>>>>>> to this result. Given the various rabbit holes I've been down on this >>>>>>>>>> issue already I'd take this information with a good degree of skepticism. >>>>>>>>>> >>>>>>>>> OK, so an easy test should be to re-test with a 5.4 kernel. >>>>>>>>> It doesn't have yet the change you're referring to, and the fsl-espi driver >>>>>>>>> is basically the same as in 5.7 (just two small changes in 5.7). >>>>>>>> There's 6cc0c16d82f88 and maybe also other interrupt related patches >>>>>>>> around this time that could affect book E, so it's good if that exact >>>>>>>> patch is confirmed. >>>>>>> My confirmation is basically that I can induce the issue in a 5.4 kernel >>>>>>> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in >>>>>>> 5.9-rc2 by reverting that one commit. >>>>>>> >>>>>>> I both cases it's not exactly a clean cherry-pick/revert so I also >>>>>>> confirmed the bisection result by building at 3282a3da25bd (which sees >>>>>>> the issue) and the commit just before (which does not). >>>>>> Thanks for testing, that confirms it well. >>>>>> >>>>>> [snip patch] >>>>>> >>>>>>> I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG >>>>>>> didn't report anything (either with or without the change above). >>>>>> Okay, it was a bit of a shot in the dark. I still can't see what >>>>>> else has changed. >>>>>> >>>>>> What would cause this, a lost interrupt? A spurious interrupt? Or >>>>>> higher interrupt latency? >>>>>> >>>>>> I don't think the patch should cause significantly worse latency, >>>>>> (it's supposed to be a bit better if anything because it doesn't set >>>>>> up the full interrupt frame). But it's possible. >>>>> My working theory is that the SPI_DON indication is all about the TX >>>>> direction an now that the interrupts are faster we're hitting an error >>>>> because there is still RX activity going on. Heiner disagrees with my >>>>> interpretation of the SPI_DON indication and the fact that it doesn't >>>>> happen every time does throw doubt on it. >>>>> >>>> It's right that the eSPI spec can be interpreted that SPI_DON refers to >>>> TX only. However this wouldn't really make sense, because also for RX >>>> we program the frame length, and therefore want to be notified once the >>>> full frame was received. Also practical experience shows that SPI_DON >>>> is set also after RX-only transfers. >>>> Typical SPI NOR use case is that you write read command + start address, >>>> followed by a longer read. If the TX-only interpretation would be right, >>>> we'd always end up with SPI_DON not being set. >>>> >>>>> I can't really explain the extra RX byte in the fifo. We know how many >>>>> bytes to expect and we pull that many from the fifo so it's not as if >>>>> we're missing an interrupt causing us to skip the last byte. I've been >>>>> looking for some kind of off-by-one calculation but again if it were >>>>> something like that it'd happen all the time. >>>>> >>>> Maybe it helps to know what value this extra byte in the FIFO has. Is it: >>>> - a duplicate of the last read byte >>>> - or the next byte (at <end address> + 1) >>>> - or a fixed value, e.g. always 0x00 or 0xff >>> The values were up thread a bit but I'll repeat them here >>> >>> fsl_espi ffe110000.spi: tx 70 >>> fsl_espi ffe110000.spi: rx 03 >>> fsl_espi ffe110000.spi: Extra RX 00 >>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! >>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty! >>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32 >>> fsl_espi ffe110000.spi: tx 05 >>> fsl_espi ffe110000.spi: rx 00 >>> fsl_espi ffe110000.spi: Extra RX 03 >>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! >>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty! >>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32 >>> fsl_espi ffe110000.spi: tx 05 >>> fsl_espi ffe110000.spi: rx 00 >>> fsl_espi ffe110000.spi: Extra RX 03 >>> >>> >>> The rx 00 Extra RX 03 is a bit concerning. I've only ever seen them with >>> either a READ_SR or a READ_FSR. Never a data read. >>> >> Just remembered something about SPIE_DON: >> Transfers are always full duplex, therefore in case of a read the chip >> sends dummy zero's. Having said that in case of a read SPIE_DON means >> that the last dummy zero was shifted out. >> >> READ_SR and READ_FSR are the shortest transfers, 1 byte out and 1 byte in. >> So the issue may have a dependency on the length of the transfer. >> However I see no good explanation so far. You can try adding a delay of >> a few miroseconds between the following to commands in fsl_espi_bufs(). >> >> fsl_espi_write_reg(espi, ESPI_SPIM, mask); >> >> /* Prevent filling the fifo from getting interrupted */ >> spin_lock_irq(&espi->lock); >> >> Maybe enabling interrupts and seeing the SPIE_DON interrupt are too close. > I think this might be heading in the right direction. Playing about with > a delay does seem to make the two symptoms less likely. Although I have > to set it quite high (i.e. msleep(100)) to completely avoid any > possibility of seeing either message. The patch might replay the interrupt a little bit faster, but it would be a few microseconds at most I think (just from improved code). Would you be able to ftrace the interrupt handler function and see if you can see a difference in number or timing of interrupts? I'm at a bit of a loss. Thanks, Nick