On 1/09/20 6:14 pm, Nicholas Piggin wrote: > Excerpts from Chris Packham's message of September 1, 2020 11:25 am: >> On 1/09/20 12:33 am, Heiner Kallweit wrote: >>> On 30.08.2020 23:59, Chris Packham wrote: >>>> On 31/08/20 9:41 am, Heiner Kallweit wrote: >>>>> On 30.08.2020 23:00, Chris Packham wrote: >>>>>> On 31/08/20 12:30 am, Nicholas Piggin wrote: >>>>>>> Excerpts from Chris Packham's message of August 28, 2020 8:07 am: >>>>>> <snip> >>>>>> >>>>>>>>>>>> I've also now seen the RX FIFO not empty error on the T2080RDB >>>>>>>>>>>> >>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! >>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! >>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! >>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! >>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty! >>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32 >>>>>>>>>>>> >>>>>>>>>>>> With my current workaround of emptying the RX FIFO. It seems >>>>>>>>>>>> survivable. Interestingly it only ever seems to be 1 extra byte in the >>>>>>>>>>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR. >>>>>>>>>>>> >>>>>>>>>>>> fsl_espi ffe110000.spi: tx 70 >>>>>>>>>>>> fsl_espi ffe110000.spi: rx 03 >>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 00 >>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! >>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty! >>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32 >>>>>>>>>>>> fsl_espi ffe110000.spi: tx 05 >>>>>>>>>>>> fsl_espi ffe110000.spi: rx 00 >>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03 >>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! >>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty! >>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32 >>>>>>>>>>>> fsl_espi ffe110000.spi: tx 05 >>>>>>>>>>>> fsl_espi ffe110000.spi: rx 00 >>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03 >>>>>>>>>>>> >>>>>>>>>>>> From all the Micron SPI-NOR datasheets I've got access to it is >>>>>>>>>>>> possible to continually read the SR/FSR. But I've no idea why it >>>>>>>>>>>> happens some times and not others. >>>>>>>>>>> So I think I've got a reproduction and I think I've bisected the problem >>>>>>>>>>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in >>>>>>>>>>> C"). My day is just finishing now so I haven't applied too much scrutiny >>>>>>>>>>> to this result. Given the various rabbit holes I've been down on this >>>>>>>>>>> issue already I'd take this information with a good degree of skepticism. >>>>>>>>>>> >>>>>>>>>> OK, so an easy test should be to re-test with a 5.4 kernel. >>>>>>>>>> It doesn't have yet the change you're referring to, and the fsl-espi driver >>>>>>>>>> is basically the same as in 5.7 (just two small changes in 5.7). >>>>>>>>> There's 6cc0c16d82f88 and maybe also other interrupt related patches >>>>>>>>> around this time that could affect book E, so it's good if that exact >>>>>>>>> patch is confirmed. >>>>>>>> My confirmation is basically that I can induce the issue in a 5.4 kernel >>>>>>>> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in >>>>>>>> 5.9-rc2 by reverting that one commit. >>>>>>>> >>>>>>>> I both cases it's not exactly a clean cherry-pick/revert so I also >>>>>>>> confirmed the bisection result by building at 3282a3da25bd (which sees >>>>>>>> the issue) and the commit just before (which does not). >>>>>>> Thanks for testing, that confirms it well. >>>>>>> >>>>>>> [snip patch] >>>>>>> >>>>>>>> I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG >>>>>>>> didn't report anything (either with or without the change above). >>>>>>> Okay, it was a bit of a shot in the dark. I still can't see what >>>>>>> else has changed. >>>>>>> >>>>>>> What would cause this, a lost interrupt? A spurious interrupt? Or >>>>>>> higher interrupt latency? >>>>>>> >>>>>>> I don't think the patch should cause significantly worse latency, >>>>>>> (it's supposed to be a bit better if anything because it doesn't set >>>>>>> up the full interrupt frame). But it's possible. >>>>>> My working theory is that the SPI_DON indication is all about the TX >>>>>> direction an now that the interrupts are faster we're hitting an error >>>>>> because there is still RX activity going on. Heiner disagrees with my >>>>>> interpretation of the SPI_DON indication and the fact that it doesn't >>>>>> happen every time does throw doubt on it. >>>>>> >>>>> It's right that the eSPI spec can be interpreted that SPI_DON refers to >>>>> TX only. However this wouldn't really make sense, because also for RX >>>>> we program the frame length, and therefore want to be notified once the >>>>> full frame was received. Also practical experience shows that SPI_DON >>>>> is set also after RX-only transfers. >>>>> Typical SPI NOR use case is that you write read command + start address, >>>>> followed by a longer read. If the TX-only interpretation would be right, >>>>> we'd always end up with SPI_DON not being set. >>>>> >>>>>> I can't really explain the extra RX byte in the fifo. We know how many >>>>>> bytes to expect and we pull that many from the fifo so it's not as if >>>>>> we're missing an interrupt causing us to skip the last byte. I've been >>>>>> looking for some kind of off-by-one calculation but again if it were >>>>>> something like that it'd happen all the time. >>>>>> >>>>> Maybe it helps to know what value this extra byte in the FIFO has. Is it: >>>>> - a duplicate of the last read byte >>>>> - or the next byte (at <end address> + 1) >>>>> - or a fixed value, e.g. always 0x00 or 0xff >>>> The values were up thread a bit but I'll repeat them here >>>> >>>> fsl_espi ffe110000.spi: tx 70 >>>> fsl_espi ffe110000.spi: rx 03 >>>> fsl_espi ffe110000.spi: Extra RX 00 >>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! >>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty! >>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32 >>>> fsl_espi ffe110000.spi: tx 05 >>>> fsl_espi ffe110000.spi: rx 00 >>>> fsl_espi ffe110000.spi: Extra RX 03 >>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! >>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty! >>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32 >>>> fsl_espi ffe110000.spi: tx 05 >>>> fsl_espi ffe110000.spi: rx 00 >>>> fsl_espi ffe110000.spi: Extra RX 03 >>>> >>>> >>>> The rx 00 Extra RX 03 is a bit concerning. I've only ever seen them with >>>> either a READ_SR or a READ_FSR. Never a data read. >>>> >>> Just remembered something about SPIE_DON: >>> Transfers are always full duplex, therefore in case of a read the chip >>> sends dummy zero's. Having said that in case of a read SPIE_DON means >>> that the last dummy zero was shifted out. >>> >>> READ_SR and READ_FSR are the shortest transfers, 1 byte out and 1 byte in. >>> So the issue may have a dependency on the length of the transfer. >>> However I see no good explanation so far. You can try adding a delay of >>> a few miroseconds between the following to commands in fsl_espi_bufs(). >>> >>> fsl_espi_write_reg(espi, ESPI_SPIM, mask); >>> >>> /* Prevent filling the fifo from getting interrupted */ >>> spin_lock_irq(&espi->lock); >>> >>> Maybe enabling interrupts and seeing the SPIE_DON interrupt are too close. >> I think this might be heading in the right direction. Playing about with >> a delay does seem to make the two symptoms less likely. Although I have >> to set it quite high (i.e. msleep(100)) to completely avoid any >> possibility of seeing either message. > The patch might replay the interrupt a little bit faster, but it would > be a few microseconds at most I think (just from improved code). > > Would you be able to ftrace the interrupt handler function and see if you > can see a difference in number or timing of interrupts? I'm at a bit of > a loss. This is getting really weird. I was setting up to run with ftrace and found I couldn't reproduce it on the tip of Linus's tree (currently pointing at e7a522c83b86). But I swear I could last week. Sure enough if I checkout 5.9-rc2 (or 5.7.15) I can reproduce the problem again. > Thanks, > Nick >