On 2/09/20 11:29 am, Chris Packham wrote: > > On 1/09/20 6:14 pm, Nicholas Piggin wrote: >> Excerpts from Chris Packham's message of September 1, 2020 11:25 am: >>> On 1/09/20 12:33 am, Heiner Kallweit wrote: >>>> On 30.08.2020 23:59, Chris Packham wrote: >>>>> On 31/08/20 9:41 am, Heiner Kallweit wrote: >>>>>> On 30.08.2020 23:00, Chris Packham wrote: >>>>>>> On 31/08/20 12:30 am, Nicholas Piggin wrote: >>>>>>>> Excerpts from Chris Packham's message of August 28, 2020 8:07 am: >>>>>>> <snip> >>>>>>> >>>>>>>>>>>>> I've also now seen the RX FIFO not empty error on the >>>>>>>>>>>>> T2080RDB >>>>>>>>>>>>> >>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! >>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! >>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! >>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! >>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's >>>>>>>>>>>>> aren't empty! >>>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32 >>>>>>>>>>>>> >>>>>>>>>>>>> With my current workaround of emptying the RX FIFO. It seems >>>>>>>>>>>>> survivable. Interestingly it only ever seems to be 1 extra >>>>>>>>>>>>> byte in the >>>>>>>>>>>>> RX FIFO and it seems to be after either a READ_SR or a >>>>>>>>>>>>> READ_FSR. >>>>>>>>>>>>> >>>>>>>>>>>>> fsl_espi ffe110000.spi: tx 70 >>>>>>>>>>>>> fsl_espi ffe110000.spi: rx 03 >>>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 00 >>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! >>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's >>>>>>>>>>>>> aren't empty! >>>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32 >>>>>>>>>>>>> fsl_espi ffe110000.spi: tx 05 >>>>>>>>>>>>> fsl_espi ffe110000.spi: rx 00 >>>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03 >>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! >>>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's >>>>>>>>>>>>> aren't empty! >>>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32 >>>>>>>>>>>>> fsl_espi ffe110000.spi: tx 05 >>>>>>>>>>>>> fsl_espi ffe110000.spi: rx 00 >>>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03 >>>>>>>>>>>>> >>>>>>>>>>>>> From all the Micron SPI-NOR datasheets I've got >>>>>>>>>>>>> access to it is >>>>>>>>>>>>> possible to continually read the SR/FSR. But I've no idea >>>>>>>>>>>>> why it >>>>>>>>>>>>> happens some times and not others. >>>>>>>>>>>> So I think I've got a reproduction and I think I've >>>>>>>>>>>> bisected the problem >>>>>>>>>>>> to commit 3282a3da25bd ("powerpc/64: Implement soft >>>>>>>>>>>> interrupt replay in >>>>>>>>>>>> C"). My day is just finishing now so I haven't applied too >>>>>>>>>>>> much scrutiny >>>>>>>>>>>> to this result. Given the various rabbit holes I've been >>>>>>>>>>>> down on this >>>>>>>>>>>> issue already I'd take this information with a good degree >>>>>>>>>>>> of skepticism. >>>>>>>>>>>> >>>>>>>>>>> OK, so an easy test should be to re-test with a 5.4 kernel. >>>>>>>>>>> It doesn't have yet the change you're referring to, and the >>>>>>>>>>> fsl-espi driver >>>>>>>>>>> is basically the same as in 5.7 (just two small changes in >>>>>>>>>>> 5.7). >>>>>>>>>> There's 6cc0c16d82f88 and maybe also other interrupt related >>>>>>>>>> patches >>>>>>>>>> around this time that could affect book E, so it's good if >>>>>>>>>> that exact >>>>>>>>>> patch is confirmed. >>>>>>>>> My confirmation is basically that I can induce the issue in a >>>>>>>>> 5.4 kernel >>>>>>>>> by cherry-picking 3282a3da25bd. I'm also able to "fix" the >>>>>>>>> issue in >>>>>>>>> 5.9-rc2 by reverting that one commit. >>>>>>>>> >>>>>>>>> I both cases it's not exactly a clean cherry-pick/revert so I >>>>>>>>> also >>>>>>>>> confirmed the bisection result by building at 3282a3da25bd >>>>>>>>> (which sees >>>>>>>>> the issue) and the commit just before (which does not). >>>>>>>> Thanks for testing, that confirms it well. >>>>>>>> >>>>>>>> [snip patch] >>>>>>>> >>>>>>>>> I still saw the issue with this change applied. >>>>>>>>> PPC_IRQ_SOFT_MASK_DEBUG >>>>>>>>> didn't report anything (either with or without the change above). >>>>>>>> Okay, it was a bit of a shot in the dark. I still can't see what >>>>>>>> else has changed. >>>>>>>> >>>>>>>> What would cause this, a lost interrupt? A spurious interrupt? Or >>>>>>>> higher interrupt latency? >>>>>>>> >>>>>>>> I don't think the patch should cause significantly worse latency, >>>>>>>> (it's supposed to be a bit better if anything because it >>>>>>>> doesn't set >>>>>>>> up the full interrupt frame). But it's possible. >>>>>>> My working theory is that the SPI_DON indication is all about >>>>>>> the TX >>>>>>> direction an now that the interrupts are faster we're hitting an >>>>>>> error >>>>>>> because there is still RX activity going on. Heiner disagrees >>>>>>> with my >>>>>>> interpretation of the SPI_DON indication and the fact that it >>>>>>> doesn't >>>>>>> happen every time does throw doubt on it. >>>>>>> >>>>>> It's right that the eSPI spec can be interpreted that SPI_DON >>>>>> refers to >>>>>> TX only. However this wouldn't really make sense, because also >>>>>> for RX >>>>>> we program the frame length, and therefore want to be notified >>>>>> once the >>>>>> full frame was received. Also practical experience shows that >>>>>> SPI_DON >>>>>> is set also after RX-only transfers. >>>>>> Typical SPI NOR use case is that you write read command + start >>>>>> address, >>>>>> followed by a longer read. If the TX-only interpretation would be >>>>>> right, >>>>>> we'd always end up with SPI_DON not being set. >>>>>> >>>>>>> I can't really explain the extra RX byte in the fifo. We know >>>>>>> how many >>>>>>> bytes to expect and we pull that many from the fifo so it's not >>>>>>> as if >>>>>>> we're missing an interrupt causing us to skip the last byte. >>>>>>> I've been >>>>>>> looking for some kind of off-by-one calculation but again if it >>>>>>> were >>>>>>> something like that it'd happen all the time. >>>>>>> >>>>>> Maybe it helps to know what value this extra byte in the FIFO >>>>>> has. Is it: >>>>>> - a duplicate of the last read byte >>>>>> - or the next byte (at <end address> + 1) >>>>>> - or a fixed value, e.g. always 0x00 or 0xff >>>>> The values were up thread a bit but I'll repeat them here >>>>> >>>>> fsl_espi ffe110000.spi: tx 70 >>>>> fsl_espi ffe110000.spi: rx 03 >>>>> fsl_espi ffe110000.spi: Extra RX 00 >>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! >>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty! >>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32 >>>>> fsl_espi ffe110000.spi: tx 05 >>>>> fsl_espi ffe110000.spi: rx 00 >>>>> fsl_espi ffe110000.spi: Extra RX 03 >>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! >>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty! >>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32 >>>>> fsl_espi ffe110000.spi: tx 05 >>>>> fsl_espi ffe110000.spi: rx 00 >>>>> fsl_espi ffe110000.spi: Extra RX 03 >>>>> >>>>> >>>>> The rx 00 Extra RX 03 is a bit concerning. I've only ever seen >>>>> them with >>>>> either a READ_SR or a READ_FSR. Never a data read. >>>>> >>>> Just remembered something about SPIE_DON: >>>> Transfers are always full duplex, therefore in case of a read the chip >>>> sends dummy zero's. Having said that in case of a read SPIE_DON means >>>> that the last dummy zero was shifted out. >>>> >>>> READ_SR and READ_FSR are the shortest transfers, 1 byte out and 1 >>>> byte in. >>>> So the issue may have a dependency on the length of the transfer. >>>> However I see no good explanation so far. You can try adding a >>>> delay of >>>> a few miroseconds between the following to commands in >>>> fsl_espi_bufs(). >>>> >>>> fsl_espi_write_reg(espi, ESPI_SPIM, mask); >>>> >>>> /* Prevent filling the fifo from getting interrupted */ >>>> spin_lock_irq(&espi->lock); >>>> >>>> Maybe enabling interrupts and seeing the SPIE_DON interrupt are too >>>> close. >>> I think this might be heading in the right direction. Playing about >>> with >>> a delay does seem to make the two symptoms less likely. Although I have >>> to set it quite high (i.e. msleep(100)) to completely avoid any >>> possibility of seeing either message. >> The patch might replay the interrupt a little bit faster, but it would >> be a few microseconds at most I think (just from improved code). >> >> Would you be able to ftrace the interrupt handler function and see if >> you >> can see a difference in number or timing of interrupts? I'm at a bit of >> a loss. > This is getting really weird. I was setting up to run with ftrace and > found I couldn't reproduce it on the tip of Linus's tree (currently > pointing at e7a522c83b86). But I swear I could last week. Sure enough > if I checkout 5.9-rc2 (or 5.7.15) I can reproduce the problem again. *Sigh* my master branch still has the interrupt changes reverted.