On Thu, 2020-09-03 at 23:58 +0000, Chris Packham wrote: > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you recognize the sender and know the content is safe. > > > On 1/09/20 6:14 pm, Nicholas Piggin wrote: > > Excerpts from Chris Packham's message of September 1, 2020 11:25 am: > > > On 1/09/20 12:33 am, Heiner Kallweit wrote: > > > > On 30.08.2020 23:59, Chris Packham wrote: > > > > > On 31/08/20 9:41 am, Heiner Kallweit wrote: > > > > > > On 30.08.2020 23:00, Chris Packham wrote: > > > > > > > On 31/08/20 12:30 am, Nicholas Piggin wrote: > > > > > > > > Excerpts from Chris Packham's message of August 28, 2020 8:07 am: > > > > > > > <snip> > > > > > > > > > > > > > > > > > > > > I've also now seen the RX FIFO not empty error on the T2080RDB > > > > > > > > > > > > > > > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! > > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! > > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! > > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! > > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty! > > > > > > > > > > > > > fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32 > > > > > > > > > > > > > > > > > > > > > > > > > > With my current workaround of emptying the RX FIFO. It seems > > > > > > > > > > > > > survivable. Interestingly it only ever seems to be 1 extra byte in the > > > > > > > > > > > > > RX FIFO and it seems to be after either a READ_SR or a READ_FSR. > > > > > > > > > > > > > > > > > > > > > > > > > > fsl_espi ffe110000.spi: tx 70 > > > > > > > > > > > > > fsl_espi ffe110000.spi: rx 03 > > > > > > > > > > > > > fsl_espi ffe110000.spi: Extra RX 00 > > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! > > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty! > > > > > > > > > > > > > fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32 > > > > > > > > > > > > > fsl_espi ffe110000.spi: tx 05 > > > > > > > > > > > > > fsl_espi ffe110000.spi: rx 00 > > > > > > > > > > > > > fsl_espi ffe110000.spi: Extra RX 03 > > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! > > > > > > > > > > > > > fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty! > > > > > > > > > > > > > fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32 > > > > > > > > > > > > > fsl_espi ffe110000.spi: tx 05 > > > > > > > > > > > > > fsl_espi ffe110000.spi: rx 00 > > > > > > > > > > > > > fsl_espi ffe110000.spi: Extra RX 03 > > > > > > > > > > > > > > > > > > > > > > > > > > From all the Micron SPI-NOR datasheets I've got access to it is > > > > > > > > > > > > > possible to continually read the SR/FSR. But I've no idea why it > > > > > > > > > > > > > happens some times and not others. > > > > > > > > > > > > So I think I've got a reproduction and I think I've bisected the problem > > > > > > > > > > > > to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in > > > > > > > > > > > > C"). My day is just finishing now so I haven't applied too much scrutiny > > > > > > > > > > > > to this result. Given the various rabbit holes I've been down on this > > > > > > > > > > > > issue already I'd take this information with a good degree of skepticism. > > > > > > > > > > > > > > > > > > > > > > > OK, so an easy test should be to re-test with a 5.4 kernel. > > > > > > > > > > > It doesn't have yet the change you're referring to, and the fsl-espi driver > > > > > > > > > > > is basically the same as in 5.7 (just two small changes in 5.7). > > > > > > > > > > There's 6cc0c16d82f88 and maybe also other interrupt related patches > > > > > > > > > > around this time that could affect book E, so it's good if that exact > > > > > > > > > > patch is confirmed. > > > > > > > > > My confirmation is basically that I can induce the issue in a 5.4 kernel > > > > > > > > > by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in > > > > > > > > > 5.9-rc2 by reverting that one commit. > > > > > > > > > > > > > > > > > > I both cases it's not exactly a clean cherry-pick/revert so I also > > > > > > > > > confirmed the bisection result by building at 3282a3da25bd (which sees > > > > > > > > > the issue) and the commit just before (which does not). > > > > > > > > Thanks for testing, that confirms it well. > > > > > > > > > > > > > > > > [snip patch] > > > > > > > > > > > > > > > > > I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG > > > > > > > > > didn't report anything (either with or without the change above). > > > > > > > > Okay, it was a bit of a shot in the dark. I still can't see what > > > > > > > > else has changed. > > > > > > > > > > > > > > > > What would cause this, a lost interrupt? A spurious interrupt? Or > > > > > > > > higher interrupt latency? > > > > > > > > > > > > > > > > I don't think the patch should cause significantly worse latency, > > > > > > > > (it's supposed to be a bit better if anything because it doesn't set > > > > > > > > up the full interrupt frame). But it's possible. > > > > > > > My working theory is that the SPI_DON indication is all about the TX > > > > > > > direction an now that the interrupts are faster we're hitting an error > > > > > > > because there is still RX activity going on. Heiner disagrees with my > > > > > > > interpretation of the SPI_DON indication and the fact that it doesn't > > > > > > > happen every time does throw doubt on it. > > > > > > > > > > > > > It's right that the eSPI spec can be interpreted that SPI_DON refers to > > > > > > TX only. However this wouldn't really make sense, because also for RX > > > > > > we program the frame length, and therefore want to be notified once the > > > > > > full frame was received. Also practical experience shows that SPI_DON > > > > > > is set also after RX-only transfers. > > > > > > Typical SPI NOR use case is that you write read command + start address, > > > > > > followed by a longer read. If the TX-only interpretation would be right, > > > > > > we'd always end up with SPI_DON not being set. > > > > > > > > > > > > > I can't really explain the extra RX byte in the fifo. We know how many > > > > > > > bytes to expect and we pull that many from the fifo so it's not as if > > > > > > > we're missing an interrupt causing us to skip the last byte. I've been > > > > > > > looking for some kind of off-by-one calculation but again if it were > > > > > > > something like that it'd happen all the time. > > > > > > > > > > > > > Maybe it helps to know what value this extra byte in the FIFO has. Is it: > > > > > > - a duplicate of the last read byte > > > > > > - or the next byte (at <end address> + 1) > > > > > > - or a fixed value, e.g. always 0x00 or 0xff > > > > > The values were up thread a bit but I'll repeat them here > > > > > > > > > > fsl_espi ffe110000.spi: tx 70 > > > > > fsl_espi ffe110000.spi: rx 03 > > > > > fsl_espi ffe110000.spi: Extra RX 00 > > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! > > > > > fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty! > > > > > fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32 > > > > > fsl_espi ffe110000.spi: tx 05 > > > > > fsl_espi ffe110000.spi: rx 00 > > > > > fsl_espi ffe110000.spi: Extra RX 03 > > > > > fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set! > > > > > fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty! > > > > > fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32 > > > > > fsl_espi ffe110000.spi: tx 05 > > > > > fsl_espi ffe110000.spi: rx 00 > > > > > fsl_espi ffe110000.spi: Extra RX 03 > > > > > > > > > > > > > > > The rx 00 Extra RX 03 is a bit concerning. I've only ever seen them with > > > > > either a READ_SR or a READ_FSR. Never a data read. > > > > > > > > > Just remembered something about SPIE_DON: > > > > Transfers are always full duplex, therefore in case of a read the chip > > > > sends dummy zero's. Having said that in case of a read SPIE_DON means > > > > that the last dummy zero was shifted out. > > > > > > > > READ_SR and READ_FSR are the shortest transfers, 1 byte out and 1 byte in. > > > > So the issue may have a dependency on the length of the transfer. > > > > However I see no good explanation so far. You can try adding a delay of > > > > a few miroseconds between the following to commands in fsl_espi_bufs(). > > > > > > > > fsl_espi_write_reg(espi, ESPI_SPIM, mask); > > > > > > > > /* Prevent filling the fifo from getting interrupted */ > > > > spin_lock_irq(&espi->lock); > > > > > > > > Maybe enabling interrupts and seeing the SPIE_DON interrupt are too close. > > > I think this might be heading in the right direction. Playing about with > > > a delay does seem to make the two symptoms less likely. Although I have > > > to set it quite high (i.e. msleep(100)) to completely avoid any > > > possibility of seeing either message. > > The patch might replay the interrupt a little bit faster, but it would > > be a few microseconds at most I think (just from improved code). > > > > Would you be able to ftrace the interrupt handler function and see if you > > can see a difference in number or timing of interrupts? I'm at a bit of > > a loss. > > I tried ftrace but I really wasn't sure what I was looking for. > Capturing a "bad" case was pretty tricky. But I think I've identified a > fix (I'll send it as a proper patch shortly). The gist is > > diff --git a/drivers/spi/spi-fsl-espi.c b/drivers/spi/spi-fsl-espi.c > index 7e7c92cafdbb..cb120b68c0e2 100644 > --- a/drivers/spi/spi-fsl-espi.c > +++ b/drivers/spi/spi-fsl-espi.c > @@ -574,13 +574,14 @@ static void fsl_espi_cpu_irq(struct fsl_espi > *espi, u32 events) > static irqreturn_t fsl_espi_irq(s32 irq, void *context_data) > { > struct fsl_espi *espi = context_data; > - u32 events; > + u32 events, mask; > > spin_lock(&espi->lock); > > /* Get interrupt events(tx/rx) */ > events = fsl_espi_read_reg(espi, ESPI_SPIE); > - if (!events) { > + mask = fsl_espi_read_reg(espi, ESPI_SPIM); > + if (!(events & mask)) { > spin_unlock(&espi->lock); > return IRQ_NONE; > } > > The SPIE register contains the TXCNT so events is pretty much always > going to have something set. By checking events against what we've > actually requested interrupts for we don't see any spurious events. > > I've tested this on the T2080RDB and on our custom hardware and it seems > to resolve the problem. > I looked at the fsl_espi_irq() too and noticed that clearing of the IRQ events are after processing TX/RX. That looks a bit odd to me. Jocke