Re: fsl_espi errors on v5.7.15

Nicholas Piggin <npiggin@xxxxxxxxx> · Tue, 01 Sep 2020 16:14:16 +1000

Excerpts from Chris Packham's message of September 1, 2020 11:25 am:
> 
> On 1/09/20 12:33 am, Heiner Kallweit wrote:
>> On 30.08.2020 23:59, Chris Packham wrote:
>>> On 31/08/20 9:41 am, Heiner Kallweit wrote:
>>>> On 30.08.2020 23:00, Chris Packham wrote:
>>>>> On 31/08/20 12:30 am, Nicholas Piggin wrote:
>>>>>> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
>>>>> <snip>
>>>>>
>>>>>>>>>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>>>>>>>>>>
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>>
>>>>>>>>>>> With my current workaround of emptying the RX FIFO. It seems
>>>>>>>>>>> survivable. Interestingly it only ever seems to be 1 extra byte in the
>>>>>>>>>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>>>>>>>>>>
>>>>>>>>>>> fsl_espi ffe110000.spi: tx 70
>>>>>>>>>>> fsl_espi ffe110000.spi: rx 03
>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 00
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>> fsl_espi ffe110000.spi: tx 05
>>>>>>>>>>> fsl_espi ffe110000.spi: rx 00
>>>>>>>>>>> fsl_espi ffe110000.spi: Extra RX 03
>>>>>>>>>>>
>>>>>>>>>>>     From all the Micron SPI-NOR datasheets I've got access to it is
>>>>>>>>>>> possible to continually read the SR/FSR. But I've no idea why it
>>>>>>>>>>> happens some times and not others.
>>>>>>>>>> So I think I've got a reproduction and I think I've bisected the problem
>>>>>>>>>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
>>>>>>>>>> C"). My day is just finishing now so I haven't applied too much scrutiny
>>>>>>>>>> to this result. Given the various rabbit holes I've been down on this
>>>>>>>>>> issue already I'd take this information with a good degree of skepticism.
>>>>>>>>>>
>>>>>>>>> OK, so an easy test should be to re-test with a 5.4 kernel.
>>>>>>>>> It doesn't have yet the change you're referring to, and the fsl-espi driver
>>>>>>>>> is basically the same as in 5.7 (just two small changes in 5.7).
>>>>>>>> There's 6cc0c16d82f88 and maybe also other interrupt related patches
>>>>>>>> around this time that could affect book E, so it's good if that exact
>>>>>>>> patch is confirmed.
>>>>>>> My confirmation is basically that I can induce the issue in a 5.4 kernel
>>>>>>> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
>>>>>>> 5.9-rc2 by reverting that one commit.
>>>>>>>
>>>>>>> I both cases it's not exactly a clean cherry-pick/revert so I also
>>>>>>> confirmed the bisection result by building at 3282a3da25bd (which sees
>>>>>>> the issue) and the commit just before (which does not).
>>>>>> Thanks for testing, that confirms it well.
>>>>>>
>>>>>> [snip patch]
>>>>>>
>>>>>>> I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
>>>>>>> didn't report anything (either with or without the change above).
>>>>>> Okay, it was a bit of a shot in the dark. I still can't see what
>>>>>> else has changed.
>>>>>>
>>>>>> What would cause this, a lost interrupt? A spurious interrupt? Or
>>>>>> higher interrupt latency?
>>>>>>
>>>>>> I don't think the patch should cause significantly worse latency,
>>>>>> (it's supposed to be a bit better if anything because it doesn't set
>>>>>> up the full interrupt frame). But it's possible.
>>>>> My working theory is that the SPI_DON indication is all about the TX
>>>>> direction an now that the interrupts are faster we're hitting an error
>>>>> because there is still RX activity going on. Heiner disagrees with my
>>>>> interpretation of the SPI_DON indication and the fact that it doesn't
>>>>> happen every time does throw doubt on it.
>>>>>
>>>> It's right that the eSPI spec can be interpreted that SPI_DON refers to
>>>> TX only. However this wouldn't really make sense, because also for RX
>>>> we program the frame length, and therefore want to be notified once the
>>>> full frame was received. Also practical experience shows that SPI_DON
>>>> is set also after RX-only transfers.
>>>> Typical SPI NOR use case is that you write read command + start address,
>>>> followed by a longer read. If the TX-only interpretation would be right,
>>>> we'd always end up with SPI_DON not being set.
>>>>
>>>>> I can't really explain the extra RX byte in the fifo. We know how many
>>>>> bytes to expect and we pull that many from the fifo so it's not as if
>>>>> we're missing an interrupt causing us to skip the last byte. I've been
>>>>> looking for some kind of off-by-one calculation but again if it were
>>>>> something like that it'd happen all the time.
>>>>>
>>>> Maybe it helps to know what value this extra byte in the FIFO has. Is it:
>>>> - a duplicate of the last read byte
>>>> - or the next byte (at <end address> + 1)
>>>> - or a fixed value, e.g. always 0x00 or 0xff
>>> The values were up thread a bit but I'll repeat them here
>>>
>>> fsl_espi ffe110000.spi: tx 70
>>> fsl_espi ffe110000.spi: rx 03
>>> fsl_espi ffe110000.spi: Extra RX 00
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>> fsl_espi ffe110000.spi: tx 05
>>> fsl_espi ffe110000.spi: rx 00
>>> fsl_espi ffe110000.spi: Extra RX 03
>>> fsl_espi ffe110000.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe110000.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe110000.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>> fsl_espi ffe110000.spi: tx 05
>>> fsl_espi ffe110000.spi: rx 00
>>> fsl_espi ffe110000.spi: Extra RX 03
>>>
>>>
>>> The rx 00 Extra RX 03 is a bit concerning. I've only ever seen them with
>>> either a READ_SR or a READ_FSR. Never a data read.
>>>
>> Just remembered something about SPIE_DON:
>> Transfers are always full duplex, therefore in case of a read the chip
>> sends dummy zero's. Having said that in case of a read SPIE_DON means
>> that the last dummy zero was shifted out.
>>
>> READ_SR and READ_FSR are the shortest transfers, 1 byte out and 1 byte in.
>> So the issue may have a dependency on the length of the transfer.
>> However I see no good explanation so far. You can try adding a delay of
>> a few miroseconds between the following to commands in fsl_espi_bufs().
>>
>> 	fsl_espi_write_reg(espi, ESPI_SPIM, mask);
>>
>> 	/* Prevent filling the fifo from getting interrupted */
>> 	spin_lock_irq(&espi->lock);
>>
>> Maybe enabling interrupts and seeing the SPIE_DON interrupt are too close.
> I think this might be heading in the right direction. Playing about with 
> a delay does seem to make the two symptoms less likely. Although I have 
> to set it quite high (i.e. msleep(100)) to completely avoid any 
> possibility of seeing either message.

The patch might replay the interrupt a little bit faster, but it would 
be a few microseconds at most I think (just from improved code).

Would you be able to ftrace the interrupt handler function and see if you
can see a difference in number or timing of interrupts? I'm at a bit of 
a loss.

Thanks,
Nick