Re: [PATCH 3/3] TI QSPI: optimize transfers for dual and quad read

Vignesh Raghavendra <vigneshr@xxxxxx> · Tue, 10 Dec 2019 18:48:13 +0530

On 10/12/19 6:31 pm, Jean Pihet wrote:
> On Tue, Dec 10, 2019 at 1:55 PM Vignesh Raghavendra <vigneshr@xxxxxx> wrote:
>>
>>
>>
>> On 10/12/19 5:55 pm, Jean Pihet wrote:
>>> Hi Vignesh,
>>>
>>> On Tue, Dec 10, 2019 at 9:40 AM Vignesh Raghavendra <vigneshr@xxxxxx> wrote:
>>>>
>>>>
>>>>
>>>> On 06/12/19 9:30 pm, Jean Pihet wrote:
>>>>> By reading the 32 bits data register and copy the contents to the
>>>>> receive buffer, according to the single/dual/quad read mode and
>>>>> the data length to read.
>>>>>
>>>>> The speed improvement is 3.5x using quad read.
>>>>> ---
>>>>>  drivers/spi/spi-ti-qspi.c | 48 ++++++++++++++++++++++++++-------------
>>>>>  1 file changed, 32 insertions(+), 16 deletions(-)
>>>>>
>>>>> diff --git a/drivers/spi/spi-ti-qspi.c b/drivers/spi/spi-ti-qspi.c
>>>>> index 13916232a959..65ec3bcb25ae 100644
>>>>> --- a/drivers/spi/spi-ti-qspi.c
>>>>> +++ b/drivers/spi/spi-ti-qspi.c
>>>>> @@ -313,24 +313,25 @@ static int qspi_write_msg(struct ti_qspi *qspi, struct spi_transfer *t,
>>>>>  static int qspi_read_msg(struct ti_qspi *qspi, struct spi_transfer *t,
>>>>>                        int count)
>>>>>  {
>>>>> -     int wlen;
>>>>>       unsigned int cmd;
>>>>> +     u32 rx;
>>>>>       u8 *rxbuf;
>>>>> +     u8 xfer_len;
>>>>>
>>>>>       rxbuf = t->rx_buf;
>>>>>       cmd = qspi->cmd;
>>>>> +     /* Optimize the transfers for dual and quad read */
>>>>>       switch (t->rx_nbits) {
>>>>> -     case SPI_NBITS_DUAL:
>>>>> -             cmd |= QSPI_RD_DUAL;
>>>>> -             break;
>>>>>       case SPI_NBITS_QUAD:
>>>>> -             cmd |= QSPI_RD_QUAD;
>>>>> +             cmd |= QSPI_RD_QUAD | QSPI_WLEN(32);
>>>>
>>>> Why does QUAD mode mean 32 bit words?
>>> This is based on the assumption that every transfer is multiple of 8
>>> clock cycles.
>>> For SPI flash devices where bits_per_word is 8, the original code
>>> always reads data by 1 byte, which
>>> can be optimized.
>>>
>>>>
>>>> IO mode and word size are independent of each other. If you want to
>>>> optimize for speed, why not set FLEN field to max and WLEN to max and
>>>> use ti_qspi_adjust_op_size to clamp max read size to 4096 words when
>>>> required. This should work irrespective of IO modes.
>>>> Driver already does something similar to this in the write path.
>>>
>>> Ok! A new patch series is coming.
>>>
>>>>
>>>>> +             break;
>>>>> +     case SPI_NBITS_DUAL:
>>>>> +             cmd |= QSPI_RD_DUAL | QSPI_WLEN(16);
>>>>>               break;
>>>>>       default:
>>>>> -             cmd |= QSPI_RD_SNGL;
>>>>> +             cmd |= QSPI_RD_SNGL | QSPI_WLEN(8);
>>>>>               break;
>>>>>       }
>>>>> -     wlen = t->bits_per_word >> 3;   /* in bytes */
>>>>>
>>>>>       while (count) {
>>>>>               dev_dbg(qspi->dev, "rx cmd %08x dc %08x\n", cmd, qspi->dc);
>>>>> @@ -342,19 +343,34 @@ static int qspi_read_msg(struct ti_qspi *qspi, struct spi_transfer *t,
>>>>>                       dev_err(qspi->dev, "read timed out\n");
>>>>>                       return -ETIMEDOUT;
>>>>>               }
>>>>> -             switch (wlen) {
>>>>> -             case 1:
>>>>> -                     *rxbuf = readb(qspi->base + QSPI_SPI_DATA_REG);
>>>>> +
>>>>> +             /* Optimize the transfers for dual and quad read */
>>>>> +             rx = readl(qspi->base + QSPI_SPI_DATA_REG);
>>>>> +             switch (t->rx_nbits) {
>>>>> +             case SPI_NBITS_QUAD:
>>>>> +                     if (count >= 1)
>>>>> +                             *rxbuf++ = rx >> 24;
>>>>> +                     if (count >= 2)
>>>>> +                             *rxbuf++ = rx >> 16;
>>>>> +                     if (count >= 3)
>>>>> +                             *rxbuf++ = rx >> 8;
>>>>> +                     if (count >= 4)
>>>>> +                             *rxbuf++ = rx;
>>>>> +                     xfer_len = min(count, 4);
>>>>>                       break;
>>>>> -             case 2:
>>>>> -                     *((u16 *)rxbuf) = readw(qspi->base + QSPI_SPI_DATA_REG);
>>>>> +             case SPI_NBITS_DUAL:
>>>>> +                     if (count >= 1)
>>>>> +                             *rxbuf++ = rx >> 8;
>>>>> +                     if (count >= 2)
>>>>> +                             *rxbuf++ = rx;
>>>>> +                     xfer_len = min(count, 2);
>>>>>                       break;
>>>>> -             case 4:
>>>>> -                     *((u32 *)rxbuf) = readl(qspi->base + QSPI_SPI_DATA_REG);
>>>>> +             default:
>>>>> +                     *rxbuf++ = rx;
>>>>> +                     xfer_len = 1;
>>>>>                       break;
>>>>
>>>> This will fail in case of t->rx_nbits = 1 and t->bits_per_word = 32
>>>> (1 bit SPI bus bit slave with 32-bit addressable registers)
>>> I do not see why this would fail. If bits_per_word is 32, count (in
>>> bytes) is a multiple of 4 and 1 byte will be read every time the loop
>>> runs.
>>> Is that correct? Can you elaborate more on why this would fail?
>>>
>>
>> With t->rx_nbits = 1 and t->bits_per_word = 32, controller should always
>> read in 4 byte chunks on the SPI bus, but we have:
>>
>>>>> +             cmd |= QSPI_RD_SNGL | QSPI_WLEN(8);
>>
>> which sets word size to 8 bits and thus breaks the transaction to 1 byte
>> chunks on the bus, which is bad.
> 
> In that case there are 4 1-byte reads on the bus, which succeeds. The
> trace on the logic analyzer shows 4 times 8 clock cycles.
> 

Ah, sorry, there is no CS toggle, so this case will work. Although its
less efficient as you could have set WLEN to 32 and read entire
QSPI_SPI_DATA_REG in one transaction.

But this patch definitely changes the behvior when t->rx_nbits = 4 and
t->bits_per_word = 32. Previous code did:

               *((u32 *)rxbuf) = readl(qspi->base + QSPI_SPI_DATA_REG);

This patch does:

+             rx = readl(qspi->base + QSPI_SPI_DATA_REG);
		[...]
+             case SPI_NBITS_QUAD:
+                     if (count >= 1)
+                             *rxbuf++ = rx >> 24;
+                     if (count >= 2)
+                             *rxbuf++ = rx >> 16;
+                     if (count >= 3)
+                             *rxbuf++ = rx >> 8;
+                     if (count >= 4)
+                             *rxbuf++ = rx;

So there is change in the endianess...

-- 
Regards
Vignesh