Re: [PATCH 3/3] TI QSPI: optimize transfers for dual and quad read

Jean Pihet <jean.pihet@xxxxxxxxxxxxxx> · Tue, 10 Dec 2019 16:08:21 +0100



Vignesh,

On Tue, Dec 10, 2019 at 2:17 PM Vignesh Raghavendra <vigneshr@xxxxxx> wrote:
>
>
>
> On 10/12/19 6:31 pm, Jean Pihet wrote:
> > On Tue, Dec 10, 2019 at 1:55 PM Vignesh Raghavendra <vigneshr@xxxxxx> wrote:
> >>
> >>
> >>
> >> On 10/12/19 5:55 pm, Jean Pihet wrote:
> >>> Hi Vignesh,
> >>>
> >>> On Tue, Dec 10, 2019 at 9:40 AM Vignesh Raghavendra <vigneshr@xxxxxx> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 06/12/19 9:30 pm, Jean Pihet wrote:
> >>>>> By reading the 32 bits data register and copy the contents to the
> >>>>> receive buffer, according to the single/dual/quad read mode and
> >>>>> the data length to read.
> >>>>>
> >>>>> The speed improvement is 3.5x using quad read.
> >>>>> ---
> >>>>>  drivers/spi/spi-ti-qspi.c | 48 ++++++++++++++++++++++++++-------------
> >>>>>  1 file changed, 32 insertions(+), 16 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/spi/spi-ti-qspi.c b/drivers/spi/spi-ti-qspi.c
> >>>>> index 13916232a959..65ec3bcb25ae 100644
> >>>>> --- a/drivers/spi/spi-ti-qspi.c
> >>>>> +++ b/drivers/spi/spi-ti-qspi.c
> >>>>> @@ -313,24 +313,25 @@ static int qspi_write_msg(struct ti_qspi *qspi, struct spi_transfer *t,
> >>>>>  static int qspi_read_msg(struct ti_qspi *qspi, struct spi_transfer *t,
> >>>>>                        int count)
> >>>>>  {
> >>>>> -     int wlen;
> >>>>>       unsigned int cmd;
> >>>>> +     u32 rx;
> >>>>>       u8 *rxbuf;
> >>>>> +     u8 xfer_len;
> >>>>>
> >>>>>       rxbuf = t->rx_buf;
> >>>>>       cmd = qspi->cmd;
> >>>>> +     /* Optimize the transfers for dual and quad read */
> >>>>>       switch (t->rx_nbits) {
> >>>>> -     case SPI_NBITS_DUAL:
> >>>>> -             cmd |= QSPI_RD_DUAL;
> >>>>> -             break;
> >>>>>       case SPI_NBITS_QUAD:
> >>>>> -             cmd |= QSPI_RD_QUAD;
> >>>>> +             cmd |= QSPI_RD_QUAD | QSPI_WLEN(32);
> >>>>
> >>>> Why does QUAD mode mean 32 bit words?
> >>> This is based on the assumption that every transfer is multiple of 8
> >>> clock cycles.
> >>> For SPI flash devices where bits_per_word is 8, the original code
> >>> always reads data by 1 byte, which
> >>> can be optimized.
> >>>
> >>>>
> >>>> IO mode and word size are independent of each other. If you want to
> >>>> optimize for speed, why not set FLEN field to max and WLEN to max and
> >>>> use ti_qspi_adjust_op_size to clamp max read size to 4096 words when
> >>>> required. This should work irrespective of IO modes.
> >>>> Driver already does something similar to this in the write path.
> >>>
> >>> Ok! A new patch series is coming.
> >>>
> >>>>
> >>>>> +             break;
> >>>>> +     case SPI_NBITS_DUAL:
> >>>>> +             cmd |= QSPI_RD_DUAL | QSPI_WLEN(16);
> >>>>>               break;
> >>>>>       default:
> >>>>> -             cmd |= QSPI_RD_SNGL;
> >>>>> +             cmd |= QSPI_RD_SNGL | QSPI_WLEN(8);
> >>>>>               break;
> >>>>>       }
> >>>>> -     wlen = t->bits_per_word >> 3;   /* in bytes */
> >>>>>
> >>>>>       while (count) {
> >>>>>               dev_dbg(qspi->dev, "rx cmd %08x dc %08x\n", cmd, qspi->dc);
> >>>>> @@ -342,19 +343,34 @@ static int qspi_read_msg(struct ti_qspi *qspi, struct spi_transfer *t,
> >>>>>                       dev_err(qspi->dev, "read timed out\n");
> >>>>>                       return -ETIMEDOUT;
> >>>>>               }
> >>>>> -             switch (wlen) {
> >>>>> -             case 1:
> >>>>> -                     *rxbuf = readb(qspi->base + QSPI_SPI_DATA_REG);
> >>>>> +
> >>>>> +             /* Optimize the transfers for dual and quad read */
> >>>>> +             rx = readl(qspi->base + QSPI_SPI_DATA_REG);
> >>>>> +             switch (t->rx_nbits) {
> >>>>> +             case SPI_NBITS_QUAD:
> >>>>> +                     if (count >= 1)
> >>>>> +                             *rxbuf++ = rx >> 24;
> >>>>> +                     if (count >= 2)
> >>>>> +                             *rxbuf++ = rx >> 16;
> >>>>> +                     if (count >= 3)
> >>>>> +                             *rxbuf++ = rx >> 8;
> >>>>> +                     if (count >= 4)
> >>>>> +                             *rxbuf++ = rx;
> >>>>> +                     xfer_len = min(count, 4);
> >>>>>                       break;
> >>>>> -             case 2:
> >>>>> -                     *((u16 *)rxbuf) = readw(qspi->base + QSPI_SPI_DATA_REG);
> >>>>> +             case SPI_NBITS_DUAL:
> >>>>> +                     if (count >= 1)
> >>>>> +                             *rxbuf++ = rx >> 8;
> >>>>> +                     if (count >= 2)
> >>>>> +                             *rxbuf++ = rx;
> >>>>> +                     xfer_len = min(count, 2);
> >>>>>                       break;
> >>>>> -             case 4:
> >>>>> -                     *((u32 *)rxbuf) = readl(qspi->base + QSPI_SPI_DATA_REG);
> >>>>> +             default:
> >>>>> +                     *rxbuf++ = rx;
> >>>>> +                     xfer_len = 1;
> >>>>>                       break;
> >>>>
> >>>> This will fail in case of t->rx_nbits = 1 and t->bits_per_word = 32
> >>>> (1 bit SPI bus bit slave with 32-bit addressable registers)
> >>> I do not see why this would fail. If bits_per_word is 32, count (in
> >>> bytes) is a multiple of 4 and 1 byte will be read every time the loop
> >>> runs.
> >>> Is that correct? Can you elaborate more on why this would fail?
> >>>
> >>
> >> With t->rx_nbits = 1 and t->bits_per_word = 32, controller should always
> >> read in 4 byte chunks on the SPI bus, but we have:
> >>
> >>>>> +             cmd |= QSPI_RD_SNGL | QSPI_WLEN(8);
> >>
> >> which sets word size to 8 bits and thus breaks the transaction to 1 byte
> >> chunks on the bus, which is bad.
> >
> > In that case there are 4 1-byte reads on the bus, which succeeds. The
> > trace on the logic analyzer shows 4 times 8 clock cycles.
> >
>
> Ah, sorry, there is no CS toggle, so this case will work. Although its
> less efficient as you could have set WLEN to 32 and read entire
> QSPI_SPI_DATA_REG in one transaction.
>
> But this patch definitely changes the behvior when t->rx_nbits = 4 and
> t->bits_per_word = 32. Previous code did:
>
>                *((u32 *)rxbuf) = readl(qspi->base + QSPI_SPI_DATA_REG);
>
> This patch does:
>
> +             rx = readl(qspi->base + QSPI_SPI_DATA_REG);
>                 [...]
> +             case SPI_NBITS_QUAD:
> +                     if (count >= 1)
> +                             *rxbuf++ = rx >> 24;
> +                     if (count >= 2)
> +                             *rxbuf++ = rx >> 16;
> +                     if (count >= 3)
> +                             *rxbuf++ = rx >> 8;
> +                     if (count >= 4)
> +                             *rxbuf++ = rx;
>
>
> So there is change in the endianess...
Oh! The cases where bits_per_word is different than 8 definitely needs
more thinking. I have tested the patches
with SPI flash only. How to test it with 32 bits per word?

Thanks & regards;
Jean

>
> --
> Regards
> Vignesh