On Wed, Feb 12, 2020 at 5:20 AM Horia Geanta <horia.geanta@xxxxxxx> wrote:
>
> On 1/27/2020 6:57 PM, Andrey Smirnov wrote:
> > Rework CAAM RNG implementation as follows:
> >
> > - Make use of the fact that HWRNG supports partial reads and will
> >   handle such cases gracefully by removing recursion in caam_read()
> >
> > - Convert blocking caam_read() codepath to do a single blocking job
> >   read directly into the requested buffer, bypassing any intermediary
> >   buffers
> >
> > - Convert async caam_read() codepath into a simple single
> >   reader/single writer FIFO use-case, thus simplifying concurrency
> >   handling and delegating buffer read/write position management to
> >   the KFIFO subsystem.
> >
> > - Leverage the same low level RNG data extraction code for both async
> >   and blocking caam_read() scenarios, get rid of the shared job
> >   descriptor and make the non-shared one as simple as possible (just
> >   HEADER + ALGORITHM OPERATION + FIFO STORE)
> >
> > - Split private context from DMA related memory, so that the former
> >   could be allocated without GFP_DMA.
> >
> > NOTE: At face value this commit decreases the throughput numbers
> > reported by
> >
> >   dd if=/dev/hwrng of=/dev/null bs=1 count=100K [iflag=nonblock]
> >
> > by about 15%, however commits that enable prediction resistance and
> Running dd as mentioned above, on an i.MX8MM board I see:
> ~ 20% decrease in the non-blocking case (525 kB/s vs. 662 kB/s)
> ~ 75% decrease in the blocking case (170 kB/s vs. 657 kB/s)
>
> bs=1 is a bit drastic.
> Using bs=16 the numbers look better in terms of overall speed,
> however the relative degradation is still there:
> ~ 66% decrease in the blocking case (3.5 MB/s vs. 10.1 MB/s)
>
> > limit JR total size impact the performance so much and move the
> > bottleneck such as to make this regression irrelevant.
> >
> Yes, performance is greatly impacted by moving from a DRBG configuration
> to a TRNG one.
>
> The speed that I get with this patch set (1.3 kB/s)
> is ~ 20% lower than the theoretical output (1.583 kB/s) (see below).
> Seeing this and also the relative decrease in the DRBG case
> makes me wonder whether the SW overhead could be lowered.
>
> Theoretical TRNG output speed in this configuration
> can be computed as:
> Speed = (SZ x CAAM_CLK_FREQ) / (RTSDCTL[ENT_DLY] x RTSDCTL[SAMP_SIZE]) [bps]
>
> SZ is the size of the sample taken from the DRBG between two consecutive
> reseedings. As previously discussed, this is limited to 128 bits (16 bytes),
> such that the DRBG behaves as a TRNG.
>
> If:
> - CAAM_CLK_FREQ = 166 MHz (as for i.MX8M*)
> - RTSDCTL[ENT_DLY] = 3200 clocks (default / POR value)
> - RTSDCTL[SAMP_SIZE] = 512 (recommended; default / POR value is 2500)
> then the theoretical speed is 1.583 kB/s.
>
> > @@ -45,38 +22,34 @@
> >  #include "jr.h"
> >  #include "error.h"
> >
> > +/* length of descriptors */
> This comment is misplaced; the length of descriptors (CAAM_RNG_DESC_LEN)
> is defined further below.
>
> > +#define CAAM_RNG_MAX_FIFO_STORE_SIZE U16_MAX
> > +
> > +#define CAAM_RNG_FIFO_LEN SZ_32K /* Must be a multiple of 2 */
> > +
> >  /*
> > - * Maximum buffer size: maximum number of random, cache-aligned bytes that
> > - * will be generated and moved to seq out ptr (extlen not allowed)
> > + * See caam_init_desc()
> >   */
> > -#define RN_BUF_SIZE		(0xffff / L1_CACHE_BYTES * \
> > -				 L1_CACHE_BYTES)
> > +#define CAAM_RNG_DESC_LEN (CAAM_CMD_SZ + \
> > +			   CAAM_CMD_SZ + \
> > +			   CAAM_CMD_SZ + CAAM_PTR_SZ_MAX)
> >
> > +typedef u8 caam_rng_desc[CAAM_RNG_DESC_LEN];
> Is this really necessary?
>

Will drop in v8.
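
(Coming back to the theoretical throughput figure above, just to spell out
the arithmetic with the values you listed -- SZ = 128 bits, 166 MHz,
ENT_DLY = 3200, SAMP_SIZE = 512:

  Speed = (128 * 166000000) / (3200 * 512)
        = 12968.75 bps
        = ~1621 B/s
        = ~1.583 kB/s (taking 1 kB = 1024 B)

which matches the figure you quoted.)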
> > -static int caam_read(struct hwrng *rng, void *data, size_t max, bool wait)
> > +static int caam_rng_read_one(struct device *jrdev,
> > +			     void *dst, int len,
> > +			     void *desc,
> > +			     struct completion *done)
> [...]
> > +	len = min_t(int, len, CAAM_RNG_MAX_FIFO_STORE_SIZE);
> For the blocking case, i.e. caam_read() -> caam_rng_read_one(),
> "len" is at least 32B - cf. include/linux/hw_random.h:
> * @read:	New API. drivers can fill up to max bytes of data
> *		into the buffer. The buffer is aligned for any type
> *		and max is a multiple of 4 and >= 32 bytes.
>
> For reducing the SW overhead, it might be worth optimizing this path.
> For example, considering
> min_t(int, len, CAAM_RNG_MAX_FIFO_STORE_SIZE) = CAAM_RNG_MAX_FIFO_STORE_SIZE

This isn't true until the next commit. Here CAAM_RNG_MAX_FIFO_STORE_SIZE
is still 32K, so "len" is going to be smaller. I'll update the code to
assume a fixed length once CAAM_RNG_MAX_FIFO_STORE_SIZE becomes 16 bytes.

> this means the length is fixed, thus the ctx->desc[DESC_SYNC] descriptor is
> also fixed and its generation could be moved out of the hot path.

That descriptor also includes the DMA address of the buffer we are
given, so the descriptor is never fixed.

Thanks,
Andrey Smirnov
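
P.S. To illustrate the last point, the per-request descriptor is roughly of
the following shape -- this is only a sketch assuming the usual
desc_constr.h helpers, not the exact code from the patch. The destination
DMA address is part of the FIFO STORE command, which is why the descriptor
cannot be pre-built once and then reused for arbitrary buffers:

#include "desc_constr.h"	/* init_job_desc(), append_operation(), append_fifo_store() */

/* HEADER + ALGORITHM OPERATION + FIFO STORE, cf. CAAM_RNG_DESC_LEN above */
static void *rng_desc_sketch(u32 *desc, dma_addr_t dst_dma, int len)
{
	init_job_desc(desc, 0);			/* HEADER */

	/* ALGORITHM OPERATION: generate random bytes with the RNG block */
	append_operation(desc, OP_ALG_ALGSEL_RNG | OP_TYPE_CLASS1_ALG);

	/* FIFO STORE: write them out to dst_dma -- buffer-specific */
	append_fifo_store(desc, dst_dma, len, FIFOST_TYPE_RNGSTORE);

	return desc;
}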