On Wed, Feb 12, 2020 at 5:20 AM Horia Geanta <horia.geanta@xxxxxxx> wrote:
>
> On 1/27/2020 6:57 PM, Andrey Smirnov wrote:
> > Rework CAAM RNG implementation as follows:
> >
> > - Make use of the fact that HWRNG supports partial reads and will
> >   handle such cases gracefully by removing recursion in caam_read()
> >
> > - Convert blocking caam_read() codepath to do a single blocking job
> >   read directly into the requested buffer, bypassing any intermediary
> >   buffers
> >
> > - Convert async caam_read() codepath into a simple single
> >   reader/single writer FIFO use-case, thus simplifying concurrency
> >   handling and delegating buffer read/write position management to
> >   the KFIFO subsystem.
> >
> > - Leverage the same low level RNG data extraction code for both async
> >   and blocking caam_read() scenarios, get rid of the shared job
> >   descriptor and make the non-shared one as simple as possible (just
> >   HEADER + ALGORITHM OPERATION + FIFO STORE)
> >
> > - Split private context from DMA related memory, so that the former
> >   could be allocated without GFP_DMA.
> >
> > NOTE: At face value this commit decreases the throughput numbers
> > reported by
> >
> >   dd if=/dev/hwrng of=/dev/null bs=1 count=100K [iflag=nonblock]
> >
> > by about 15%, however commits that enable prediction resistance and
> Running dd as mentioned above, on an i.MX8MM board I see:
> ~ 20% decrease in the non-blocking case (525 kB/s vs. 662 kB/s)
> ~ 75% decrease in the blocking case (170 kB/s vs. 657 kB/s)
>
> bs=1 is a bit drastic.
> Using bs=16 the numbers look better in terms of overall speed,
> however the relative degradation is still there:
> ~ 66% decrease in the blocking case (3.5 MB/s vs. 10.1 MB/s)
>
> > limit JR total size impact the performance so much and move the
> > bottleneck such as to make this regression irrelevant.
> >
> Yes, performance is greatly impacted by moving from a DRBG configuration
> to a TRNG one.
>
> The speed that I get with this patch set (1.3 kB/s)
> is ~ 20% lower than the theoretical output (1.583 kB/s) (see below).
> Seeing this and also the relative decrease in the DRBG case
> makes me wonder whether the SW overhead could be lowered.
>
> Theoretical TRNG output speed in this configuration
> can be computed as:
> Speed = (SZ x CAAM_CLK_FREQ) / (RTSDCTL[ENT_DLY] x RTSDCTL[SAMP_SIZE]) [bps]
>
> SZ is the size of the sample taken from the DRBG between two consecutive
> reseedings. As previously discussed, this is limited to 128 bits (16 bytes),
> such that the DRBG behaves as a TRNG.
>
> If:
> - CAAM_CLK_FREQ = 166 MHz (as for i.MX8M*)
> - RTSDCTL[ENT_DLY] = 3200 clocks (default / POR value)
> - RTSDCTL[SAMP_SIZE] = 512 (recommended; default / POR value is 2500)
> then the theoretical speed is 1.583 kB/s.
>
> > @@ -45,38 +22,34 @@
> >  #include "jr.h"
> >  #include "error.h"
> >
> > +/* length of descriptors */
> This comment is misplaced; the length of descriptors (CAAM_RNG_DESC_LEN)
> is defined further below.
>
> > +#define CAAM_RNG_MAX_FIFO_STORE_SIZE U16_MAX
> > +
> > +#define CAAM_RNG_FIFO_LEN SZ_32K /* Must be a multiple of 2 */
> > +
> >  /*
> > - * Maximum buffer size: maximum number of random, cache-aligned bytes that
> > - * will be generated and moved to seq out ptr (extlen not allowed)
> > + * See caam_init_desc()
> >   */
> > -#define RN_BUF_SIZE		(0xffff / L1_CACHE_BYTES * \
> > -				 L1_CACHE_BYTES)
> > +#define CAAM_RNG_DESC_LEN (CAAM_CMD_SZ + \
> > +			   CAAM_CMD_SZ + \
> > +			   CAAM_CMD_SZ + CAAM_PTR_SZ_MAX)
> >
> > +typedef u8 caam_rng_desc[CAAM_RNG_DESC_LEN];
> Is this really necessary?
>

Will drop in v8.
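
(Coming back to the theoretical throughput figure above, just to spell out
the arithmetic with the values you listed -- SZ = 128 bits, 166 MHz,
ENT_DLY = 3200, SAMP_SIZE = 512:

  Speed = (128 * 166000000) / (3200 * 512)
        = 12968.75 bps
        = ~1621 B/s
        = ~1.583 kB/s (taking 1 kB = 1024 B)

which matches the figure you quoted.)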
> > -static int caam_read(struct hwrng *rng, void *data, size_t max, bool wait)
> > +static int caam_rng_read_one(struct device *jrdev,
> > +			     void *dst, int len,
> > +			     void *desc,
> > +			     struct completion *done)
> [...]
> > +	len = min_t(int, len, CAAM_RNG_MAX_FIFO_STORE_SIZE);
> For the blocking case, i.e. caam_read() -> caam_rng_read_one(),
> "len" is at least 32B - cf. include/linux/hw_random.h:
> * @read:	New API. drivers can fill up to max bytes of data
> *		into the buffer. The buffer is aligned for any type
> *		and max is a multiple of 4 and >= 32 bytes.
>
> For reducing the SW overhead, it might be worth optimizing this path.
> For example, considering
> min_t(int, len, CAAM_RNG_MAX_FIFO_STORE_SIZE) = CAAM_RNG_MAX_FIFO_STORE_SIZE

This isn't true until the next commit. Here CAAM_RNG_MAX_FIFO_STORE_SIZE
is still 32K, so "len" is going to be smaller. I'll update the code to
assume a fixed length once CAAM_RNG_MAX_FIFO_STORE_SIZE becomes 16 bytes.

> this means the length is fixed, thus the ctx->desc[DESC_SYNC] descriptor is
> also fixed and its generation could be moved out of the hot path.

That descriptor also includes the DMA address of the buffer we are
given, so the descriptor is never fixed.

Thanks,
Andrey Smirnov
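
P.S. To illustrate the last point, the per-request descriptor is roughly of
the following shape -- this is only a sketch assuming the usual
desc_constr.h helpers, not the exact code from the patch. The destination
DMA address is part of the FIFO STORE command, which is why the descriptor
cannot be pre-built once and then reused for arbitrary buffers:

#include "desc_constr.h"	/* init_job_desc(), append_operation(), append_fifo_store() */

/* HEADER + ALGORITHM OPERATION + FIFO STORE, cf. CAAM_RNG_DESC_LEN above */
static void *rng_desc_sketch(u32 *desc, dma_addr_t dst_dma, int len)
{
	init_job_desc(desc, 0);			/* HEADER */

	/* ALGORITHM OPERATION: generate random bytes with the RNG block */
	append_operation(desc, OP_ALG_ALGSEL_RNG | OP_TYPE_CLASS1_ALG);

	/* FIFO STORE: write them out to dst_dma -- buffer-specific */
	append_fifo_store(desc, dst_dma, len, FIFOST_TYPE_RNGSTORE);

	return desc;
}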