On Mon, Jun 27, 2016 at 10:20:03AM -0700, Megha Dey wrote:
> From: Megha Dey <megha.dey@xxxxxxxxxxxxxxx>
>
> In this patch series, we introduce the multi-buffer crypto algorithm on
> x86_64 and apply it to SHA512 hash computation. The multi-buffer technique
> takes advantage of the 8 data lanes in the AVX2 registers and allows
> computation to be performed on data from multiple jobs in parallel.
> This allows us to parallelize computations when data inter-dependency in
> a single crypto job prevents us from fully parallelizing our computations.
> The algorithm can be extended to other hashing and encryption schemes
> in the future.
>
> On multi-buffer SHA512 computation with AVX2, we see throughput increase
> up to 2x over the existing x86_64 single buffer AVX2 algorithm.
>
> The multi-buffer crypto algorithm is described in the following paper:
> Processing Multiple Buffers in Parallel to Increase Performance on
> Intel® Architecture Processors
> http://www.intel.com/content/www/us/en/communications/communications-ia-multi-buffer-paper.html
>
> The outline of the algorithm is sketched below:
> Any driver requesting the crypto service will place an async
> crypto request on the workqueue. The multi-buffer crypto daemon will
> pull requests from the workqueue and put each request in an empty data lane
> for multi-buffer crypto computation. When all the empty lanes are filled,
> computation will commence on the jobs in parallel and the job with the
> shortest remaining buffer will get completed and be returned. To prevent
> prolonged stall when there are no new jobs arriving, we will flush a crypto
> job if it has not been completed after a maximum allowable delay.
>
> The multi-buffer algorithm necessitates mapping multiple scatter gather
> buffers to linear addresses simultaneously. The crypto daemon may need
> to sleep and yield the cpu to work on something else from time to time.
> We made a change to not use kmap_atomic to do scatter-gather buffer
> mapping and take advantage of the fact that we can directly translate
> the buffer's address to its linear address with x86_64.
> To accommodate the fragmented nature of scatter-gather, we will keep
> submitting the next scatter-buffer fragment for a job for multi-buffer
> computation until a job is completed and no more buffer fragments remain.
> At that time we will pull a new job to fill the now empty data slot.
> We call a get_completed_job function to check whether there are other
> jobs that have been completed when we have no new job arrival,
> to prevent extraneous delay in returning any completed jobs.
>
> The multi-buffer algorithm should be used for cases where crypto job
> submissions are at a reasonably high rate. For low crypto job submission
> rates, this algorithm will not be beneficial. The reason is that at a low
> rate we do not fill out the data lanes before the maximum allowable
> latency, so we will be flushing the jobs instead of processing them with
> all the data lanes full. We will miss the benefit of parallel computation
> and add delay to the processing of the crypto job at the same time.
> Some tuning of the maximum latency parameter may be needed to get the
> best performance.
>
> Also added is a new mode in the tcrypt module to calculate the speed of the
> sha512_mb algorithm.

All applied.  Thanks.
-- 
Email: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
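
For readers following the outline in the quoted cover letter, the lane
scheduling it describes can be modelled in user space roughly as below.
This is only an illustrative Python sketch, not the kernel code from the
series. The names Job, LaneManager, submit and flush are hypothetical;
only get_completed_job is named in the cover letter, and its behaviour
here is an assumption drawn from that description. A job's length is
modelled as a count of remaining blocks, no hashing is performed, and the
maximum-delay timer is left to the caller, who would invoke flush() when
it expires.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    blocks: int  # remaining data blocks; assumed >= 1 at submission


class LaneManager:
    """Toy model of multi-buffer lane scheduling (hypothetical names)."""

    def __init__(self, num_lanes):
        self.num_lanes = num_lanes
        self.lanes = []  # jobs currently occupying a data lane

    def submit(self, job):
        """Place a job in an empty lane.

        Returns None while lanes remain empty. Once every lane is full,
        processes all lanes in lockstep and returns the first job to finish.
        """
        if len(self.lanes) >= self.num_lanes:
            raise RuntimeError("no empty lane; collect a completed job first")
        self.lanes.append(job)
        if len(self.lanes) < self.num_lanes:
            return None
        return self._run_until_one_completes()

    def get_completed_job(self):
        """Return another already-finished job without doing more work."""
        for job in self.lanes:
            if job.blocks == 0:
                self.lanes.remove(job)
                return job
        return None

    def flush(self):
        """Process partially filled lanes, e.g. after the maximum delay."""
        if not self.lanes:
            return None
        return self._run_until_one_completes()

    def _run_until_one_completes(self):
        # All lanes advance together, so the job with the shortest
        # remaining buffer determines how many blocks are processed.
        steps = min(job.blocks for job in self.lanes)
        for job in self.lanes:
            job.blocks -= steps
        done = next(job for job in self.lanes if job.blocks == 0)
        self.lanes.remove(done)
        return done
```

With four lanes (an arbitrary count chosen for illustration) and jobs of
5, 3, 3 and 7 blocks, the fourth submit() returns one of the 3-block jobs,
get_completed_job() then returns the other without further computation,
and the two longer jobs only come back through flush(). That last case is
the low-submission-rate situation the cover letter warns about: lanes run
partly empty and the jobs also pay the flush delay.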