On Thu, Jun 23, 2016 at 06:40:41PM -0700, Megha Dey wrote:
> From: Megha Dey <megha.dey@xxxxxxxxxxxxxxx>
>
> In this patch series, we introduce the multi-buffer crypto algorithm on
> x86_64 and apply it to SHA256 hash computation. The multi-buffer technique
> takes advantage of the 8 data lanes in the AVX2 registers and allows
> computation to be performed on data from multiple jobs in parallel.
> This allows us to parallelize computations when data inter-dependency in
> a single crypto job prevents us from fully parallelizing our computations.
> The algorithm can be extended to other hashing and encryption schemes
> in the future.
>
> On multi-buffer SHA256 computation with AVX2, we see a throughput increase
> of up to 2.2x over the existing x86_64 single-buffer AVX2 algorithm.
>
> The multi-buffer crypto algorithm is described in the following paper:
> Processing Multiple Buffers in Parallel to Increase Performance on
> Intel® Architecture Processors
> http://www.intel.com/content/www/us/en/communications/communications-ia-multi-buffer-paper.html
>
> The outline of the algorithm is sketched below:
> Any driver requesting the crypto service will place an async
> crypto request on the workqueue. The multi-buffer crypto daemon will
> pull requests from the workqueue and put each request in an empty data
> lane for multi-buffer crypto computation. When all the empty lanes are
> filled, computation will commence on the jobs in parallel, and the job
> with the shortest remaining buffer will be completed and returned. To
> prevent a prolonged stall when there are no new jobs arriving, we will
> flush a crypto job if it has not been completed after a maximum
> allowable delay.
>
> To accommodate the fragmented nature of scatter-gather, we will keep
> submitting the next scatter-buffer fragment for a job for multi-buffer
> computation until a job is completed and no more buffer fragments remain.
> At that time we will pull a new job to fill the now empty data slot.
> We call a get_completed_job function to check whether other jobs have
> been completed when no new job has arrived, to prevent extraneous delay
> in returning any completed jobs.
>
> The multi-buffer algorithm should be used for cases where crypto job
> submissions arrive at a reasonably high rate. For a low crypto job
> submission rate, this algorithm will not be beneficial. The reason is
> that at a low rate we do not fill the data lanes before the maximum
> allowable latency expires, so we end up flushing the jobs instead of
> processing them with all the data lanes full. We miss the benefit of
> parallel computation and add delay to the processing of the crypto job
> at the same time. Some tuning of the maximum latency parameter may be
> needed to get the best performance.
>
> Note that in the tcrypt SHA256 speed test, we wait for a previous job
> to be completed before submitting a new job. Hence this is not a valid
> test for the multi-buffer algorithm, as it requires multiple outstanding
> jobs to be submitted to fill all the data lanes to be effective
> (i.e. 8 outstanding jobs for the AVX2 case). An updated version of the
> tcrypt test is also included, which contains a more appropriate test
> for this scenario.
>
> As this is the first algorithm in the kernel's crypto library
> for which we have tried to use multi-buffer optimizations, feedback
> and testing will be much appreciated.

All applied.  Thanks.
-- 
Email: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
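
Illustrative note (not part of the patch series): the lane scheduling
described in the quoted cover letter can be modelled with a small toy
simulation. This is only a sketch under stated assumptions. The class
LaneManager, its methods submit() and flush(), and the "blocks left"
bookkeeping are hypothetical names invented for illustration; only the
name get_completed_job comes from the cover letter, and the real kernel
code operates on async hash requests and AVX2 registers, not Python
lists. No hashing is performed; the model only counts parallel rounds.

```python
from collections import deque

NUM_LANES = 8  # AVX2 case from the cover letter: 8 data lanes


class LaneManager:
    """Toy model of multi-buffer lane scheduling (hypothetical, no real hashing)."""

    def __init__(self, num_lanes=NUM_LANES):
        # Each slot holds [job_id, blocks_left], or None when the lane is empty.
        self.lanes = [None] * num_lanes
        self.completed = deque()  # finished job ids not yet handed back
        self.rounds = 0           # parallel block rounds executed so far

    def _run_until_one_finishes(self):
        # All occupied lanes advance in lockstep by the shortest remaining
        # length, so the job(s) with the shortest remaining buffer complete.
        active = [lane for lane in self.lanes if lane is not None]
        step = min(blocks_left for _, blocks_left in active)
        self.rounds += step
        for i, lane in enumerate(self.lanes):
            if lane is None:
                continue
            lane[1] -= step
            if lane[1] == 0:
                self.completed.append(lane[0])
                self.lanes[i] = None  # lane is free for the next job

    def get_completed_job(self):
        # Return one already-finished job id, or None if there is none.
        return self.completed.popleft() if self.completed else None

    def submit(self, job_id, num_blocks):
        # Put the job in an empty lane; compute only once every lane is full.
        free = self.lanes.index(None)
        self.lanes[free] = [job_id, num_blocks]
        if None not in self.lanes:
            self._run_until_one_finishes()
        return self.get_completed_job()

    def flush(self):
        # Models the maximum-delay flush: make progress with partly empty lanes.
        if self.completed:
            return self.completed.popleft()
        if all(lane is None for lane in self.lanes):
            return None
        self._run_until_one_finishes()
        return self.get_completed_job()
```

In this model a caller would invoke submit() for each arriving job and
call flush() when no new job has arrived within the allowed delay. With
few outstanding jobs, most progress happens through flush() with lanes
left empty, which mirrors the low-submission-rate caveat in the cover
letter.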