Re: Fwd: crypto accelerator driver problems

On Tue, Oct 4, 2011 at 11:27 AM, Steffen Klassert
<steffen.klassert@xxxxxxxxxxx> wrote:
>
> On Sat, Oct 01, 2011 at 12:38:19PM +0330, Hamid Nassiby wrote:
> >
> > And my_cbc_encrypt function as PSEUDO/real code (for simplicity of
> > representation) is as:
> >
> > static int
> > my_cbc_encrypt(struct blkcipher_desc *desc,
> >                 struct scatterlist *dst, struct scatterlist *src,
> >                 unsigned int nbytes)
> > {
> >               SOME__common_preparation_and_initializations;
> >
> >               spin_lock_irqsave(&mylock, myflags);
> >               send_request_to_device(&dev); /*sends request to device. After
> >                                           processing request,device writes
> >                                           result to destination*/
> >               while(!readl(complete_flag)); /*here we wait for a flag in
> >                         device register space indicating completion. */
> >               spin_unlock_irqrestore(&mylock, myflags);
> >
> >
> > }
>
> As I told you already in the private mail, it makes not too much sense
> to parallelize the crypto layer and to hold a global lock during the
> crypto operation. So if you really need this lock, you are much better
> off without a parallelization.
>
Hi Steffen,
Thanks for your reply :).

It makes sense in two cases:

1. When the time to transmit a request to the device is much shorter than the
time the device spends processing it, and the device has more than one
processing engine.

2. It can also help when the device has only one processing engine but multiple
blkcipher requests are pending at its entrance port, because the delay between
successive request entries into the device becomes shorter. The overall benefit
is that our IPSec throughput gets closer to our device's bulk encryption
throughput. (It is interesting to note that with our current driver and device
configuration, if I test gateway throughput with traffic belonging to two SAs,
traveling over the one link that connects them, I get a rate of about 280 Mbps,
an 80 Mbps increase over one SA's traffic, while our device's bulk processing
rate is about 400 Mbps.)

Currently we want to take advantage of the latter case and then extend it.

>
>
>
> >
> > With above code, I can successfully test IPSec gateway equipped with our
> > hardware and get a 200Mbps throughput using Iperf. Now I am facing with another
> > poblem. As I mentioned earlier, our hardware has 4 aes engines builtin. With
> > above code I only utilize one of them.
> > >From this point, we want to go a step further and utilize more than one aes
> > engines of our device. Simplest solution appears to me is to deploy
> > pcrypt/padata, made by Steffen Klassert. First instantiate in a dual
> > core gateway :
> >       modprobe tcrypt alg="pcrypt(authenc(hmac(md5),cbc(aes)))" type=3
> >  and test again. Running Iperf now gives me a very low
> > throughput about 20Mbps while dmesg shows the following:
> >
> >    BUG: workqueue leaked lock or atomic: kworker/0:1/0x00000001/10
> >        last function: padata_parallel_worker+0x0/0x80
>
> This looks like the parallel worker exited in atomic context,
> but I can't tell you much more as long as you don't show us your code.

OK, I presented the code as PSEUDO just to simplify and concentrate on the
problem's aspects ;) (though it is also possible that I concentrated it in the
wrong way :D).
This is the real my_cbc_encrypt code and the functions it calls, bottom-up:

int write_request(u8 *buff, unsigned int count)
{
	u32 tlp_size = 32;
	struct my_dma_desc *desc_table = (struct my_dma_desc *)global_bar[0];

	tlp_size = (count / 128) | (tlp_size << 16);
	memcpy(g_mydev->rdmaBuf_va, buff, count);
	wmb();

	writel(cpu_to_le32(tlp_size), &desc_table->wdmaperf);
	wmb();

	while ((readl(&desc_table->ddmacr) | 0xFFFF0000) != 0xFFFF0101)
		; /* wait for transfer completion */
	return 0;
}

int my_transform(struct my_aes_op *op, int alg)
{
	int req_len, err;
	unsigned long tflag;
	u8 *req_buf = NULL;
	alg_operation operation;

	if (op->len == 0)
		return 0;
	operation = !(op->dir);

	create_request(alg, op->mode, operation, 0, op->key,
		       op->iv, op->src, op->len, &req_buf, &req_len); /* add
		       header to the original request and copy it to req_buf */

	spin_lock_irqsave(&glock, tflag);

	err = write_request(req_buf, req_len); /* req_buf is sent to the
			device; the device en/decrypts the request and writes
			the result to a fixed DMA-mapped address */
	if (err) {
		printk(KERN_EMERG "Error WriteRequest: errcode=%d\n", err);
		/* handle exception (never occurred) */
	}
	kfree(req_buf);
	req_buf = NULL;

	memcpy(op->dst, g_mydev->wdmaBuf_va, op->len); /* copy the result from
			fixed coherent DMA-mapped memory to the destination */
	spin_unlock_irqrestore(&glock, tflag);

	return op->len;
}

static int
my_cbc_encrypt(struct blkcipher_desc *desc,
		  struct scatterlist *dst, struct scatterlist *src,
		  unsigned int nbytes)
{
	struct my_aes_op *op = crypto_blkcipher_ctx(desc->tfm);
	struct blkcipher_walk walk;
	int err, ret;
	unsigned long c2flag;
	if (unlikely(op->keylen != AES_KEYSIZE_128))
		return fallback_blk_enc(desc, dst, src, nbytes);


	blkcipher_walk_init(&walk, dst, src, nbytes);
	err = blkcipher_walk_virt(desc, &walk);
	op->iv = walk.iv;

	while((nbytes = walk.nbytes)) {

		op->src = walk.src.virt.addr;
		op->dst = walk.dst.virt.addr;
		op->mode = AES_MODE_CBC;
		op->len = nbytes /*- (nbytes % AES_MIN_BLOCK_SIZE)*/;
		op->dir = AES_DIR_ENCRYPT;
		ret = my_transform(op, 0);
		nbytes -= ret;
		err = blkcipher_walk_done(desc, &walk, nbytes);
	}

	return err;
}
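Since the stated goal is to utilize all 4 AES engines, one direction (a sketch
under assumptions, not our implementation) is to replace the single glock with
one lock per engine plus round-robin engine selection, so up to 4 requests can
be in flight at once. Userspace illustration with pthreads standing in for
spinlocks; submit_to_engine and the counters are hypothetical names:

```c
#include <pthread.h>
#include <stdatomic.h>

#define NUM_ENGINES 4

/* One lock per hardware engine instead of a single global lock, so up
 * to NUM_ENGINES requests can be in flight simultaneously. */
static pthread_mutex_t engine_lock[NUM_ENGINES] = {
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
};
static atomic_uint next_engine;			/* round-robin cursor */
static unsigned long submissions[NUM_ENGINES];	/* per-engine counters */

/* Pick an engine round-robin, then submit under that engine's lock
 * only.  `send` stands in for the device-specific submit-and-wait. */
static int submit_to_engine(void (*send)(int engine))
{
	int engine = atomic_fetch_add(&next_engine, 1) % NUM_ENGINES;

	pthread_mutex_lock(&engine_lock[engine]);
	submissions[engine]++;
	if (send)
		send(engine);
	pthread_mutex_unlock(&engine_lock[engine]);
	return engine;
}
```

In the driver the per-engine lock would be a spinlock_t and the engine choice
would have to match however the device addresses its engines; the point of the
sketch is only that requests to different engines need not serialize on one
global lock.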

>
> >
> > I must emphasize again that goal of deploying pcrypt/padata is to have more than
> > one request present in our hardware (e.g. in a quad cpu system we'll have 4
> > encryption and 4 decryption requests sent into our hardware). Also I tried using
> > pcrypt/padata in a single cpu system with one change in pcrypt_init_padata
> > function of pcrypt.c: passing 4 as max_active parameter of alloc_workqueue.
> > In fact I called alloc_workqueue as:
> >
> > alloc_workqueue(name, WQ_MEM_RECLAIM | WQ_CPU_INTENSIVE, 4);
>
> This does not make sense. max_active has to be 1 as we have to care about the
> order of the work items, so we don't want to have more than one work item
> executing at the same time per CPU. And as we run the parallel workers with BHs
> off, it is not even possible to execute more than one work item at the same
> time per CPU.
>

Did you turn BHs off to prevent deadlocks between your workqueues and the
network's softirqs?
If there is anything else that would help, I would be pleased to hear it.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

