Re: Re[2]: Re-writing the 2.6.11.8 Kernel IPsec stack for hardware crypto offload

Evgeniy Polyakov <johnpol@xxxxxxxxxxx> · Fri, 20 May 2005 13:15:20 +0400

On Fri, 2005-05-20 at 09:49 +0100, Dan Searle wrote:
> Hi,
> 
> Please see my comments inline below...
> 
> Thursday, May 19, 2005, 3:12:24 PM, you wrote:
> 
> > On Thu, 2005-05-19 at 14:47 +0100, Dan Searle wrote:
> >> Hi,
> >> 
> >> The problem with the existing IPsec stack and using OCF is that, for
> >> some reason, a scatter gather list is used to split the packet data
> >> into cypher block size chunks before sending them to the crypto API.
> >> This is very bad for the IXP425 hardware crypto accelerator which
> >> works much faster if you send it the entire plain text in one go
> >> rather than dispatching the cypher block size chunks each in a
> >> separate crypto context.
> >> 
> >> For instance, I wrote a hardware crypto module which integrated with
> >> OCF and the IPsec stack, but because the chunks of plain text were
> >> dispatched to the hardware crypto engine in small (cypher block size)
> >> chunks, the throughput was only about 1Mbit/sec!!!!
> >> 
> >> I have recently thrown out the part of the IPsec stack which splits
> >> the payloads using the scatter gather kernel functions, so that it
> >> dispatches the entire plain text payload in one chunk. I.e. I dispatch
> >> the entire packet to the IXP425 in one go. Using this method I'm now
> >> getting about 20Mbit/sec throughput.
> 
> > Synchronous crypto stack splits original data to the small chunks,
> > unfortunately.
> > Although Herbert Xu moved away from it in the latest patches for crypto
> > tree,
> > but it is usefull for "synchronous" crypto devices like VIA/freescale
> > processors.
> > You definitely should not adopt your driver for IXP to the synchronous
> > API,
> > concider using OCF or acrypto directly from esp/ah output functions
> > [actually I plan to release some code for it this weekend for acrypto
> > and ESP].
> 
> I want to avoid adding the extra layer of complexity that OCF brings,
> I want to call the IXP400 access libs crypto API directly from the
> IPsec stack.

It's your decision, although it has some problems - it can not be used
with outher crypto devices and/or SW implemented algos.

> We already have the ESP output functions calling the IXP400 crypto API
> directly, however, the problem is that... By the time the ESP output
> function returns, doesn't the IPsec stack need the SKB to be finished
> with? I.e. I thought that we had to "busy wait" in a loop within the
> ESP output function until the crpyto API calls us back with the cypher
> text, so that, by the time we exit the ESP output function, we have a
> crypted SKB.
> 
> Perhapse I'm missing something. Can we exit from the ESP output
> function before the SKB is actually crypted? How can we lock a

Yes we can.

> particular SKB until we are called back by the IXP400 crypto API
> telling us the crypto is complete?

I will describe it inline below.

> >> This is still far from optimal, because of the context I'm in and the
> >> way I'm performing the crypto dispatch, it means I have to sit the
> >> kernel in a busy loop until the IXP425 calls me back with the cypher
> >> text.
> >> 
> >> What I'm now trying to do is split the stack somehow, so that one
> >> thread puts plain text skb's into the IXP425 dispatcher, and a new
> >> thread sits on the other side and pulls the encrypted skb's out when
> >> they are ready. Eliminating the need for a busy loop in a single
> >> thread. Although, I'm not at all sure how to do this.
> 
> > You do not need to busywait until your crypto driver finishes the work.
> > Concider link I sent before: there is dst entry tricky split
> > which allows [thanks to it's stackability] asynchronous processing.
> > I.e.
> > skb_clone();
> > setup crypto
> > return 0; // here network stack thinks that skb is processed and queued
> > to be sent to devices xmit fucntion.
> 
> > but since we cloned skb, we may return to it's processing later, 
> > for example from crypto finish callback.
> 
> > I.e.
> > callback()
> > {
> >   skb = some_priv_data;
> 
> >   setup_new_dst_entry;
> >   dst_output(skb);
> > }
> 
> Oops, looks like I should learn to read all of an email before
> replying to one! However, I've just looked at the ESP output function
> again, and I don't see it calling dst_output(). Perhaps I'm being
> very dim and missed something here, but I can see exactly what you
> mean, thanks for the pointer!
> 
> > It has disadvantage that there is no ability to inform original
> > caller (xfrm) that skb was not processed due to some error.
> 
> This is not likely to happen unless we run into an OOM error, in which
> case a few dropped frames are not your biggest problem anyway!

Here is description of asynchronous IPsec processing.
This patch I sent couple of weeks ago to netdev@ with benchmark.
Links were provided in previous e-mails.


diff -ru ../linux-2.6-orig/net/ipv4/esp4.c ./net/ipv4/esp4.c

--- ../linux-2.6-orig/net/ipv4/esp4.c	2005-04-25 15:41:39.000000000 +0400
+++ ./net/ipv4/esp4.c	2005-04-29 14:34:10.000000000 +0400
@@ -7,6 +7,7 @@
 #include <linux/crypto.h>
 #include <linux/pfkeyv2.h>
 #include <linux/random.h>
+#include <linux/timer.h>
 #include <net/icmp.h>
 #include <net/udp.h>
 
@@ -17,6 +18,95 @@
 	__u8		proto;
 };
 
+static int esp_output(struct xfrm_state *x, struct sk_buff *skb);
+
+struct esp_async {
+	struct timer_list	tm;
+	struct sk_buff		*skb;
+	struct xfrm_state	*x;
+	struct dst_entry	*dst;
+};
+
+static void esp4_callback(unsigned long data)
+{
+	struct esp_async *ea = (struct esp_async *)data;
+	struct sk_buff *skb = ea->skb; 
+	struct dst_entry *dst = ea->dst;
+	struct xfrm_state *x = ea->x;
+	int err;
+
+	printk("%s: skb=%p, skb->users=%d.\n", __func__, skb, atomic_read(&skb->users));
+	printk("%s: dst=%p, skb->dst=%p.\n", __func__, dst, skb->dst);
+	printk("%s: xfrm=%p, skb->dst->xfrm=%p.\n", __func__, x, (skb->dst)?skb->dst->xfrm:NULL);
+
+	spin_lock_bh(&x->lock);
+	err = esp_output(x, skb);
+	spin_unlock_bh(&x->lock);
+
+	printk("%s: Data has been processed: err=%d.\n", __func__, err);
+	
+	if (err)
+		goto err_out;
+
+	skb->dst = dst_pop(dst);
+	printk("%s: pop has been finished: skb->dst=%p, dst=%p, skb->users=%d.\n", 
+			__func__, skb->dst, dst, atomic_read(&skb->users));
+	if (!skb->dst)
+		goto err_out;
+	
+	dst_output(skb);
+
+out:
+	kfree(ea);
+	return;
+
+err_out:
+	kfree_skb(skb);
+	goto out;
+}

This is a callback - it process' skb and sets new skb->dst entry and calls dst_output() - 
new dst entry is set to one pointed to ip_ouput(), and dst_output()
will call it. This is exact behaviour of network stack without interruption.
This callback can be called from acrypto/OCF/your driver with encrypted skb->data
and you only need to setup dst entries and call dst_output().

+static int esp_output_async(struct xfrm_state *x, struct sk_buff *skb)
+{
+	struct esp_async *ea;
+	struct dst_entry *child;
+
+	printk("%s: enter. Child list: ", __func__);
+	for (child = skb->dst; child; child = child->child)
+		printk("%p [%s] [%d] -> ", child, child->dev->name, atomic_read(&child->__refcnt));
+	printk("\n");
+	
+	ea = kmalloc(sizeof(*ea), GFP_ATOMIC);
+	if (!ea)
+		return -ENOMEM;
+
+	memset(ea, 0, sizeof(*ea));
+
+	skb = skb_clone(skb, GFP_ATOMIC);
+	if (!skb)
+		return -ENOMEM;
+	dst_hold(skb->dst);
+	
+	ea->skb = skb;
+	ea->x = x;
+	ea->dst = skb->dst;
+
+	printk("%s: x=%p, skb=%p, skb->dst=%p, skb->dst->xfrm=%p.\n",
+			__func__, x, skb, skb->dst, skb->dst->xfrm);
+
+	init_timer(&ea->tm);
+	ea->tm.function = &esp4_callback;
+	ea->tm.data = (unsigned long)ea;
+	ea->tm.expires = jiffies;
+
+	add_timer(&ea->tm);
+
+	printk("%s: timer added: skb->users=%d.\n", __func__, atomic_read(&skb->users));
+
+	return 0;
+
+}

This is an method which will be called from xfrm->output method.
xfrm->output() is called when (for example)tcp_sendmsg() ends up in dst_output().
Next in dst_entry stack is ip_output(), which returns 0 - that indicate
that skb is queued to be sent to NIC driver.

Normal (synchronous) xfrm output method must return NET_XMIT_BYPASS, 
it will indicate that this dst entry has been processed, and it's time
to call next one in dst stack [ip_output()].

What I do here is just faking the stack and do not allow dst_output()
to call ip_output() after xfrm->output() is called.
Since I clone skb, it's data will not be freed by stack.
So in a timer callback [what you want here is to setup either acrypto, or OCF,
or your own IXP processing engine] this skb will be reinjected into the stack.

+
+
 static int esp_output(struct xfrm_state *x, struct sk_buff *skb)
 {
 	int err;
@@ -465,7 +555,7 @@
 	.get_max_size	= esp4_get_max_size,
 	.input		= esp_input,
 	.post_input	= esp_post_input,
-	.output		= esp_output
+	.output		= esp_output_async
 };
 
 static struct net_protocol esp4_protocol = {
diff -ru ../linux-2.6-orig/net/ipv4/xfrm4_output.c ./net/ipv4/xfrm4_output.c
--- ../linux-2.6-orig/net/ipv4/xfrm4_output.c	2005-04-25 15:41:40.000000000 +0400
+++ ./net/ipv4/xfrm4_output.c	2005-04-29 12:13:41.000000000 +0400
@@ -124,12 +124,6 @@
 	x->curlft.packets++;
 
 	spin_unlock_bh(&x->lock);
-	
-	if (!(skb->dst = dst_pop(dst))) {
-		err = -EHOSTUNREACH;
-		goto error_nolock;
-	}
-	err = NET_XMIT_BYPASS;
 
 out_exit:
 	return err;

Here is a trick - since we return 0 here(value from esp_output_async())
instead of NET_XMIT_BYPASS, network stack will not process next dst entry(ip_output()) 
and will "forget" about this skb.
Later we must either free it or call remained dst entries using dst_output().


> Thanks again, Dan....

Hope this helps.

-- 
        Evgeniy Polyakov

Crash is better than data corruption -- Arthur Grabowski
Attachment:
signature.asc

Description: This is a digitally signed message part