At Thu, 9 Nov 2006 15:22:54 +0100, "Marco Costa" <costa.marco@xxxxxxxxx> wrote:
>
> Hi!
>
> On 11/9/06, Gerald Hopf <dm-crypt@xxxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > I am currently using AES-128 encryption and am able to get about 31MB/s
> > sustained performance on my single-core Athlon 64 3000+ (when copying
> > from the encrypted volume to a fast non-encrypted volume).
> >
> > Since 31MB/s is well below the limits of gigabit ethernet, I would like
> > to get even more performance :-)
>
> It seems that you are already at the hardware limit of the harddrive.
> The overhead of encrypting your filesystem is very tiny.
> I may be wrong, though. ;-)

That's completely wrong. You are not even close to the hardware limit:

/dev/hda:
 Timing buffered disk reads:  168 MB in  3.03 seconds = 55.48 MB/sec

(regular 7200rpm disk)

Even worse, you are not even close to the processor's encryption speed
limit:

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128 cbc     76228.13k    78740.37k    81087.48k    80974.17k    81265.19k

I happen to have the same processor as Gerald, and 81 MB/s is the
encryption speed for AES with a 128-bit key in CBC mode.

The overall problem with dm-crypt (and cryptoloop and all the stuff
before it) is that decryption is done by a separate worker thread. The
reason for that is that reading needs post-processing, namely
decryption, which cannot be done in the interrupt handler that is
called when the read finishes. Hence, the work is offloaded to a
worker thread, and that worker thread, when finished with decryption,
calls the I/O completion handler that usually wakes up the calling
process. This additional scheduling seems to slow down the whole
process enough that the performance is horrible. For writing, you
should get full speed, because writing does not need any
post-processing. This grave difference can be seen in most of the
performance benchmarks posted for dm-crypt.

I have spent a few quarters of an hour thinking about potential
solutions.
The overall goal is clear: do the post-processing in the process that
has caused the I/O, so the context switch can be avoided. My idea is:

1) Add a CAN_POSTPROCESS flag to I/O requests, as well as a
postprocess function pointer, initialized to a function that just does
nothing.

2) Whenever dm-crypt sees a request that has the CAN_POSTPROCESS flag
on it, it does not enqueue the work for the worker thread, but instead
modifies the postprocessing handler to point to its own decryption
routine. The decryption routine would of course call the original
postprocessing handler when finished.

3) Profile the kernel code for hot spots in terms of I/O request
submission. Modify this code from

  do_io_and_sleep(request);

to

  request->flags |= CAN_POSTPROCESS;
  do_io_and_sleep(request);
  request->postprocess(..);

That effectively offloads the work to the processes causing the I/O.
It also avoids the additional scheduling and work queuing in dm-crypt.
Flagging the capabilities of the caller allows a seamless transition
to the new model.

I'm posting this idea in a totally premature state. I know that, but I
have no intention to implement it, as I'm done with kernel hacking.
But if someone is eager to work out the details, here is at least an
inspiration for how it could possibly work.

To answer the original question: dual-core will not improve your
dm-crypt performance (at least for now), because there is only one
kcryptd thread per device. This might change with the workqueue stuff
that is about to be merged; I'm not following that development.
--
Fruhwirth Clemens - http://clemens.endorphin.org
for robots: sp4mtrap@xxxxxxxxxxxxx

---------------------------------------------------------------------
dm-crypt mailing list - http://www.saout.de/misc/dm-crypt/
To unsubscribe, e-mail: dm-crypt-unsubscribe@xxxxxxxx
For additional commands, e-mail: dm-crypt-help@xxxxxxxx