ext3 + dm_crypt performance impact (CentOS 5.4 AMD64)


 



Hello list,
 
I'm currently testing dm_crypt performance on CentOS 5.4 (AMD64, kernel 2.6.21.7). One part of the evaluation is a simple end-to-end disk benchmark that uses the Linux kernel source tree as its data set. The routine is:
 
   0) Create the filesystem on the backend device and mount it
   -- record start time
   1) Extract the kernel source (2.6.37)
   2) Copy the extracted tree (copy 1)
   3) Copy the extracted tree (copy 2)
   4) Copy the extracted tree (copy 3)
   5) Remove copy 1
   6) Remove copy 2
   7) Remove copy 3
   8) Remove the original tree
   9) Call sync
   -- record end time
 
I run this for a plain ext3 filesystem directly on /dev/sdf and for ext3 on top of an encrypted device (using dm_crypt and truecrypt as backends).
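 
For illustration, a driver along these lines reproduces the routine (the mount point and tarball path are placeholder names; the mkfs/mount of step 0 is done outside the script):
 
  #!/usr/bin/env python
  # Sketch of the benchmark driver. Paths and tarball name are placeholders;
  # step 0 (mkfs + mount of the backend) happens before this script runs.
  import subprocess, time

  MNT = "/mnt/test"                       # filesystem under test (plain or dm_crypt backed)
  TARBALL = "/root/linux-2.6.37.tar.bz2"  # kernel source used as data set

  def run(cmd):
      subprocess.check_call(cmd, shell=True)

  start = time.time()
  run("tar xjf %s -C %s" % (TARBALL, MNT))                   # 1) extract kernel
  for i in (1, 2, 3):
      run("cp -a %s/linux-2.6.37 %s/copy%d" % (MNT, MNT, i)) # 2-4) three copies
  for i in (1, 2, 3):
      run("rm -rf %s/copy%d" % (MNT, i))                     # 5-7) remove the copies
  run("rm -rf %s/linux-2.6.37" % MNT)                        # 8) remove the original
  run("sync")                                                # 9) flush dirty data
  print("runtime: %.1f seconds" % (time.time() - start))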
 
I found that when I run this test with an ext3 filesystem created in step (0), performance drops by ~40% with dm_crypt (the runtime is 40% longer). CPU load does not increase; only I/O wait goes up, while overall CPU usage stays roughly the same. To me this looks like a latency issue, because most of the CPU is still idle. Capturing /proc/diskstats before and after the test shows that the I/O time is much longer with dm_crypt.
 
a) /proc/diskstats for one run with the plain disk backend:
  DISK sdf: reads=101 rmerge=0 rsect=808 rtimems=6366 writes=44052 wmerge=372381 wsect=3321608 wtimems=4993178 current=158 iotime=36041 iotimeweighted=5006161
 
b) /proc/diskstats for one run with the dm_crypt backend (dm-1 is the dm_crypt device):
  DISK sdf: reads=110 rmerge=0 rsect=880 rtimems=5084 writes=55169 wmerge=421079 wsect=3809984 wtimems=5275315 current=0 iotime=42127 iotimeweighted=5280393
  DISK dm-1: reads=110 rmerge=0 rsect=880 rtimems=13457 writes=476248 wmerge=0 wsect=3809984 wtimems=1467043364 current=0 iotime=42602 iotimeweighted=1467056820
 
I also see how the disk writes are split up:
 
For a) the average write size is 3281192 / 48726 = 67.33 sectors, i.e. ~34 KB per write.
 
For b) (dm-1 with ext3) the average write size is 3809984 / 476248 = 8 sectors, i.e. 4 KB per write.
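 
These averages are simply wsect / writes from the /proc/diskstats snapshots; for illustration, a small helper along these lines computes them (the device names are examples):
 
  #!/usr/bin/env python
  # Average write size per device from /proc/diskstats.
  # Field layout per line: major minor name reads rmerge rsect rtime_ms
  #                        writes wmerge wsect wtime_ms in_flight io_ticks weighted
  # Values are cumulative since boot; for a per-run figure, snapshot them
  # before and after the test and subtract.

  def avg_write_size(devices=("sdf", "dm-1")):
      with open("/proc/diskstats") as f:
          for line in f:
              fields = line.split()
              name = fields[2]
              if name not in devices:
                  continue
              writes = int(fields[7])   # completed write requests
              wsect = int(fields[9])    # sectors written (512 bytes each)
              if writes:
                  print("%s: %d sectors/write = %.1f KB/write"
                        % (name, wsect // writes, wsect / float(writes) * 512 / 1024.0))

  if __name__ == "__main__":
      avg_write_size()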
 
Because this is cached I/O (buffer cache), the dm requests are processed in units of 4 KB (this appears to be implementation specific). These small requests are then merged again in the scheduler for the /dev/sdf backend device. I would not expect this to be such a big issue: the system can do ~80 MB/s of encryption/decryption and the backend does ~50 MB/s of writes, so I was expecting a performance impact of roughly 10 percent. However, it turns out to be much more (40%).
 
When I run the same test with ext2, I see an average request size of 22 sectors (~11 KB) to the dm device and merges on the backend (46 vs. 42 seconds, ~10% performance impact).
With XFS it is the same (95 seconds instead of 87, ~10% impact, average I/O size 34 sectors). Only ext3 is fixed at 4 KB requests.
 
I also measured the latency impact of dm_crypt using a dm_zero backend. It adds roughly 0.1 ms per I/O for small requests (4/8 KB) and up to 10 ms for 1 MB I/Os. It also looks like dm_crypt only scales to one CPU core per device.
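 
For illustration, a probe along these lines measures that kind of per-request latency with O_DIRECT writes (the device name and the setup commands in the comments are placeholder examples, not necessarily the exact setup used here):
 
  #!/usr/bin/env python
  # Rough per-request write latency probe using O_DIRECT, e.g. against a
  # dm_crypt mapping stacked on a dm_zero target. Example setup (placeholders):
  #   dmsetup create zero0 --table "0 2097152 zero"
  #   cryptsetup -c aes-cbc-essiv:sha256 create crypt_zero /dev/mapper/zero0
  import os, mmap, time

  DEV = "/dev/mapper/crypt_zero"   # device under test (placeholder name)
  COUNT = 1000                     # writes per block size

  def probe(bs):
      buf = mmap.mmap(-1, bs)      # page-aligned buffer, as required by O_DIRECT
      buf.write(b"\xaa" * bs)
      fd = os.open(DEV, os.O_WRONLY | os.O_DIRECT | os.O_SYNC)
      try:
          start = time.time()
          for _ in range(COUNT):
              os.lseek(fd, 0, os.SEEK_SET)
              os.write(fd, buf)
          return (time.time() - start) / COUNT * 1000.0   # ms per request
      finally:
          os.close(fd)

  for bs in (4096, 8192, 65536, 1024 * 1024):
      print("%7d bytes: %.3f ms/write" % (bs, probe(bs)))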
 
So there are now several questions.
 
1) Can I force ext3 to submit larger I/Os as well?
2) Is my assumption correct that the cause of this issue is accumulated, serialized latency?
 
e.g.
 
  - ext3 splits the I/O into 4 KB units and submits them to the device mapper
  - the first device mapper target in the stack receives the requests (in this case dm_crypt)
  - dm_crypt encrypts each 4 KB block individually and serially (because of the single workqueue) and submits it to the lower level device (in this case sdf); the added latency accumulates (10 x 4 KB blocks = +1 ms)
  - the sdf queue merges the requests, but not as efficiently anymore (55169 vs. 44052 writes)
  - sdf sends the I/Os to the backend device
 
Step 3 in particular adds significant latency and slows the whole procedure down considerably.
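 
To put rough numbers on it (a worst-case model, assuming none of the added per-request latency overlaps with other work, and using the request counts from the diskstats above):
 
  # Back-of-the-envelope: extra runtime if every write request pays the measured
  # ~0.1 ms dm_crypt latency strictly one after the other (worst case, no overlap).
  added_latency_ms = 0.1

  requests_4k     = 476248   # 4 KB writes seen on dm-1 in run b)
  requests_merged = 44052    # ~34 KB writes seen on sdf in run a)

  for label, n in (("4 KB requests (ext3 on dm_crypt)", requests_4k),
                   ("~34 KB requests (same amount of data)", requests_merged)):
      print("%-40s -> %6.1f s accumulated" % (label, n * added_latency_ms / 1000.0))
 
Even if only a fraction of that really ends up serialized, the 4 KB split multiplies the per-request cost by roughly an order of magnitude compared to larger, merged requests.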
 
Can this be the reason?
 
3) For direct I/O the request size is flexible, so database workloads should not see a major performance impact (~10%) until the per-device CPU limit is hit. Is this a correct assumption?
 
4) Cached I/O can be slowed down considerably by this latency multiplication, even when the average I/O rate is below the CPU limit. Is this a correct assumption?
 
It would be great if you could help me understand this issue :)
 
Regards,
Robert
 
 
 

Kind Regards

Robert Heinzmann

 
