Hello list,

I'm currently testing dm_crypt performance on CentOS 5.4 (AMD64, kernel 2.6.21.7). One part of the evaluation is a simple end-to-end disk benchmark using the kernel source tree. The routine is as follows (a rough sketch follows the list):
0) Create a filesystem on the backend device and mount it
   -- record start time
1) Extract the kernel source (2.6.37)
2) Copy the extracted kernel once
3) Copy the extracted kernel once
4) Copy the extracted kernel once
5) Remove the copy
6) Remove the copy
7) Remove the copy
8) Remove the original
9) Call sync
   -- record end time
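For reference, a rough sketch of this procedure in Python (assumed paths and mount point, not the exact script I use):

    #!/usr/bin/env python
    # Rough sketch of the benchmark loop above. MNT and TARBALL are assumptions;
    # the backend device is expected to be mkfs'ed and mounted on MNT already.
    import subprocess, time

    MNT = "/mnt/test"                       # assumed mount point of the test filesystem
    TARBALL = "/root/linux-2.6.37.tar.bz2"  # assumed location of the kernel tarball

    def run(cmd):
        subprocess.check_call(cmd, shell=True)

    start = time.time()                                     # -- record start time
    run("tar xjf %s -C %s" % (TARBALL, MNT))                # 1) extract the kernel
    for i in (1, 2, 3):                                     # 2)-4) copy the tree three times
        run("cp -a %s/linux-2.6.37 %s/copy%d" % (MNT, MNT, i))
    for i in (1, 2, 3):                                     # 5)-7) remove the copies
        run("rm -rf %s/copy%d" % (MNT, i))
    run("rm -rf %s/linux-2.6.37" % MNT)                     # 8) remove the original
    run("sync")                                             # 9) flush dirty pages
    print("runtime: %.1f s" % (time.time() - start))        # -- record end time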
I do this for a plain ext3 filesystem on /dev/sdf and for ext3 on an encrypted device (I use dm_crypt and truecrypt as backends). With an ext3 filesystem created in step (0), performance drops by ~40% with dm_crypt (the runtime is 40% longer). The CPU load does not increase; only the I/O wait goes up, not the overall CPU usage. To me this looks like a latency issue, because a lot of CPU is still free (actually almost all of it). If I take a /proc/diskstats snapshot before and after the test, I see that the I/O time is much higher with dm_crypt.
a) /proc/diskstats for one run with the plain disk backend:

DISK sdf:  reads=101 rmerge=0 rsect=808 rtimems=6366 writes=44052 wmerge=372381 wsect=3321608 wtimems=4993178 current=158 iotime=36041 iotimeweighted=5006161

b) /proc/diskstats for one run with the dm_crypt backend (dm-1 is the dm_crypt device):

DISK sdf:  reads=110 rmerge=0 rsect=880 rtimems=5084 writes=55169 wmerge=421079 wsect=3809984 wtimems=5275315 current=0 iotime=42127 iotimeweighted=5280393
DISK dm-1: reads=110 rmerge=0 rsect=880 rtimems=13457 writes=476248 wmerge=0 wsect=3809984 wtimems=1467043364 current=0 iotime=42602 iotimeweighted=1467056820
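The before/after counters can be collected with a small helper along these lines (a sketch, not the exact script I use; field layout as in Documentation/iostats.txt):

    # Snapshot the per-device counters from /proc/diskstats and print the
    # difference between two snapshots. The 11 fields after the device name are:
    # reads, rmerge, rsect, rtimems, writes, wmerge, wsect, wtimems, current,
    # iotime, iotimeweighted. Note that "current" (I/Os in flight) is an
    # instantaneous value, not a cumulative counter, so its diff is not meaningful.
    def snapshot(devices):
        stats = {}
        for line in open("/proc/diskstats"):
            fields = line.split()
            if fields[2] in devices:
                stats[fields[2]] = [int(x) for x in fields[3:14]]
        return stats

    before = snapshot(["sdf", "dm-1"])
    # ... run the benchmark here ...
    after = snapshot(["sdf", "dm-1"])
    for dev in sorted(after):
        print("%s: %s" % (dev, [a - b for a, b in zip(after[dev], before[dev])]))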
I also see that the writes are split into 4 KB units:

for a) the average write size is 3281192 / 48726 = 67.3 sectors, i.e. ~34 KB per write
for b) (dm-1 with ext3) the average write size is 3809984 / 476248 = 8 sectors, i.e. 4 KB per write
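For completeness, the same arithmetic as a small helper (512-byte sectors):

    # Average write size in KB from the diskstats counters (sectors are 512 bytes).
    def avg_write_kb(wsect, writes):
        return wsect * 512.0 / writes / 1024

    print(avg_write_kb(3281192, 48726))   # plain sdf run: ~33.7 KB per write
    print(avg_write_kb(3809984, 476248))  # dm-1 run:        4.0 KB per write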
Because this is cached I/O (through the buffer cache), the dm requests are processed in units of 4 KB (this seems to be an implementation-specific thing). These small requests are then merged again in the scheduler of the /dev/sdf backend device. I would not expect this to be such a big issue: I tested the system and it can do ~80 MB/s encryption/decryption, and the backend does ~50 MB/s writes, so I was expecting a performance impact of roughly 10 percent. However, it seems to be much more (40%).
When I run the same test with ext2, I see an average request size of 22 sectors (~11 KB) to the dm device and merges on the backend (46 vs. 42 seconds, ~10% performance impact). With XFS it is the same (95 seconds instead of 87, ~10% performance impact, average I/O size 34 sectors). Only ext3 is fixed at 4 KB requests.
I also measured the latency impact of dm_crypt via a dm_zero backend. It seems to add ~0.1 ms of latency per I/O for small I/Os (4/8 KB), up to 10 ms for 1 MB I/Os. It also looks like dm_crypt only scales to one CPU core per device.
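A minimal probe along these lines reproduces that kind of measurement (a sketch, not the tool I actually used; /dev/mapper/test is an assumed test mapping whose contents get overwritten, and O_SYNC is used so every write waits for completion - O_DIRECT would additionally require aligned buffers):

    import os, time

    DEV = "/dev/mapper/test"   # assumed test mapping, e.g. dm_crypt stacked on dm_zero
    COUNT = 200                # writes per I/O size

    fd = os.open(DEV, os.O_WRONLY | os.O_SYNC)
    for size in (4096, 8192, 65536, 1048576):
        buf = b"\0" * size
        os.lseek(fd, 0, 0)                  # rewind to the start (0 = SEEK_SET)
        start = time.time()
        for i in range(COUNT):
            os.write(fd, buf)               # synchronous write, returns after completion
        avg_ms = (time.time() - start) / COUNT * 1000
        print("%7d bytes: %.2f ms per write" % (size, avg_ms))
    os.close(fd)

Running this against the plain dm_zero mapping and against dm_crypt over dm_zero, and comparing the averages, gives the latency addition per request size.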
So there are now several questions:

1) Can I force ext3 to use larger I/Os as well?
2) Is my assumption correct that the cause of this issue can be accumulated, serialized latency? E.g.:

- ext3 splits I/Os into units of 4 KB and submits them to the device mapper
- the first device-mapper target in the stack receives the requests (in this case dm_crypt)
- dm_crypt encrypts each 4 KB block individually and serially (because of the single workqueue) and submits it to the lower-level device (in this case sdf) - this is where the latency adds up (10 x 4 KB blocks = +1 ms)
- the sdf queue merges the requests, but not as efficiently anymore (55169 vs. 44052 writes)
- sdf sends the I/Os to the backend device
Especially the third step (the serial encryption in dm_crypt) adds significant latency and slows the whole procedure down considerably. Can this be the reason? A rough back-of-the-envelope check is below.
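Assuming the ~0.1 ms per small I/O measured against dm_zero and full serialization (real overlap will hide part of it), the numbers from run (b) give:

    # Back-of-the-envelope check of the accumulated-latency hypothesis
    # (assumed worst case: every request pays the full latency, no overlap).
    extra_per_request_ms = 0.1   # measured latency addition for a small (4 KB) I/O
    writes_to_dm = 476248        # 4 KB writes seen on dm-1 in run (b)
    print("added time if fully serialized: %.0f s"
          % (extra_per_request_ms * writes_to_dm / 1000.0))
    # ~48 s of extra wall-clock time in the fully serialized worst case; even a
    # fraction of that surviving the overlap could account for the slowdown I see.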
3) For direct I/O, the request size is flexible, so database workloads should not see a major performance impact (~10%) until the per-device CPU limit is hit - is this a correct assumption?

4) Cached I/O can be slowed down considerably, even if the average I/O rate is below the CPU limit, due to latency multiplication - is this a correct assumption?
It would be great if you could help me understand this issue :)
Kind Regards,
Robert Heinzmann
_______________________________________________
dm-crypt mailing list
dm-crypt@xxxxxxxx
http://www.saout.de/mailman/listinfo/dm-crypt